Automatic Text Alignment Using Recursive Hapax-Based Cut-Through Fragmentation

Date
2012-03
Authors
Ng, Pek Kuan
Publisher
Universiti Sains Malaysia
Abstract
Communication over the Internet has become a necessity of life, and multilingual machine translation systems are developed to support it. One of the most commonly used approaches is the example-based approach, which requires a large set of examples as reference. These examples are prepared by aligning parallel texts either manually or semi-automatically with human intervention. This is laborious and time-consuming, given the large number of examples needed to ensure translation quality. Moreover, because humans make mistakes and have preferences, consistency becomes an issue. Hence, there is an urgent need for an automatic aligner. Malay is a language with scarce resources in terms of linguistic tools and data, so it is a challenge to develop an alignment system that matches the accuracy of systems built for resource-rich languages. The objective of this research is to design a novel alignment method that uses minimal linguistic resources yet still achieves a reasonable level of accuracy, automating the alignment task and thereby minimizing the effort and time consumed. The methodology involves two phases. The first phase covers linguistic resource preparation and tool enhancement. The second phase covers the design of the alignment algorithm. Our proposed algorithm adapts the concepts of hapax (a word that occurs only once in a text), cut-through, fragmentation and recursion. It recursively cuts the text into smaller fragments based on the hapax alignment, regardless of the logical boundaries of the text, and generates new alignments in each iteration. The cut-through approach is used to eliminate wrong alignments in order to maximize the number of fragments and word alignments. The results show that the proposed hapax-based algorithm performs well, with a precision of 92.62% and a recall of 74.53% for word alignment, despite the major resource constraint.
With such accuracy, human intervention could be minimized, increasing consistency while reducing effort, time and errors.
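The recursive idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's algorithm, only a rough reading of the abstract under stated assumptions: hapaxes are matched through a small hypothetical bilingual `lexicon` (the abstract does not say how hapaxes are paired), and the elimination of wrong alignments is approximated by discarding anchors that cross one another. The function names, the toy sentences and the toy lexicon are all invented for illustration.

```python
from collections import Counter


def hapax_positions(tokens, lo, hi):
    """Map each word that occurs exactly once in tokens[lo:hi] to its index."""
    counts = Counter(tokens[lo:hi])
    return {w: i for i, w in enumerate(tokens[lo:hi], start=lo) if counts[w] == 1}


def monotone_subset(pairs):
    """Keep a longest subset of (src, tgt) index pairs that do not cross.

    Assumption: crossing anchors are treated as wrong alignments and dropped.
    """
    if not pairs:
        return []
    pairs = sorted(pairs)
    best = [1] * len(pairs)   # length of the best chain ending at pair i
    prev = [-1] * len(pairs)  # back-pointer to the previous pair in that chain
    for i in range(len(pairs)):
        for j in range(i):
            if pairs[j][1] < pairs[i][1] and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    i = max(range(len(pairs)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(pairs[i])
        i = prev[i]
    return chain[::-1]


def align(src, tgt, lexicon, s_lo=0, s_hi=None, t_lo=0, t_hi=None):
    """Recursively align src[s_lo:s_hi] with tgt[t_lo:t_hi] using hapax anchors."""
    s_hi = len(src) if s_hi is None else s_hi
    t_hi = len(tgt) if t_hi is None else t_hi
    if s_lo >= s_hi or t_lo >= t_hi:
        return []
    src_hapax = hapax_positions(src, s_lo, s_hi)
    tgt_hapax = hapax_positions(tgt, t_lo, t_hi)
    # Candidate anchors: a source hapax whose lexicon translation is a target hapax.
    candidates = [(i, tgt_hapax[lexicon[w]])
                  for w, i in src_hapax.items() if lexicon.get(w) in tgt_hapax]
    anchors = monotone_subset(candidates)
    if not anchors:
        return []
    links = list(anchors)
    # Cut both texts at the anchors and recurse into each pair of fragments;
    # words repeated in the larger span may be hapaxes inside a fragment.
    bounds = [(s_lo - 1, t_lo - 1)] + anchors + [(s_hi, t_hi)]
    for (s0, t0), (s1, t1) in zip(bounds, bounds[1:]):
        links += align(src, tgt, lexicon, s0 + 1, s1, t0 + 1, t1)
    return sorted(links)


# Toy English-Malay pair and toy lexicon (illustrative only).
src = ["the", "cat", "eats", "fish", "and", "the", "cat", "sleeps"]
tgt = ["kucing", "itu", "makan", "ikan", "dan", "kucing", "itu", "tidur"]
lexicon = {"the": "itu", "cat": "kucing", "eats": "makan",
           "fish": "ikan", "and": "dan", "sleeps": "tidur"}

result = align(src, tgt, lexicon)
print(result)  # [(0, 1), (2, 2), (3, 3), (4, 4), (5, 6), (7, 7)]
```

In this toy run, "the" occurs twice in the full text and so is not a hapax at the top level, but it becomes one inside the fragments produced by the first cut. The crossing pair for "cat" is dropped in both fragments, which shows in miniature how a filter of this kind can favour precision at the cost of recall.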
Keywords
Automatic text alignment using recursive hapax-based, cut-through fragmentation