Natural language analysis in machine translation (MT) based on the String-Tree Correspondence Grammar (STCG)

Date

1994

Authors

Enya Kong, Tang

Abstract

The String-Tree Correspondence Grammar (STCG) [Zaharin 87a] is a grammar formalism for defining:
• a set of strings (a language),
• a set of trees (valid representation/interpretation structures),
• the mapping between the two (to be interpreted for analysis and generation).
The formalism is argued to be a totally declarative grammar formalism that can associate with the strings of a language arbitrary tree structures, chosen by the grammar writer as the linguistic representation structures of those strings. More important is the facility to specify the correspondence between a string and its associated tree in a very natural manner. These features are highly desirable in grammar writing, in particular for the treatment of certain 'non-standard' linguistic phenomena, namely featurisation, lexicalisation and crossed dependencies. Furthermore, a grammar written in this way naturally inherits the desired property of bidirectionality (in fact non-directionality), so that the same grammar can be interpreted for both analysis and generation. In this thesis, we investigate the properties of the STCG for interpretation towards analysis (as understood within the context of Machine Translation (MT)). Other than using STCG grammars as specifications for the automatic generation of analysis programs in the Specialised Languages for Linguistic Programming (SLLPs) of MT systems (a study reported in the Appendix), the work centres around the specification of a general analyser/parser for the STCG. The proposed STCG analyser is capable of mimicking some very useful features of various context-free parsing techniques. One such feature is the use of charts in tabular parsing algorithms, as exemplified in Earley's Algorithm [Earley 70], which is very helpful in avoiding redundancies that may otherwise result in a combinatorial explosion.
Another is the compact representation of the possible parse trees of ambiguous sentences, such as the one seen in the GLR parser [Tomita 87]. We shall also provide a natural way of handling the kind of awkward phenomena mentioned above (namely lexicalisation, featurisation and, worst of all, crossed dependencies) while at the same time retaining much of the efficiency of standard context-free parsing algorithms. The thesis also discusses the treatment of attributes/features in the STCG, which to date has received little attention in the published literature. In general, linguistic rules written in the STCG describe strings of terms carrying all the relevant information one would expect from the result of a morphological analysis and reference to some lexical dictionary, and the associated representation structures are typically the m-structures [Vauquois 78] [Zaharin 87b], which support many levels of interpretation (morpho-syntagmatic, functional, logical, semantic features and relations, etc.). Such a large quantity of information requires a very convenient form of expression and corresponding means of manipulation.

Keywords

Machine translation (MT), String-tree

URI

http://hdl.handle.net/123456789/1015

Collections

Pusat Pengajian Sains Matematik - Tesis
