Semantic Technology and Statistical Machine Translation: are there new perspectives? Part 1.

desperbcn

Dec 15, 2008 18:43

Machine Translation (MT), like many other areas of natural language processing (NLP), needs a fresh look at the prospects of the technology and at the barriers that prevent its further commercialization. In this study we state our view of the future of MT, with the emphasis on Statistical MT (SMT), and give evidence to support it, providing the reader with a survey of the technical problems that cannot be overcome within the limits of the current approach. We believe that the future of MT lies with semantics-based SMT operating at the transfer-based or interlingual level.

Nowadays, the general trend in MT is towards generalizing the abstractions on which a translation system operates. The driving force behind this ongoing paradigm shift is that the state-of-the-art statistical phrase-based approach is limited to the fully lexicalized bilingual units extracted from the parallel training data, even though many purely statistical and hybrid translation systems with greater powers of generalization exist. A representative set of such systems includes: SMT based on a formal grammar representation of the source language, the target language, or both {Yamada:01}; MT that incorporates parse trees or other syntactic information into translation {imamura05}; and SMT based on a hierarchical phrase model that allows multiple levels of generalization {Chiang:05}, recently augmented with syntax {Venugopal:06,Lane:07}. However, all these approaches generalize only at the lexical level, bringing bilingual linguistic or syntactic information into the interlingual mapping. As a result, translation accuracy can fall dramatically when an out-of-domain dataset is translated. Purely generalist approaches, which are hardly applicable to any NLP algorithm, suffer from numerous exceptions that cannot be generalized and that are easy to find in any natural language. The alternatives remain fully theoretical, like the parse tree mapping scheme proposed in {Melamed:04}.

A purely statistical MT system chooses the best translation either on the basis of relative frequencies (phrase-based approach {Koehn:07}) or by conditioning the choice of a translation hypothesis on the surrounding context (N-gram-based approach {marino:2006:CL}). Natural languages are very complicated semantic systems: in many cases, a word in the source language does not mean exactly the same as its closest counterpart in the target language. In other words, the semantic "spots" of almost every word in the source and target languages do not coincide. For example, the Russian word "совпадать" can be translated into English as "coincide", "concur" or "agree", depending on the particular sentence in which it occurs. Moreover, a professional interpreter decides on a translation based not only on the subject of the phrase but also on additional in-domain knowledge, which may be contained in the preceding context.
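The difference between the two selection strategies can be illustrated with a minimal Python sketch. The observations below are invented toy data, not real corpus counts, and the single "context word" is a deliberately simplified stand-in for the richer context an actual N-gram-based system conditions on. The function names are hypothetical and chosen only for this illustration.

```python
from collections import Counter

# Invented toy data: (source word, context word, observed translation).
observations = [
    ("совпадать", "dates", "coincide"),
    ("совпадать", "dates", "coincide"),
    ("совпадать", "events", "coincide"),
    ("совпадать", "opinions", "agree"),
    ("совпадать", "opinions", "agree"),
    ("совпадать", "views", "concur"),
]


def relative_frequencies(obs):
    """Estimate p(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter((s, t) for s, _, t in obs)
    source_counts = Counter(s for s, _, _ in obs)
    return {(s, t): n / source_counts[s] for (s, t), n in pair_counts.items()}


def context_frequencies(obs):
    """Estimate p(target | source, context) by counting within each context."""
    triple_counts = Counter(obs)
    context_counts = Counter((s, c) for s, c, _ in obs)
    return {
        (s, c, t): n / context_counts[(s, c)]
        for (s, c, t), n in triple_counts.items()
    }


def best_translation(source, obs, context=None):
    """Pick the most probable translation, optionally conditioned on context.

    Returns None when the source word (or the source/context pair) is unseen.
    """
    if context is None:
        probs = relative_frequencies(obs)
        candidates = {t: p for (s, t), p in probs.items() if s == source}
    else:
        probs = context_frequencies(obs)
        candidates = {
            t: p for (s, c, t), p in probs.items() if s == source and c == context
        }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

Without context the sketch always returns the globally most frequent candidate, whereas conditioning on the context word lets a less frequent translation win where it fits. An unseen context yields no candidate at all, which is a small-scale version of the out-of-domain problem.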

Word homonymy and polysemy are another problem that complicates MT. In state-of-the-art SMT systems, word disambiguation is implemented within the translation engine by conditioning the lexical choice on the context and on a set of feature functions. Some systems implement word disambiguation as an external module that analyzes linguistic dependencies to support lexical choice prediction, as shown in {carpuat05}; others use an additional monolingual corpus to enhance disambiguation (corpus-based approach; see {Miangah05} for an example) or exploit a set of disambiguation rules (knowledge-based approach; an example can be found in {Specia_exploitingrules}). Some SMT systems use target-side Part-of-Speech tags as a supporting model during decoding {marino06tcstar}, while the state-of-the-art MOSES-based model treats Part-of-Speech tag and word lemma models as factors of a factored phrase-based SMT model, as described in the factored model tutorial (http://www.statmt.org/moses/?n=Moses.FactoredTutorial).
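To make the idea of factors concrete, here is a small Python sketch of a factored token representation. Factored corpora in the Moses style attach the factors to each token with a `|` separator. The three-factor layout used here (surface form, lemma, Part-of-Speech tag), the example sentence and the helper names are assumptions made for illustration, not a description of any particular system's configuration.

```python
from typing import List, NamedTuple


class FactoredToken(NamedTuple):
    """One word with the factors assumed in this sketch."""
    surface: str
    lemma: str
    pos: str


def parse_factored(line: str, sep: str = "|") -> List[FactoredToken]:
    """Split a whitespace-tokenized line of sep-joined factors into tokens."""
    tokens = []
    for raw in line.split():
        parts = raw.split(sep)
        if len(parts) != 3:
            raise ValueError(f"expected 3 factors, got {len(parts)}: {raw!r}")
        tokens.append(FactoredToken(*parts))
    return tokens


def project(tokens: List[FactoredToken], factor: str) -> List[str]:
    """Extract a single factor stream, e.g. the Part-of-Speech tag sequence."""
    return [getattr(token, factor) for token in tokens]


# Hypothetical example sentence with surface|lemma|pos factors.
sentence = parse_factored("the|the|DT houses|house|NNS are|be|VBP small|small|JJ")
pos_sequence = project(sentence, "pos")
lemma_sequence = project(sentence, "lemma")
```

Each projected stream can then feed its own model: for instance, a language model over the tag sequence alongside the usual model over surface forms.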

Generally, any professional interpreter will confirm that a good translation is not just a literal rendering of words and expressions. A good translation transfers thoughts, concepts, images and a human vision of reality, which is strongly influenced by personal and cultural experience. That is why we see the future of MT in systems based on semantic transfer or on interlingual MT. This concept sits at the top of the MT pyramid {Hutchins92} and provides the best possible transfer of information between languages. However, until now it has been considered an extremely challenging and difficult approach.
