Exploiting Linguistic and Statistical Knowledge in a Text Alignment System

Title: Exploiting Linguistic and Statistical Knowledge in a Text Alignment System
Authors: Schrader, Bettina
Thesis advisor: Prof. Dr. Peter Bosch; Dr. habil. Helmar Gust; Prof. Dr. Stefan Evert
Thesis referee: Prof. Dr. Peter Bosch; Dr. habil. Helmar Gust; Prof. Dr. Stefan Evert; Prof. Dr. Martin Volk
Abstract: In machine translation, the alignment of corpora has evolved into a mature research area, aimed at providing training data for statistical or example-based machine translation systems. Moreover, the alignment information can be used for a variety of other purposes, including lexicography and the induction of tools for natural language processing. The alignment techniques used for these purposes fall roughly into two separate classes: sentence alignment approaches that often combine statistical and linguistic information, and word alignment models that are dominated by the statistical machine translation paradigm. Alignment approaches that use linguistic knowledge provided by corpus annotation are rare, as are non-statistical word alignment strategies. Furthermore, parallel corpora are typically not aligned at all text levels simultaneously. Rather, a corpus is first sentence aligned, and in a subsequent step, the alignment information is refined to go below the sentence level. In this thesis, the distinction between the two alignment classes is abandoned. Instead, a system is introduced that can simultaneously align at the paragraph, sentence, word, and phrase level, and that can combine linguistic as well as statistical information. This combination of alignment cues from different knowledge sources, as well as the combination of the sentence and word alignment tasks, is made possible by the development of a modular alignment platform. Its main features are that it supports different kinds of linguistic corpus annotation and that it aligns a corpus hierarchically, such that sentence and word alignments are cohesive. Alignment cues are not used within a global alignment model. Rather, different sub-models can be implemented and allowed to interact. Most of the alignment modules of the system have been implemented on the basis of empirical corpus studies, aimed at showing how the most common types of corpus annotation can be exploited for the alignment task.
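The two ideas in the abstract, interacting cue sub-models and hierarchical (cohesive) alignment, can be illustrated with a minimal sketch. This is not the thesis system: the cue functions `length_cue` and `prefix_cue`, the averaging of cue scores, the threshold, and the toy German-English data are all assumptions made for illustration only. The sketch shows word links being proposed only inside sentence pairs that are already linked, so that word and sentence alignments cannot contradict each other.

```python
# Illustrative sketch only; all names, cues, and data are hypothetical.

def length_cue(src, tgt):
    """Score in [0, 1]: ratio of the shorter to the longer word length."""
    return min(len(src), len(tgt)) / max(len(src), len(tgt))

def prefix_cue(src, tgt):
    """Score in [0, 1]: shared lowercase prefix length over the longer word."""
    a, b = src.lower(), tgt.lower()
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))

def align_words(src_sents, tgt_sents, sentence_links, cues, threshold):
    """Propose word links only within already linked sentence pairs.

    Each cue is an independent sub-model; their scores are averaged here,
    which is one simple way to let them interact (an assumption).
    """
    links = []
    for s_id, t_id in sentence_links:
        for i, src_word in enumerate(src_sents[s_id]):
            best = None
            for j, tgt_word in enumerate(tgt_sents[t_id]):
                score = sum(cue(src_word, tgt_word) for cue in cues) / len(cues)
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, j)
            if best is not None:
                links.append(((s_id, i), (t_id, best[1])))
    return links

# Toy parallel corpus (hypothetical), already sentence aligned.
src_sents = [["das", "Haus"], ["ein", "Problem"]]
tgt_sents = [["the", "house"], ["a", "problem"]]
sentence_links = [(0, 0), (1, 1)]

word_links = align_words(
    src_sents, tgt_sents, sentence_links,
    cues=[length_cue, prefix_cue], threshold=0.4,
)
print(word_links)
```

Because candidate word pairs are only drawn from linked sentences, every word link is cohesive with the sentence alignment by construction. Further cues, for example ones based on part-of-speech or lemma annotation, could be added to the `cues` list without changing the search loop.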
URL: https://repositorium.ub.uni-osnabrueck.de/handle/urn:nbn:de:gbv:700-2009022517
Subject Keywords: Computational linguistics; Machine translation; Corpus linguistics; Word alignment; Sentence alignment
Issue Date: 20-Feb-2009
Type of publication: Dissertation or habilitation [doctoralThesis]
Appears in Collections: FB08 - E-Dissertationen

Files in This Item:
File                     Description          Size     Format
E-Diss853_thesis.tar.gz                       1.53 MB  GZIP
E-Diss853_thesis.pdf     Presentation format  1.27 MB  Adobe PDF

Items in osnaDocs repository are protected by copyright, with all rights reserved, unless otherwise indicated.