The success of statistical machine translation systems such as Moses, Language Weaver, and Google Translate has shown that it is possible to build high-performance machine translation systems with relatively little effort using statistical learning techniques.
This course presents the basic modeling behind statistical machine translation in a concise way. Participants will also learn how to use Moses, an open-source toolkit for machine translation.
Email Address: SubstituteMyLastName@cis.uni-muenchen.de
DFG Project: Models of Morphosyntax for Statistical Machine Translation
October 10th | Part 6: Translating to morphologically rich languages: a case study on German | PowerPoint slides
October 10th | Part 5: Advanced topics in SMT: discriminative bitext alignment, morphological processing, syntax | PowerPoint slides
October 9th | Part 4: Log-linear models for SMT and minimum error rate training (see the formula sketch after the schedule) | PowerPoint slides
October 8th | Part 3: Phrase-based models and decoding (automatically translating a text given an already learned model) | PowerPoint slides
October 7th | Part 2: Bitext alignment (extracting lexical knowledge from parallel corpora) | PowerPoint slides
October 7th | Part 1: Introduction, basics of statistical machine translation (SMT), evaluation of MT | PowerPoint slides
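As a quick preview of Part 4: the decoder scores each candidate translation e of a source sentence f with a log-linear model, a weighted sum of feature functions h_m (for example translation model, language model, reordering, and word penalty scores); minimum error rate training is the procedure that sets the weights lambda_m. In LaTeX notation:

\hat{e} = \operatorname*{arg\,max}_{e} \sum_{m=1}^{M} \lambda_m \, h_m(e, f)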
Further literature:
Philipp Koehn's book Statistical Machine Translation
Kevin Knight's tutorial on SMT (in particular, look at IBM Model 1; a toy implementation sketch follows this list)
The Koehn and Knight compound splitting paper; you may also want to look at Fritzinger and Fraser.
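Since IBM Model 1 comes up both in Part 2 and in the Knight tutorial, here is a minimal EM training sketch in Python. This is a toy illustration of the textbook algorithm (uniform initialization, a NULL source word), not the GIZA++ implementation you will actually run:

from collections import defaultdict

def train_ibm_model1(bitext, iterations=10):
    """Estimate word translation probabilities t(e|f) with EM.
    bitext is a list of (foreign_tokens, english_tokens) pairs; a NULL
    token is prepended to each foreign sentence so that English words
    can stay unaligned."""
    t = defaultdict(lambda: 1.0)  # uniform (unnormalized) initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected pair counts c(e, f)
        total = defaultdict(float)  # expected counts c(f)
        for f_sent, e_sent in bitext:
            f_sent = ["NULL"] + f_sent
            for e in e_sent:
                # E-step: spread one count for e over all source words.
                z = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float)
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

# Tiny toy bitext; over iterations the probabilities sharpen toward the right pairs.
bitext = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"]),
          (["ein", "buch"], ["a", "book"])]
t = train_ibm_model1(bitext)
print(sorted(t.items(), key=lambda kv: -kv[1])[:4])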
new release of the small German data set (with a better trigram language model)
(UPDATED) 50,000 sentences of German/English with trigram language model
BROKEN, old German orthography ("alte Rechtschreibung"): 50,000 sentences of German/English with trigram language model
Original config.toy from Moses
mteval-v13a.pl (replace the one in MOSES-1.0 with this one!)
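mteval-v13a.pl is the NIST scoring script that computes BLEU. As a rough illustration of what it measures (this toy Python sketch ignores the script's tokenization, casing, and multi-reference handling), corpus-level BLEU is the geometric mean of modified n-gram precisions times a brevity penalty:

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Toy corpus-level BLEU with one reference per segment."""
    match = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(match) == 0:  # no smoothing: any empty order gives BLEU = 0
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

hyp = ["the black cat sat on the mat".split()]
ref = ["the black cat sat on a mat".split()]
print(round(corpus_bleu(hyp, ref), 3))  # about 0.64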
Also install imagemagick and libxml-twig-perl (these are the install commands for Ubuntu):
sudo apt-get install imagemagick
sudo apt-get install libxml-twig-perl
One final note on using experiment.perl: this configuration file skips tuning (minimum error rate training). Tuning is time-consuming because the decoder is run repeatedly. The configuration file instead uses precomputed weights, which I have verified work well for the 50k Europarl dataset.
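For orientation only: in Moses EMS configuration files, skipping MERT in favor of fixed weights is normally done in the [TUNING] section by pointing at an ini file that already contains the weights. The lines below are a sketch, not a quote from the shipped config.toy; the path is a placeholder, and you should check the setting name against your copy:

[TUNING]
# instead of running MERT, reuse a moses.ini with precomputed weights
# (placeholder path; adjust to your working directory)
weight-config = /path/to/precomputed-weights.ini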
german_text.tok.vcb for compound splitting
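This vocabulary file supplies the word frequencies that frequency-based compound splitting needs. Assuming it follows the GIZA++ vocabulary layout (one "id word count" triple per line; check the file itself), here is a minimal Python sketch of the Koehn and Knight criterion: among all segmentations into known words (including leaving the word unsplit, and allowing the filler letters s/es between parts), pick the one with the highest geometric mean of part frequencies:

import math

def load_vocab(path):
    """Read word frequencies from a GIZA++-style .vcb file; the assumed
    line format is "<id> <word> <count>"."""
    freq = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) == 3:
                freq[fields[1]] = int(fields[2])
    return freq

def segmentations(word, freq, min_part=3, fillers=("s", "es")):
    """Yield every segmentation of word into parts that occur in freq,
    optionally dropping a filler letter sequence between parts."""
    if word in freq:
        yield [word]
    for i in range(min_part, len(word) - min_part + 1):
        first, rest = word[:i], word[i:]
        if first not in freq:
            continue
        tails = [rest] + [rest[len(f):] for f in fillers if rest.startswith(f)]
        for tail in tails:
            for parts in segmentations(tail, freq, min_part, fillers):
                yield [first] + parts

def split_compound(word, freq):
    """Pick the segmentation (possibly the unsplit word) maximizing the
    geometric mean of the parts' corpus frequencies."""
    best_parts, best_score = [word], float(freq.get(word, 0))
    for parts in segmentations(word, freq):
        score = math.exp(sum(math.log(freq[p]) for p in parts) / len(parts))
        if score > best_score:
            best_parts, best_score = parts, score
    return best_parts

# Hypothetical toy counts; with real data use load_vocab("german_text.tok.vcb").
freq = {"aktion": 800, "plan": 1000, "aktionsplan": 5}
print(split_compound("aktionsplan", freq))  # -> ['aktion', 'plan']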