EMA 2008 Statistical Machine Translation - Assignment 1
Alex Fraser
Data and pseudo-code from Philipp Koehn's forthcoming SMT textbook.
Implement the EM algorithm for IBM Model 1.
Start with the small toy set and then work on your choice of German/English or French/English:
Your program should output two different things:
- A table containing the word translation probabilities that were learned (note: think of an efficient data structure for such a sparse matrix)
- The most likely alignment (the Viterbi alignment) for each sentence pair in the training data
Pseudo-code of EM for IBM Model 1:
initialize t(e|f) uniformly
do until convergence
set count(e|f) to 0 for all e,f
set total(f) to 0 for all f
for all sentence pairs (e_s,f_s)
set total_s(e) = 0 for all e
for all words e in e_s
for all words f in f_s
total_s(e) += t(e|f)
for all words e in e_s
for all words f in f_s
count(e|f) += t(e|f) / total_s(e)
total(f) += t(e|f) / total_s(e)
for all f
for all e
t(e|f) = count(e|f) / total(f)
Study Questions
- Word alignments are usually calculated over lowercased data. Compare your alignments with mixed case versus lowercase. Do you seen an improvement? Where?
- How are non-compositional phrases aligned, do you seen any problems?
- Generate an alignment in the opposite direction (e.g. swap the English and French files (or English and German) and generate another alignment). Does one direction seem to work well to you?
- Look for the longest English token, and the longest French or German token. Are they aligned well? Why?
Advanced Study Questions
- Implement union and intersection of the two alignments you have generated. What are the differences between them? Consider the longest tokens again, is there an improvement?
- Calculate p(e|f) for each sentence
- Are all cognates aligned correctly? How could we force them to be aligned correctly?
- The Porter stemmer is a simple widely available tool for reducing English morphology, e.g., mapping a plural variant and singular variant to the same token. Download a porter stemmer package from the web and apply it to the English. Then compare an alignment with porter stemming versus one without. What are the differences?