EMA 2008 Statistical Machine Translation - Assignment 1

Alex Fraser

Data and pseudo-code from Philipp Koehn's forthcoming SMT textbook.

Implement the EM algorithm for IBM Model 1.

Start with the small toy set and then work on your choice of German/English or French/English:

Toy 6 short sentences
French--English: 1000 sentences, 292 short sentences
German--English: 1000 sentences, 292 short sentences

Your program should output two different things:

A table containing the word translation probabilities that were learned (note: think of an efficient data structure for such a sparse matrix)
The most likely alignment (the Viterbi alignment) for each sentence pair in the training data

Pseudo-code of EM for IBM Model 1:

 initialize t(e|f) uniformly
 do until convergence
   set count(e|f) to 0 for all e,f
   set total(f) to 0 for all f
   for all sentence pairs (e_s,f_s)
     set total_s(e) = 0 for all e
     for all words e in e_s
       for all words f in f_s
         total_s(e) += t(e|f)
     for all words e in e_s
       for all words f in f_s
         count(e|f) += t(e|f) / total_s(e)
         total(f)   += t(e|f) / total_s(e)
   for all f
     for all e
       t(e|f) = count(e|f) / total(f)

Study Questions

Word alignments are usually calculated over lowercased data. Compare your alignments with mixed case versus lowercase. Do you seen an improvement? Where?
How are non-compositional phrases aligned, do you seen any problems?
Generate an alignment in the opposite direction (e.g. swap the English and French files (or English and German) and generate another alignment). Does one direction seem to work well to you?
Look for the longest English token, and the longest French or German token. Are they aligned well? Why?

Advanced Study Questions

Implement union and intersection of the two alignments you have generated. What are the differences between them? Consider the longest tokens again, is there an improvement?
Calculate p(e|f) for each sentence
Are all cognates aligned correctly? How could we force them to be aligned correctly?
The Porter stemmer is a simple widely available tool for reducing English morphology, e.g., mapping a plural variant and singular variant to the same token. Download a porter stemmer package from the web and apply it to the English. Then compare an alignment with porter stemming versus one without. What are the differences?