Assignment 2 (OmegaT and IBM Model 1, no programming)
Alex Fraser
NSSNLP, University of Kathmandu
You must do both parts:
First part - do a small translation job using OmegaT
Second part - do a basic exercise and answer some basic questions about Model1 (and, optionally, some harder questions)
OmegaT
- Download OmegaT from omegat.org
- Create a new project (see the "instant start" guide, Chapter 2 of the OmegaT manual; a web search will turn up a direct link) and call it "mytest" (without quotes). Set the source language to EN-US or EN-GB, depending on whether you prefer to write in American or British English. Set the target language to the two-letter code for your language. Make a note of where the project was created (the path on disk).
- Go to the main directory of the project, then to its source subdirectory, and create a text file called "text1.txt" containing 5 sentences in English. Make sure to use proper punctuation; OmegaT knows how to segment English sentences.
- Run OmegaT, and load the project. You should see the 5 sentences, which are queued up for translation. Click on the target part of each one, and enter the translation in your language.
- Select "generate translations" (the hotkey is control-D) to get OmegaT to output its database of translations to the target subdirectory.
- Save and Exit OmegaT
- The results of your work are stored in the "target" subdirectory, using the same filename. Check the file there to make sure that the output looks OK.
- Go back to the source subdirectory of the project and create another text file, "text2.txt". For the first sentence, take the same first English sentence as you used before (i.e., the first sentence in text1.txt). Then add 3 new sentences. These should be similar to sentences two to four in the first file, with just one word changed per sentence.
- Run OmegaT, and load the project. You should see the 4 sentences. The first sentence should be an exact match. Accept this. Then click on the second sentence. You should see a "fuzzy match" to the right. Use right click to get to "Replace translation with match". Then edit it. Finish editing these sentences.
- Select "generate translations" (the hotkey is control-D) to get OmegaT to output its database of translations to the target subdirectory
- Save and Exit OmegaT
- IMPORTANT: look at the mytest-omegat.tmx file located in the main project directory and describe its contents. What is this file for? How should you modify it if you switch language directions (translating from your language into English)? How much support for segmentation and fuzzy matching is there in your language (see the OmegaT manual)? Compare this with the support for segmentation and fuzzy matching in English.
- Turn in a short text answering these questions along with your source and target text files.
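If you would rather inspect the TMX file programmatically than in a text editor, the following is a minimal Python sketch that lists the translation units in a TMX file. It assumes the usual TMX layout of `tu` elements containing one `tuv` per language, each with a `seg` child, and it accepts either a `lang` or an `xml:lang` attribute on `tuv`. The function name `read_tmx` and the `SAMPLE` data are made up for illustration; the file OmegaT writes for your project may contain additional elements and attributes, so check the result against the actual file.

```python
import io
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute as ElementTree reports it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(source):
    """Return a list of {language: segment text} dicts, one per <tu>.

    `source` may be a file path or a binary file object.
    """
    root = ET.parse(source).getroot()
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Accept either attribute spelling for the language code.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            unit[lang] = "".join(seg.itertext()) if seg is not None else ""
        units.append(unit)
    return units

# Made-up sample for illustration only (not real OmegaT output).
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<tmx version="1.1"><header srclang="EN-US"/><body>'
    b'<tu><tuv lang="EN-US"><seg>This is a test.</seg></tuv>'
    b'<tuv lang="DE"><seg>Das ist ein Test.</seg></tuv></tu>'
    b'</body></tmx>'
)

for unit in read_tmx(io.BytesIO(SAMPLE)):
    print(unit)
```

To run it on your own project, pass the path to the file instead of the sample, for example `read_tmx("mytest/mytest-omegat.tmx")`, where the directory is whatever path you noted when creating the project.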
Model 1
Pseudo-code from Philipp Koehn's book.
Pseudo-code of EM for IBM Model 1:
initialize t(e|f) uniformly
do until convergence
    set count(e|f) to 0 for all e,f
    set total(f) to 0 for all f
    for all sentence pairs (e_s,f_s)
        set total_s(e) = 0 for all e
        for all words e in e_s
            for all words f in f_s
                total_s(e) += t(e|f)
        for all words e in e_s
            for all words f in f_s
                count(e|f) += t(e|f) / total_s(e)
                total(f) += t(e|f) / total_s(e)
    for all f
        for all e
            t(e|f) = count(e|f) / total(f)
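Although this assignment requires no programming, it can help to see the pseudo-code as something runnable. The following is a minimal Python sketch that mirrors the pseudo-code step by step. Several things are assumptions of the sketch rather than part of the pseudo-code: the function name `train_model1`, the use of a fixed number of iterations in place of a convergence test, storing t as a dictionary keyed by (e, f), and the three-pair toy corpus `PAIRS`.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM for IBM Model 1. `pairs` is a list of (e_s, f_s) word lists."""
    e_vocab = sorted({e for e_s, _ in pairs for e in e_s})
    f_vocab = sorted({f for _, f_s in pairs for f in f_s})

    # initialize t(e|f) uniformly: 1/N, with N the number of English types
    t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

    # Assumption: a fixed iteration count stands in for "until convergence".
    for _ in range(iterations):
        count = defaultdict(float)   # count(e|f)
        total = defaultdict(float)   # total(f)
        for e_s, f_s in pairs:
            # normalization term total_s(e) for this sentence pair
            total_s = defaultdict(float)
            for e in e_s:
                for f in f_s:
                    total_s[e] += t[(e, f)]
            # collect fractional counts
            for e in e_s:
                for f in f_s:
                    count[(e, f)] += t[(e, f)] / total_s[e]
                    total[f] += t[(e, f)] / total_s[e]
        # re-estimate t(e|f)
        for f in f_vocab:
            for e in e_vocab:
                t[(e, f)] = count[(e, f)] / total[f]
    return t

# Toy corpus, made up for illustration.
PAIRS = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]

t = train_model1(PAIRS, iterations=10)
for (e, f), p in sorted(t.items()):
    print(f"t({e}|{f}) = {p:.4f}")
```

Note that the sketch allocates an entry of t for every pair of an English type and a Foreign type, whether or not the two words ever occur in the same sentence pair.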
Basic Exercise
Start by convincing yourself that the incredibly simple estimation you do by running the main loop of the pseudo-code once gives the same results as explicitly enumerating the alignments in slide 41 (the slide where we calculated counts by working on four alignment functions by explicitly enumerating each one). You have to start with the t values on slide 41 to do this, and you apply them to just the pair of two word sentences on slide 41. Please turn this in.
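If you want to double-check your hand calculation, the comparison can be sketched in Python as follows. The sketch computes the expected counts for one sentence pair in two ways: by enumerating every alignment function (each English word linked to exactly one Foreign word, as in the pseudo-code) and by the per-word normalization used in the main loop. The t values below are hypothetical and are NOT the values from slide 41; substitute the slide's values yourself. The function names are also made up for this sketch.

```python
from itertools import product

def counts_by_enumeration(e_s, f_s, t):
    """Expected counts from explicitly enumerating all alignment functions."""
    # An alignment assigns one Foreign position to each English position.
    alignments = list(product(range(len(f_s)), repeat=len(e_s)))
    weights = []
    for a in alignments:
        w = 1.0
        for j, i in enumerate(a):
            w *= t[(e_s[j], f_s[i])]
        weights.append(w)
    z = sum(weights)  # normalizer over all alignments
    counts = {}
    for a, w in zip(alignments, weights):
        for j, i in enumerate(a):
            key = (e_s[j], f_s[i])
            counts[key] = counts.get(key, 0.0) + w / z
    return counts

def counts_by_pseudocode(e_s, f_s, t):
    """Expected counts as collected in the main loop of the pseudo-code."""
    counts = {}
    for e in e_s:
        total_s = sum(t[(e, f)] for f in f_s)
        for f in f_s:
            counts[(e, f)] = counts.get((e, f), 0.0) + t[(e, f)] / total_s
    return counts

# Hypothetical sentence pair and t values (not those of slide 41).
E_S = ["the", "house"]
F_S = ["das", "Haus"]
T = {
    ("the", "das"): 0.6, ("the", "Haus"): 0.2,
    ("house", "das"): 0.3, ("house", "Haus"): 0.7,
}

enumerated = counts_by_enumeration(E_S, F_S, T)
efficient = counts_by_pseudocode(E_S, F_S, T)
for key in sorted(enumerated):
    print(key, round(enumerated[key], 6), round(efficient[key], 6))
```

For two-word sentences the enumeration covers four alignment functions. The point of the exercise is to see why the two columns agree, so work through the algebra by hand as well.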
Basic Questions about Model 1
- What is the alignment structure modeled by IBM Model 1 in the pseudo-code presented above? Is the structure symmetric with respect to English and Foreign?
- How many entries does t(e|f) have after the initialization (line 1 of the pseudo-code)?
- Can you think of a way to initialize that would involve setting some of the parameters in t(e|f) to zero or any other constant without affecting the results? Remember that if N is the number of English types, then uniform initialization sets t(e|f)=1/N for all e and f. Think about whether any of the entries in t will never be used.
- Under what conditions will an English word e in a particular sentence pair be left unaligned in the Viterbi alignment? What about a Foreign word f?
- Under what circumstances would we prefer that an English word e is unaligned (note that this question is about a gold standard alignment)?
Advanced Questions about Model 1 (Optional)
- Suppose you are given Model 1 parameters estimated by someone else. What is a short formula which determines the Viterbi alignment of a fixed sentence pair E and F?
- How could we force cognates (for a language pair like French/English) to be aligned correctly? (Warning: this is a trick question.)
- Is there some simple way (either heuristically or by modifying the model; either one is fine) where we could break the independence assumption in Model 1 and allow the alignment of a word at position j to be influenced by the word at position j-1 (of the Foreign side)?
- Look at the "grow" heuristic in the slides. If you know this will be used on a pair of 1-to-N and M-to-1 alignments, is it possible to systematically remove links from one of these alignments (for the sake of discussion, assume the M-to-1 alignment) without affecting the final symmetrized alignment?