Natural Language Processing, Fall 2014
HW5 - IBM Model 1
Due Fri Dec 5 midnight on BB
----------------------------

In this HW you will implement the EM algorithm for IBM Model 1. You will use this dataset:

    http://acl.cs.qc.edu/~lhuang/teaching/nlp/hw5-ibm1/data.tgz

which contains the following bitexts:

1) a toy French-English corpus (2 sentence pairs):       toy.fr toy.en
2) a tiny Spanish-English corpus (12 sentence pairs):    tiny.es tiny.en
3) a small Spanish-English corpus (5000 sentence pairs): es-en.5000.es.tok es-en.5000.en.tok

Before you start, it would be helpful to add the following lines to the beginning of your Python code:

    #!/usr/bin/env python
    #coding=utf8
    from __future__ import division
    from collections import defaultdict
    import sys

1. Implement the EM algorithm, which has the following command line:

       paste -d "\n" toy.fr toy.en | ./ibm.py 5

   Here "5" is the number of iterations. You can refer to the pseudocode from the slides; a rough Python sketch is also included after Problem 3 below. For simplicity, you do not need to insert "NULL". Without NULL, you should get something like this in the en->fr direction:

       iter #5 -------------------------------------------
       cat -> chat: 0.76 le: 0.24
       dog -> chien: 0.76 le: 0.24
       the -> le: 0.84 chat: 0.08 chien: 0.08

2. Run it on the tiny 12-sentence-pair example in both the es->en and en->es directions. You should get something similar to this after 10 iterations (es->en direction):

       iter #10 -------------------------------------------
       . -> .: 0.97 the: 0.01
       Carlos -> Carlos: 0.81 three: 0.09 has: 0.03 Garcia: 0.03 associates: 0.03
       Europa -> Europe: 0.41 in: 0.41 its: 0.11 groups: 0.05 are: 0.01
       Garcia -> Garcia: 0.93 .: 0.05
       asociados -> associates: 0.94 .: 0.06
       clientes -> clients: 0.91 are: 0.06 .: 0.01 its: 0.01
       empresa -> company: 0.79 has: 0.17 .: 0.03
       en -> Europe: 0.41 in: 0.41 its: 0.11 groups: 0.05 are: 0.01
       enemigos -> enemies: 0.80 the: 0.07 clients: 0.06 and: 0.05 are: 0.01
       enfadados -> angry: 0.82 are: 0.15 .: 0.02 its: 0.01
       estan -> are: 0.62 its: 0.24 angry: 0.07 .: 0.07
       fuertes -> strong: 0.75 his: 0.23
       grupos -> groups: 0.86 the: 0.07 .: 0.07
       la -> the: 0.69 company: 0.11 three: 0.10 groups: 0.09 has: 0.01
       los -> the: 0.92 .: 0.06 are: 0.02
       medicinas -> pharmaceuticals: 0.75 sell: 0.09 modern: 0.08 strong: 0.07
       modernos -> modern: 0.83 groups: 0.11 the: 0.03 .: 0.01
       no -> not: 0.93 .: 0.03 his: 0.02
       pequenos -> small: 0.80 modern: 0.10 are: 0.05 not: 0.02 groups: 0.02
       son -> are: 0.83 his: 0.05 .: 0.04 not: 0.04 associates: 0.02 the: 0.01
       sus -> are: 0.42 its: 0.38 .: 0.12 his: 0.08
       tambien -> also: 0.96 .: 0.02
       tiene -> has: 0.79 .: 0.09 three: 0.05 company: 0.05 Garcia: 0.02
       tres -> three: 0.79 has: 0.18 .: 0.02
       una -> a: 0.80 company: 0.09 also: 0.04 Garcia: 0.03 has: 0.03
       venden -> sell: 0.82 groups: 0.07 do: 0.03 zenzanine: 0.03 the: 0.02 .: 0.01
       y -> and: 0.90 associates: 0.08 .: 0.02
       zanzanina -> do: 0.49 zenzanine: 0.49 sell: 0.02

   While most entries are reasonable, the following look bad:

       Europa -> Europe: 0.41 in: 0.41 ...
       en -> Europe: 0.41 in: 0.41 ...

   Explain why. Give another bad example. Then add a few example sentence pairs to the data to fix the Europa problem.

3. Run it on the 5000-sentence-pair data:

       paste -d "\n" es-en.5000.es.tok es-en.5000.en.tok | head -1000 | ./ibm.py 10

   Here "head -1000" means "take the first 500 sentence pairs", since each pair occupies two lines after paste. Gradually increase the data from the first 500 pairs to 1000, 2500, and 5000 pairs. Use specific examples to demonstrate that

   (1) the probabilities get better with more data;
   (2) the probabilities get better with more iterations;
   (3) some words are just bad in this dataset no matter what. Explain why.
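For reference on Problems 1-3, here is a minimal sketch of the training loop. It is only one way to set things up, not the required implementation: the helper names (read_pairs, train), the uniform initialization over co-occurring word pairs, the 0.01 cutoff when printing, and the fact that it only prints the final table (rather than one table per iteration, as in the samples above) are all choices of this sketch.

    #!/usr/bin/env python
    #coding=utf8
    from __future__ import division
    from collections import defaultdict
    import sys

    def read_pairs(stream):
        # paste -d "\n" interleaves the two files, so consecutive lines form one pair:
        # the first line of each pair is the generated side (f), the second the conditioning side (e)
        lines = [line.split() for line in stream]
        return list(zip(lines[0::2], lines[1::2]))

    def train(pairs, iterations):
        # t[(f, e)] ~ P(f | e); initialize uniformly over co-occurring words (a simplification)
        cooc = defaultdict(set)
        for fs, es in pairs:
            for e in es:
                cooc[e].update(fs)
        t = {}
        for e, fset in cooc.items():
            for f in fset:
                t[f, e] = 1 / len(fset)

        for it in range(iterations):
            count = defaultdict(float)   # expected counts c(f, e)
            total = defaultdict(float)   # expected counts c(e)
            for fs, es in pairs:
                for f in fs:
                    # E-step: split one count for f over all e's in this sentence pair
                    z = sum(t[f, e] for e in es)
                    for e in es:
                        delta = t[f, e] / z
                        count[f, e] += delta
                        total[e] += delta
            # M-step: renormalize the expected counts
            for (f, e) in count:
                t[f, e] = count[f, e] / total[e]
        return t

    if __name__ == "__main__":
        iterations = int(sys.argv[1])
        pairs = read_pairs(sys.stdin)
        t = train(pairs, iterations)
        # print each conditioning-side word with its translations, best first
        table = defaultdict(list)
        for (f, e), p in t.items():
            table[e].append((p, f))
        for e in sorted(table):
            best = sorted(table[e], reverse=True)
            print(e + " -> " + " ".join("%s: %.2f" % (f, p) for p, f in best if p >= 0.01))

Run it exactly as in Problem 1, e.g. paste -d "\n" toy.fr toy.en | ./ibm.py 5.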
4. EM works better with (1) more data and (2) shorter sentences. Try to demonstrate the effect of (2) by concatenating every two sentences into one:

       cat tiny.en | awk '{printf("%s ", $0); if (NR % 2 == 0) printf("\n")}' > tiny.en.2

   You can similarly group every 3, 4, or 6 sentences, or even combine all sentences together:

       cat tiny.en | awk '{printf("%s ", $0); }' > tiny.en.all

   Demonstrate the gradual worsening of the learned model as the sentences get longer. (A Python alternative to these awk one-liners is sketched at the end of this HW.)

Finally, write a short paragraph summarizing all your observations in these problems.
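If you prefer Python to awk for Problem 4, the following hypothetical helper (group.py, not part of the dataset) does the same job: it concatenates every N consecutive input lines into one output line. Whichever tool you use, remember to apply the same grouping to both sides of the bitext (e.g., tiny.es and tiny.en) so the sentence pairs stay aligned.

    #!/usr/bin/env python
    # group.py (hypothetical helper): concatenate every N consecutive input lines
    # into one output line, like the awk one-liners above.
    # Usage: cat tiny.en | ./group.py 3 > tiny.en.3   (and the same for tiny.es)
    import sys

    def group(stream, n):
        buf = []
        for line in stream:
            buf.append(line.strip())
            if len(buf) == n:
                yield " ".join(buf)
                buf = []
        if buf:               # flush a final partial group, if any
            yield " ".join(buf)

    if __name__ == "__main__":
        n = int(sys.argv[1])
        for merged in group(sys.stdin, n):
            print(merged)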