CS430 Programming Assignment 2: Probabilistic Language Models

Due Date: Wednesday, October 29.

In this assignment, you will fit a trigram language model to English and then use it to generate new English text.

A unigram model of English consists of a single probability distribution P(W) over the set of all words.

A bigram model of English consists of two probability distributions: P(W₀) and P(W_i | W_i-1). The first distribution is just the probability of the first word in a document. The second distribution is the probability of seeing word W_i given that the previous word was W_i-1.

A trigram model of English consists of three probability distributions: P(W₀), P(W₁|W₀), and P(W_i|W_i-1,W_i-2). The first distribution is, as above, the probability of the first word in the document. The next distribution is the probability of the second word given the first one. And the third distribution is the probability of the ith word given the two preceding words.
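As a sketch of how the three distributions fit together (assuming a document of n+1 words W_0 through W_n, an indexing not spelled out in the handout), the probability the trigram model assigns to a whole document can be written as:

```latex
P(W_0, W_1, \ldots, W_n)
  = P(W_0)\, P(W_1 \mid W_0) \prod_{i=2}^{n} P(W_i \mid W_{i-1}, W_{i-2})
```

Each word after the second is conditioned only on the two words immediately before it.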

Given a set of documents (in our case, various novels and short stories), your job in this assignment is to fit a trigram model of English. I recommend that you do this by using a hash table in which you hash on word W_i-2. Each cell of the hash table holds linked lists, as shown below. Each item in the main list holds a word that appeared at position W_i-1. It also contains a pointer to a second-level linked list of the words that appeared at position W_i.

In particular, this structure encodes the fact that in our training data, we observed the following three-word sequences beginning with "finger":

finger remarked holmes
finger on it
finger on it
finger in the
finger . then
finger . all
finger . it

Notice that "finger on it" was observed twice. Also notice that the period is treated as a separate word.

Given the information in this data structure, we can compute the probability that "it" follows "finger on", P(it | finger, on), as 2/2 = 1. Similarly, we can compute the probability that "it" follows "finger .", P(it | finger, .), as 1/3.
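The following is a minimal sketch of this counting scheme, not the required implementation. It assumes Python and swaps the hash table of linked lists for nested dictionaries (which are themselves hash tables). The names `count_triples` and `trigram_probability` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict

def count_triples(triples):
    """Return counts[w2][w1][w] = number of times the sequence (w2, w1, w) was seen."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for w2, w1, w in triples:
        counts[w2][w1][w] += 1
    return counts

def trigram_probability(counts, w2, w1, w):
    """Estimate the probability that w follows the pair (w2, w1)."""
    followers = counts.get(w2, {}).get(w1, {})
    total = sum(followers.values())
    # An unseen pair has no followers; return 0.0 rather than dividing by zero.
    return followers.get(w, 0) / total if total else 0.0

# The seven sequences listed in the handout.
observed = [
    ("finger", "remarked", "holmes"),
    ("finger", "on", "it"),
    ("finger", "on", "it"),
    ("finger", "in", "the"),
    ("finger", ".", "then"),
    ("finger", ".", "all"),
    ("finger", ".", "it"),
]
counts = count_triples(observed)

print(trigram_probability(counts, "finger", "on", "it"))  # 2/2
print(trigram_probability(counts, "finger", ".", "it"))   # 1/3
```

Storing a count per word, rather than one list node per occurrence, is a design choice made here for brevity; either representation supports the same probability estimates.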

Data Files

I have obtained and processed the following data files:

Each file contains lowercase and uppercase letters, blanks, and periods. All other punctuation has been removed. Question marks and exclamation marks were converted to periods. When you read in the files, please convert all uppercase letters to lowercase.
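One possible way to turn such a file into a word sequence is sketched below, assuming Python. The helper names `tokenize` and `read_tokens` are hypothetical, the UTF-8 encoding is an assumption about the files, and any character other than a letter or a period is simply treated as a separator.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, with each period as its own word."""
    return re.findall(r"[a-z]+|\.", text.lower())

def read_tokens(path):
    """Read one data file and return its words (file encoding is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return tokenize(handle.read())

print(tokenize("Holmes put his finger on it. Then he left."))
# → ['holmes', 'put', 'his', 'finger', 'on', 'it', '.', 'then', 'he', 'left', '.']
```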

Assignment

Using the two Sherlock Holmes books, train a trigram language model by constructing the hash/linked list data structure described above. Then use this data structure to generate a new "story" 1000 words long. You can do this very simply: first choose a word at random from the hash table, then use it to choose a second word. From then on, extend the text by looking up the last two words and choosing the next word at random from among the words that followed them, in proportion to their frequency of appearance.
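The generation procedure described above could be sketched as follows, assuming Python and a training text already split into a list of words. This is an illustration under stated assumptions, not the required design: it keys a flat dictionary on the pair of preceding words instead of using the nested hash/linked list structure, and it stops early if the last two words were never followed by anything in the training data (the handout does not say how to handle that case). The names `fit`, `weighted_choice`, and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit(words):
    """Count which words follow each single word and each pair of words."""
    after_one = defaultdict(Counter)  # w0 -> counts of the next word
    after_two = defaultdict(Counter)  # (w0, w1) -> counts of the next word
    for a, b in zip(words, words[1:]):
        after_one[a][b] += 1
    for a, b, c in zip(words, words[1:], words[2:]):
        after_two[(a, b)][c] += 1
    return after_one, after_two

def weighted_choice(counter, rng):
    """Pick one key at random, in proportion to its count."""
    return rng.choices(list(counter.keys()), weights=list(counter.values()), k=1)[0]

def generate(words, length, seed=None):
    """Generate up to `length` words from a trigram model fit to `words`."""
    rng = random.Random(seed)
    after_one, after_two = fit(words)
    first = rng.choice(list(after_one))           # uniform choice of a starting word
    text = [first, weighted_choice(after_one[first], rng)]
    while len(text) < length:
        followers = after_two.get((text[-2], text[-1]))
        if not followers:
            break  # assumption: stop at a dead end instead of restarting
        text.append(weighted_choice(followers, rng))
    return text[:length]

training = "the cat sat on the mat . the cat sat on the hat . the cat ran .".split()
print(" ".join(generate(training, 12, seed=0)))
```

Passing a `seed` makes a run repeatable, which can help when comparing the two generated stories.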

Repeat this process, but now train on all six books and then generate a new "story" of 1000 words.

What to Turn In

Turn in your two generated texts and your source code listing. In addition, please write a one-page analysis of the generated texts addressing the following points:

What knowledge do human beings have that is NOT captured by the trigram language model, and whose absence therefore causes the model to generate nonsense? Write down at least 5 facts that, if the computer could somehow know them, would allow it to generate more sensible language.
Compare the two different texts. If your experience is like mine, the first text will be more coherent than the second one. Speculate about why this is true.