CS 381, Spring 2002
Programming Assignment 6
Due: Monday, May 20

Yet More on Python: Imatation

The third Python programming assignment we will do is also a classic problem, one that has been rewritten many times in many languages as an illustration of the use of associative arrays (or dictionaries). The program is called ``imitation'', since it builds an imitation of natural language.

Like the last assignment, this program works in two phases. There is a learning phase, where the program builds up a large database of knowledge. This is followed by the printing phase, where the output is produced.

In the learning phase, the program creates a large associative array of N-letter ``words''. (words being quoted because we simply mean strings of characters, and not words in the english sense). For our assignment we will use N=4. (You might want to experiment with other values of N). As in the last programming assignment, our basic data structure will be an associative array (or dictionary), whose elements are an array of strings. The elements of this array will be those letters that have been observed to follow the key in the text over which learning was performed.

To illustrate, suppose we were learning using the first paragraph of this assignment. We first translate the line into all lower case. We then observe that following the characters ``the '' comes the letter t. Following the characters ``he t'' comes the letter h. Following the characters ``e th'' comes the letter i. And so on. Some four letter sequences eventually have two different followers. For example, the characters " pro" can be followed either by g (as in programming) or by b (as in problem). If you have your program learn over a longer input set, then the follow sets become even longer.

You have almost all the Python background you need to do the first part, the learning step, except for a few items concerning the manipulation of strings. The following starting point will illustrate the few features you require:

#
# yet another Python program
#

import sys
import string

f = open("text2")
line = f.readline()
text = "    "
while line:
	line = string.rstrip(line)  # strip newline
	text = text + string.lower(line) # make lowercase
	length = len(text) - 4
	i = 0
	while i < length:
		str = text[i:i+4]  # get base
		nextChar = text[i+4:i+5] # get follow
		print "following:" + str + ": is :" + nextChar + ":"
		i = i + 1
	text = text[i:] + " "
	line = f.readline()
print "done, remaining text is:" + text
After opening a file, we read the contents line by line. The rstrip for removing the trailing newline you have seen already in previous assignments. The function lower converts the contents to all lower case. A while loop then cycles over the input. Notice we have seeded the input with a text line that contains four blank spaces. The line read from the input is appended to the end of this. We strip off each block of four characters. The syntax str[i:j] is used to extract the values of a string with index positions greater than or equal to i and less than j. If the upper bound is left off, the end of string is assumed. We have used this at the end of the while loop, to maintain the last few characters of a line, and subsequently append them to the text of the next line.

Try running this program with an input file of two lines, and carefully observe the output.

To create the first step of the imatation program, create a dictionary named follow. For each substring you create in the inner loop, make sure the string is a legal index in follow, making the entry into an empty list if not. This is just like we did in the previous programming assignment. Then append the following character to the list you get by indexing follow with the string.

When you are finished with the loop, the variable text will still have the last five characters of the file. Take the first four of these, and add the end marker character "$" to the follow set. (The program tends to get into an infinite loop if you don't have an end marker to stop things.)

I suggest just writing this part and printing out your follow set when you have built it.

The second phase is the output step. You are going to maintain two string variables. The first is the current four letter ``word'', and the second is the result. Initialize the current word to those initial four blank spaces. We then look in the follow set for the word, and select a random element from the list of possibilities. (You learned in the previous programming assignment how to select a random element from a list). This random element should be one letter long. Catenate the letter both to the result and to the current word, then use the substring operation to strip the first letter from the current word. Loop as long as the current word is a key in your follow dictionary. When you finally reach a word that has no follow entry in the dictionary, terminate the loop and print the result.

Hints on debugging. Debugging, as always, is tricky. Work with a very small input file. Typically the smaller the input file the less variation you will see in the results. (Since most follow sets will have only one entry). After you think you have it working, try using a larger input file.

I've put one example input in a file named text in /usr/local/classes/cs/cs381/data Here is an example output of my program on the given text:

of social activity as we do not live in the use the real world alone, 
nor alone adjusts to imagine the use the real world is to a largely as 
we do become that language habits of experience very much at the 
real world is merely an illusion the language habits of society.
it is quite and that language habits of social active world is that 
one in the matter is to a large extent

Of course, not all outputs are so long. Here is the output from another run:

and hear

Some other interesting input files are the files Doyle and hamlet in /usr/local/classes/cs/cs/381/data. If anybody comes up with another file I'll post it.

Name your program progSix, and hand in as usual.