CS 381, Spring 2002
Programming Assignment 4
Due: Monday, May 6

Introduction to Python

In this assignment we will get introduced to the language Python. Python is the creation of a friend of mine named Guido van Rossum. Python is a scripting language, is the spirit of csh, Perl, Awk, Ruby, and many others. Like Lisp, it can be interactive (although we will not be using it in that fashion). Python is a dynamically typed language; meaning that you don't need to declare variables, just assign them a value before you use them.

Our task will be to produce a summary from log files generated by our departmental web server, compressing the information into summary statistics. But before we get around to talking about the actual application, lets work through an introduction to Python.

Running Python Programs

I've placed a short file named colors in the directory /usr/local/classes/cs/cs381/data/colors. you may want to copy the file into your own directory to do this part. It contains lines like the following:

red:blue:red:green:yellow
To experiment with python, enter the following program into a file named pytest:
#
# 	my first python program
#
import sys
import string
print "hello world!"
f = open("colors")
line = f.readline()
while line:
	line = string.rstrip(line)
	print line
	line = f.readline()
print "thats all, folks!"
f.close()
All this program does is print a line of test, open the file named colors, then loop over the file, printing each line in turn. A few things to note about the language. The sharp sign in a comment character, and after this character the rest of the line is ignored. The import statement is similar to Java, indicating we want to include a library named sys. The language is somewhat unusual in that it does not use open and closing braces for statement bracketing, instead using indenetation for this purpose. (Actually Guido copied this idea from an earlier language, named ABC. I met Guido when he and I were both working on the ABC project in Amsterdam. Unfortunately, the language ABC was never widely used.) The call on rstrip is designed to remove the trailing newline character from the string. The syntax indicates that this is the function named rstrip from the library named string. (The same library included by the import statement). You can try removing this and see what the difference is. (Can you explain the difference?)

To run the program, execute the following statement on one of the engr machines:

	python pytest
You should get several lines of output.

Breaking Strings Apart

The next step will be to break a string into parts. Our main tool for this is another function from the string library, this one named splitfields. Try placing the following three lines in the body of the loop, after you have printed the line of text:

	words = string.splitfields(line, ":")
	print words
	print words[0]
Now run your program and see what happens. The splitfields function is returning a list. The arguments are the string being split, and the delimiters used to determine the splits. In this case, we are using the colon character as our separator. A list is a bit of a cross between the Lisp style lists and an array. Like an array, it can be indexed. Here we are printing out the zeroth element in the collection. (Like C and Java, list index values begin at zero).

Looping over elements in a list

You can make a loop that iterates over the elements of a list using a for statement. Try adding the following to the body of your loop:

	words = string.splitfields(line, ":")
	for word in words:
		print "word is:" + word
This also shows the use of the plus operator as string catenation. Run your program and examine the output.

We have seen already one while loop. Here is another, one that will print out each element in the list. This loop also illustrates how to determine the length of a list, and the fact that lists can be indexed like an array.

	words = string.splitfields(line, ":")
	i = 0
	while i < len(words):
		print "word " + words [ i ]
		i = i + 1
	print "ended loop"

Counting Values

Our next change will be to count the number of occurrences of each color word in our input. To do this we will use yet another type of data structure, a dictionary (sometimes called a map or an associative array). A dictionary is an indexed data structure, like an array. However, unlike an array, the elements of a dictionary can be anything at all.

The following shows how a new dictionary is declared, and how values can be added into a dictionary. It also shows the syntax for the if statement. Notice how indentation is used to indicate the body of all statements.

line = f.readline()
counts = {}
while line:
	line = string.rstrip(line);
	words = string.splitfields(line,":")
	for word in words:
		if counts.has_key(word):
			counts [ word ] = counts [ word ] + 1
		else:
			counts [ word ] = 1
	line = f.readline()
print counts
print counts.keys()
The next to the last statement will print out the entire counts dictionary. This is followed by a statement that prints out only the key portion of this dictionary.

If we wanted to print out the counts, we could use a loop like the following. Since the plus operator works only with two strings, we have to use another built-in, named str(), to convert an integer into a string:

sc = counts.keys()
for k in sc:
	print "word " + k + " count " + str(counts[k])

Making a sorted list

Notice the keys printed out in the last program were not sorted. We can fix that, using the fact that a list has a method named sort that rearranges the values in order. To see how this is done, replace the final print statement above with the following:
sc = counts.keys()
sc.sort()
print sc
Notice that this time, the elements are sorted.

Thats fine if we want to sort based on the keys. But sometimes we want to sort based on some other criteria. For example, suppose we wanted to print the colors in decreasing order of occurrence. To do this, we need to define a special comparison function. Functions in Python are written using the def keyword. Typically these are written at the beginning of the file, after the imports but before the main program. Try entering the following definition:

def countcomp(a, b):
	if counts[a] < counts[b]:
		return 1
	else:
		return -1

This function takes two arguments, and returns 1 if the value in the counts array indexed by the argument is less than the value in the counts array indexed by the second argument. Otherwise, it return -1.

Now, pass the name of this function to the sort procedure

sc.sort(countcomp)
Now look at the output. How has it changed? You can experiment with this to see if you can get other effects.

Warm Up Exercise

Now, as a warm up exercise to the programming assignment, see if you can figure out how to print, in sorted order, each color plus the number of occurrences of the color in the input file. That is, we want output such as the following:
blue : 2
green : 2
red : 6
yellow : 3
Then, see if you can print the values in descreasing order of occurrence.
red : 6
yellow : 3
blue : 2
green : 2

The Programming Assignment

We are finally ready to describe the programming assignment. The file /usr/local/classes/cs/cs381/data/weblog has a log file produced by our web server. Warning, this file is quite large. In order to help your debugging, I have generated two smaller files. The file /usr/local/classes/cs/cs381/data/firstthree has just the first three lines of data, while the file /usr/local/classes/cs/cs381/data/firsthundred has approximately one hundred lines of input. You are strongly urged to debug using these smaller files, since otherwise you risk generating huge amounts of debugging information.

Each line of the log file consists of seven fields, separated by spaces. These represent the following:

  1. The remote host machine
  2. The remote language. (Often this is unknown, in which case it is represented by a dash).
  3. The remote user id. (Again, often just a dash).
  4. The date and time of access
  5. The first line of the command received
  6. The status of the access
  7. The number of bytes transmitted.

Your first task is to see if you can read a file of input, and break each line into these seven parts.

Next, notice the data and time field. This is a compound field, separted by colons. Since this is a log file from a single day, the date part is not very interesting. But the other parts are useful. See if you can extract the hour and minuit parts of this entry for each line of input.

For this assignment, I want you create a program named progFour. This program should read the log file, and generate two tables. One table is indexed by the remote host name, and counts the number of connections from that host. The second table (unrelated to the first) is indexed by the hourly time, and counts the number of accesses each hour. When you have finished gathering the information, I want you to print the ten remote host sites that have generated the most accesses, in order. This should be followed by a table of hourly access counts, showing the number of accesses each hour.

Here is an example of the output:

top ten remote host sites
1. ca.engr.orst.edu:9872
2. joe.bre.orst.edu:2091
3. marvin.northernlight.com:799
4. 217.33.60.249:791
5. www.orst.edu:753
6. dh119-16.engr.orst.edu:603
7. 12.111.165.88:588
8. dh119-15.engr.orst.edu:517
9. dh119-4.engr.orst.edu:471
10. dh119-24.engr.orst.edu:429

Hourly access counts
time 00 accesses 601
time 01 accesses 209
time 02 accesses 346
time 03 accesses 10257
time 04 accesses 220
time 05 accesses 330
time 06 accesses 201
time 07 accesses 356
time 08 accesses 476
time 09 accesses 961
time 10 accesses 1257
time 11 accesses 1277
time 12 accesses 2290
time 13 accesses 3158
time 14 accesses 3118
time 15 accesses 2678
time 16 accesses 2593
time 17 accesses 3010
time 18 accesses 3647
time 19 accesses 2766
time 20 accesses 3579
time 21 accesses 3617
time 22 accesses 2494
time 23 accesses 2144

As always, once you are finished I want you to submit the assignment using the college of engineering web-based submission form.