Introduction to Python
In this assignment we will get introduced to the language Python. Python is the creation of a friend of mine named Guido van Rossum. Python is a scripting language, is the spirit of csh, Perl, Awk, Ruby, and many others. Like Lisp, it can be interactive (although we will not be using it in that fashion). Python is a dynamically typed language; meaning that you don't need to declare variables, just assign them a value before you use them.
Our task will be to produce a summary from log files generated by our departmental web server, compressing the information into summary statistics. But before we get around to talking about the actual application, lets work through an introduction to Python.
I've placed a short file named colors in the directory /usr/local/classes/cs/cs381/data/colors. you may want to copy the file into your own directory to do this part. It contains lines like the following:
red:blue:red:green:yellowTo experiment with python, enter the following program into a file named pytest:
#
# my first python program
#
import sys
import string
print "hello world!"
f = open("colors")
line = f.readline()
while line:
line = string.rstrip(line)
print line
line = f.readline()
print "thats all, folks!"
f.close()
All this program does is print a line of test, open the file named
colors, then loop over the file, printing each line in turn.
A few things to note about the language. The sharp sign in
a comment character, and after this character the rest of the line
is ignored.
The import statement is similar to Java, indicating we want to
include a library named sys. The language is somewhat unusual
in that it does not use open and closing braces for statement
bracketing, instead using indenetation for this purpose.
(Actually Guido copied this idea from an earlier language,
named ABC. I met Guido when he and I were both working on
the ABC project in Amsterdam. Unfortunately, the language ABC
was never widely used.) The call on rstrip is designed to
remove the trailing newline character from the string.
The syntax indicates that this is the function named
rstrip from the library named string. (The same library
included by the import statement).
You can try removing this and see what the difference is.
(Can you explain the difference?)
To run the program, execute the following statement on one of the engr machines:
python pytestYou should get several lines of output.
The next step will be to break a string into parts. Our main tool for this is another function from the string library, this one named splitfields. Try placing the following three lines in the body of the loop, after you have printed the line of text:
words = string.splitfields(line, ":") print words print words[0]Now run your program and see what happens. The splitfields function is returning a list. The arguments are the string being split, and the delimiters used to determine the splits. In this case, we are using the colon character as our separator. A list is a bit of a cross between the Lisp style lists and an array. Like an array, it can be indexed. Here we are printing out the zeroth element in the collection. (Like C and Java, list index values begin at zero).
You can make a loop that iterates over the elements of a list using a for statement. Try adding the following to the body of your loop:
words = string.splitfields(line, ":") for word in words: print "word is:" + wordThis also shows the use of the plus operator as string catenation. Run your program and examine the output.
We have seen already one while loop. Here is another, one that will print out each element in the list. This loop also illustrates how to determine the length of a list, and the fact that lists can be indexed like an array.
words = string.splitfields(line, ":") i = 0 while i < len(words): print "word " + words [ i ] i = i + 1 print "ended loop"
The following shows how a new dictionary is declared, and how values can be added into a dictionary. It also shows the syntax for the if statement. Notice how indentation is used to indicate the body of all statements.
line = f.readline()
counts = {}
while line:
line = string.rstrip(line);
words = string.splitfields(line,":")
for word in words:
if counts.has_key(word):
counts [ word ] = counts [ word ] + 1
else:
counts [ word ] = 1
line = f.readline()
print counts
print counts.keys()
The next to the last statement will print out the entire counts dictionary.
This is followed by a statement that prints out only the key portion
of this dictionary.
If we wanted to print out the counts, we could use a loop like the following. Since the plus operator works only with two strings, we have to use another built-in, named str(), to convert an integer into a string:
sc = counts.keys() for k in sc: print "word " + k + " count " + str(counts[k])
sc = counts.keys() sc.sort() print scNotice that this time, the elements are sorted.
Thats fine if we want to sort based on the keys. But sometimes we want to sort based on some other criteria. For example, suppose we wanted to print the colors in decreasing order of occurrence. To do this, we need to define a special comparison function. Functions in Python are written using the def keyword. Typically these are written at the beginning of the file, after the imports but before the main program. Try entering the following definition:
def countcomp(a, b): if counts[a] < counts[b]: return 1 else: return -1This function takes two arguments, and returns 1 if the value in the counts array indexed by the argument is less than the value in the counts array indexed by the second argument. Otherwise, it return -1.
Now, pass the name of this function to the sort procedure
sc.sort(countcomp)Now look at the output. How has it changed? You can experiment with this to see if you can get other effects.
blue : 2 green : 2 red : 6 yellow : 3Then, see if you can print the values in descreasing order of occurrence.
red : 6 yellow : 3 blue : 2 green : 2
Each line of the log file consists of seven fields, separated by spaces. These represent the following:
Your first task is to see if you can read a file of input, and break each line into these seven parts.
Next, notice the data and time field. This is a compound field, separted by colons. Since this is a log file from a single day, the date part is not very interesting. But the other parts are useful. See if you can extract the hour and minuit parts of this entry for each line of input.
For this assignment, I want you create a program named progFour. This program should read the log file, and generate two tables. One table is indexed by the remote host name, and counts the number of connections from that host. The second table (unrelated to the first) is indexed by the hourly time, and counts the number of accesses each hour. When you have finished gathering the information, I want you to print the ten remote host sites that have generated the most accesses, in order. This should be followed by a table of hourly access counts, showing the number of accesses each hour.
Here is an example of the output:
top ten remote host sites 1. ca.engr.orst.edu:9872 2. joe.bre.orst.edu:2091 3. marvin.northernlight.com:799 4. 217.33.60.249:791 5. www.orst.edu:753 6. dh119-16.engr.orst.edu:603 7. 12.111.165.88:588 8. dh119-15.engr.orst.edu:517 9. dh119-4.engr.orst.edu:471 10. dh119-24.engr.orst.edu:429 Hourly access counts time 00 accesses 601 time 01 accesses 209 time 02 accesses 346 time 03 accesses 10257 time 04 accesses 220 time 05 accesses 330 time 06 accesses 201 time 07 accesses 356 time 08 accesses 476 time 09 accesses 961 time 10 accesses 1257 time 11 accesses 1277 time 12 accesses 2290 time 13 accesses 3158 time 14 accesses 3118 time 15 accesses 2678 time 16 accesses 2593 time 17 accesses 3010 time 18 accesses 3647 time 19 accesses 2766 time 20 accesses 3579 time 21 accesses 3617 time 22 accesses 2494 time 23 accesses 2144
As always, once you are finished I want you to submit the assignment using the college of engineering web-based submission form.