CS430 Class Project: Spam Filter
The goal of this project is to construct an email spam filter using machine learning
techniques.
Minimum Requirements
- Your program should accept as input (on stdin) an email message (including
headers). The program will be invoked in two modes: training mode and prediction mode.
(A minimal skeleton illustrating both modes appears after this list.)
- Training Mode. Your program should look for the special header
"X-SPAM-LABEL: TRUE" or "X-SPAM-LABEL: FALSE". If either of these headers is present,
the message should be treated as a training example: your program should update its
learned statistics from the message and then exit normally with no output.
- Prediction Mode. If the special header is not present, then the program is
being invoked in prediction mode. It should analyze the email and then produce two
outputs:
- a revised email message (sent to stdout) with an extra header that says
"X-SPAM-PREDICTION: TRUE" or "X-SPAM-PREDICTION: FALSE".
- a return code that is 0 if the message is spam and 1 otherwise.
- Your program should keep its learned spam statistics in a file, reading them in
at startup and writing them out before exiting.
- Your program should implement some form of adaptation so that more recent training
examples are given higher weight than older ones.
- Your program should have external documentation that describes how to invoke it.
Your code should be well-documented internally as well.
- Your program should provide some way for the user to adjust the tradeoff between
false positive and false negative errors. This could be done either through a
command-line argument or through a configuration file.
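To make the input/output contract concrete, here is a minimal Python sketch of the
two-mode protocol. It is only illustrative: the statistics file name, the JSON format,
and the crude token-overlap score are placeholder choices rather than requirements, and
a real submission would replace the scoring with something like the Naive Bayes
approach suggested below.

    import email
    import json
    import os
    import sys

    STATS_FILE = "spam_stats.json"  # hypothetical name for the statistics file

    def load_stats():
        # Read learned statistics at startup; start fresh if none exist yet.
        if os.path.exists(STATS_FILE):
            with open(STATS_FILE) as f:
                return json.load(f)
        return {"spam": {}, "ham": {}, "nspam": 0, "nham": 0}

    def save_stats(stats):
        # Write statistics back out so learning persists across invocations.
        with open(STATS_FILE, "w") as f:
            json.dump(stats, f)

    def tokens(msg):
        # Crude whitespace tokenization of the whole message (headers + body);
        # see the "Tokens" suggestion below for better schemes.
        return msg.as_string().lower().split()

    def main():
        msg = email.message_from_file(sys.stdin)
        stats = load_stats()
        label = msg.get("X-SPAM-LABEL")

        if label is not None:
            # Training mode: count tokens under the given label, save the
            # statistics, and exit normally with no output.
            bucket = "spam" if label.strip().upper() == "TRUE" else "ham"
            stats["n" + bucket] += 1
            for t in tokens(msg):
                stats[bucket][t] = stats[bucket].get(t, 0) + 1
            save_stats(stats)
            return 0

        # Prediction mode: score the message (here a naive token-overlap
        # placeholder), add the prediction header, echo the revised message
        # to stdout, and signal the decision via the return code.
        spam_hits = sum(1 for t in tokens(msg) if t in stats["spam"])
        ham_hits = sum(1 for t in tokens(msg) if t in stats["ham"])
        is_spam = spam_hits > ham_hits
        msg["X-SPAM-PREDICTION"] = "TRUE" if is_spam else "FALSE"
        sys.stdout.write(msg.as_string())
        return 0 if is_spam else 1

    if __name__ == "__main__":
        sys.exit(main())

Invoked as "python3 filter.py < message.txt" (the file name filter.py is arbitrary),
the sketch trains silently on labeled messages and otherwise echoes the message with
the prediction header added.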
Suggested Approaches
- Naive Bayes Classifier. In class, we have studied the Naive Bayes
classifier. This is a good place to start. Your major design decisions will be what
tokens to analyze and whether to analyze the headers and the body of the email
separately. (A sketch of the scoring computation appears after this list.)
- Tokens. Follow the pointers on the class web page to "A Plan for Spam" and the
"CRM114" system; both suggest good ways of defining tokens.
- Tuning the classification threshold. Naive Bayes estimates
P(spam|message), but you must choose a cutoff for making your classification decision.
You can adjust that cutoff to try to get the false positives to zero. My evaluation
criterion for a spam filter is "Percent of false negatives when the false positives are
zero". That is, I suggest that you tune the filter to make no false positive errors and
then measure the number of false negatives.
- Training on public collections. There are some collections of spam and
non-spam email available on the web. You may use these for development and testing.
Example: LingSpam.
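To make the Naive Bayes and threshold suggestions concrete, here is one possible
scoring function, sketched in Python against the hypothetical statistics format used in
the skeleton above (toks would come from something like the tokens() helper there). The
decay function shows one simple way to meet the adaptation requirement; the decay
factor shown is illustrative, not prescribed.

    import math

    def nb_spam_probability(stats, toks):
        # Multinomial Naive Bayes with Laplace (add-one) smoothing, computed
        # in log space for numerical stability. `stats` holds per-class token
        # counts and message counts as in the sketch above.
        nspam = max(stats["nspam"], 1)
        nham = max(stats["nham"], 1)
        total_spam = sum(stats["spam"].values())
        total_ham = sum(stats["ham"].values())
        vocab_size = max(len(set(stats["spam"]) | set(stats["ham"])), 1)

        log_odds = math.log(nspam / nham)  # prior log-odds of spam
        for t in toks:
            p_t_spam = (stats["spam"].get(t, 0) + 1) / (total_spam + vocab_size)
            p_t_ham = (stats["ham"].get(t, 0) + 1) / (total_ham + vocab_size)
            log_odds += math.log(p_t_spam / p_t_ham)

        log_odds = max(min(log_odds, 30.0), -30.0)  # clamp to avoid overflow
        return 1.0 / (1.0 + math.exp(-log_odds))    # P(spam | message)

    def decay_counts(stats, factor=0.99):
        # One simple way to weight recent examples more heavily: multiply
        # every stored count by a factor slightly below 1 before each
        # training update, so old examples fade away exponentially.
        for bucket in ("spam", "ham"):
            stats[bucket] = {t: c * factor for t, c in stats[bucket].items()}
        stats["nspam"] *= factor
        stats["nham"] *= factor

In prediction mode you would compare nb_spam_probability(stats, toks) against a
user-adjustable cutoff; raising the cutoff from 0.5 toward 0.99 trades more false
negatives for fewer false positives, which is how you would tune toward the
zero-false-positive criterion described above.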
Using Existing Sources
- You must implement all of your code yourself. You may not copy verbatim code from
other packages.
- You may borrow ideas from other packages available on the web. You must cite any
ideas that you incorporate, giving full details (URL, authors' names).
Grading
The following criteria will be considered when assigning the project grade.
- Functionality (40 points). Does the program execute successfully and filter email?
- Code Quality and Documentation (20 points). Is the program well-documented
(internally and externally)?
- Accuracy (20 points). How accurate is the program? The TA will execute your
spam filter on a training set and a test set of email messages. Your submission can
include a "pre-trained" filter which will then be further trained by the TA as part of the
grading process.
- Novelty (20 points). Does the program include some innovations beyond the
basic Naive Bayes classifier and beyond publicly available open-source spam filters?
Alternative Projects
If you wish, you may propose an alternative project on a topic of your choice. Projects
should be either programming-oriented or research-oriented. For a programming project, I
am looking for software that incorporates significant reasoning or learning (or both).
For a research project, I am looking for a project that would involve reading at least 10
scientific papers and writing a paper (of at least 10 pages) that summarizes and
critically analyzes those papers.
Here are some examples of possible projects:
- Write a reinforcement learning program for playing Tetris. There are several
web-published programs that learn to play Tetris. You could implement and compare two
methods for this.
- Minesweeper. The textbook suggests a project to implement a Minesweeper agent.
This is quite similar to the Wumpus world and would require significant logical
reasoning. There is code available with the textbook that implements some forms of
first-order logical inference.
- Prediction Suffix Trees. In her PhD thesis, Dana Ron developed a data structure for
language modeling called the Prediction Suffix Tree (PST). It is a kind of variable-depth
Markov model that adaptively decides how far back in time to consider previous words in
order to predict the next word. It generally gives much better predictive accuracy than
simple n-gram language models. You could implement PSTs and train and test them on our
novels from Programming Assignment 2. (A toy sketch of the variable-memory idea appears
at the end of this page.) See The Power of Amnesia: Learning Probabilistic Automata
with Variable Memory Length by Dana Ron, Yoram Singer, and Naftali Tishby (Machine
Learning, 1996).
- Relational Probabilistic Models (RPMs/PRMs). In her PhD thesis, Lise Getoor
developed an extension of Bayesian networks that can handle relational data. This is a
step towards first-order probabilistic logics. I am currently teaching a class on this
subject, CS 539: Probabilistic Relational Models. The class web page has pointers to a
wide variety of papers on RPMs. You could read a subset of these and write a paper that
critically evaluates them (e.g., by showing examples of cases that they can and cannot
handle).
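For a flavor of the variable-memory idea behind PSTs (mentioned above), here is a toy
Python sketch. It is not Ron et al.'s learning algorithm, which additionally prunes
contexts using a statistical test; it simply counts next-word frequencies for contexts
of increasing length and, at prediction time, backs off to shorter contexts when the
longer ones have too little data. The function names and the min_count cutoff are
illustrative.

    from collections import defaultdict

    def train_vmm(words, max_depth=3):
        # Count next-word frequencies for every context of length 0..max_depth.
        counts = defaultdict(lambda: defaultdict(int))
        for i, w in enumerate(words):
            for d in range(0, max_depth + 1):
                if i - d < 0:
                    break
                counts[tuple(words[i - d:i])][w] += 1
        return counts

    def predict(counts, history, max_depth=3, min_count=5):
        # Use the longest suffix of the history that was seen often enough;
        # otherwise back off to a shorter context (down to the empty context).
        for d in range(min(max_depth, len(history)), -1, -1):
            nexts = counts.get(tuple(history[len(history) - d:]))
            if nexts and sum(nexts.values()) >= min_count:
                return max(nexts, key=nexts.get)
        return None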