CS430 Class Project: Spam Filter
The goal of this project is to construct an email spam filter using machine learning
techniques.
Minimum Requirements
- Your program should accept as input (on stdin) an email message (including
headers). The program will be invoked in two modes: training mode and prediction mode.
(A minimal skeleton illustrating both modes appears after this list.)
- Training Mode. Your program should look for the special header
"X-SPAM-LABEL: TRUE" or "X-SPAM-LABEL: FALSE". If either of these headers is present,
the message should be treated as a training example: your program should update its
learned statistics from the message and then exit normally with no output.
- Prediction Mode. If the special header is not present, then the program is
being invoked in prediction mode. It should analyze the email and then produce two
outputs:
- a revised email message (sent to stdout) with an extra header that says
"X-SPAM-PREDICTION: TRUE" or "X-SPAM-PREDICTION: FALSE".
- a return code that is 0 if the message is spam and 1 otherwise.
- Your program should keep its learned spam statistics in a file, reading them in
at startup and writing them out before exiting.
- Your program should implement some form of adaptation so that more recent training
examples are given higher weight than older ones.
- Your program should have external documentation that describes how to invoke it.
Your code should be well-documented internally as well.
- Your program should provide some way for the user to adjust the tradeoff between
false positive and false negative errors. This could be done either through a
command-line argument or through a configuration file.
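To make the input/output contract concrete, here is a minimal Python sketch of the
two-mode protocol. It is only illustrative: the statistics file name, the JSON format,
and the crude token-overlap score are placeholder choices rather than requirements, and
a real submission would replace the scoring with something like the Naive Bayes
approach suggested below.

    import email
    import json
    import os
    import sys

    STATS_FILE = "spam_stats.json"  # hypothetical name for the statistics file

    def load_stats():
        # Read learned statistics at startup; start fresh if none exist yet.
        if os.path.exists(STATS_FILE):
            with open(STATS_FILE) as f:
                return json.load(f)
        return {"spam": {}, "ham": {}, "nspam": 0, "nham": 0}

    def save_stats(stats):
        # Write statistics back out so learning persists across invocations.
        with open(STATS_FILE, "w") as f:
            json.dump(stats, f)

    def tokens(msg):
        # Crude whitespace tokenization of the whole message (headers + body);
        # see the "Tokens" suggestion below for better schemes.
        return msg.as_string().lower().split()

    def main():
        msg = email.message_from_file(sys.stdin)
        stats = load_stats()
        label = msg.get("X-SPAM-LABEL")

        if label is not None:
            # Training mode: count tokens under the given label, save the
            # statistics, and exit normally with no output.
            bucket = "spam" if label.strip().upper() == "TRUE" else "ham"
            stats["n" + bucket] += 1
            for t in tokens(msg):
                stats[bucket][t] = stats[bucket].get(t, 0) + 1
            save_stats(stats)
            return 0

        # Prediction mode: score the message (here a naive token-overlap
        # placeholder), add the prediction header, echo the revised message
        # to stdout, and signal the decision via the return code.
        spam_hits = sum(1 for t in tokens(msg) if t in stats["spam"])
        ham_hits = sum(1 for t in tokens(msg) if t in stats["ham"])
        is_spam = spam_hits > ham_hits
        msg["X-SPAM-PREDICTION"] = "TRUE" if is_spam else "FALSE"
        sys.stdout.write(msg.as_string())
        return 0 if is_spam else 1

    if __name__ == "__main__":
        sys.exit(main())

Invoked as "python3 filter.py < message.txt" (the file name filter.py is arbitrary),
the sketch trains silently on labeled messages and otherwise echoes the message with
the prediction header added.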
Suggested Approaches
- Naive Bayes Classifier. In class, we have studied the Naive Bayes
classifier. This is a good place to start. Your major design decisions will be what
tokens to analyze and whether to analyze the headers and the body of the email
separately. (A sketch of the scoring computation appears after this list.)
- Tokens. Follow the pointers on the class web page to "A Plan for Spam" and the
"CRM114" system; both suggest good ways of defining tokens.
- Tuning the classification threshold. Naive Bayes estimates
P(spam|message), but you must choose a cutoff for making your classification decision.
You can adjust that cutoff to try to get the false positives to zero. My evaluation
criterion for a spam filter is "Percent of false negatives when the false positives are
zero". That is, I suggest that you tune the filter to make no false positive errors and
then measure the number of false negatives.
- Training on public collections. There are some collections of spam and
non-spam email available on the web. You may use these for development and testing.
Example: LingSpam.
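To make the Naive Bayes and threshold suggestions concrete, here is one possible
scoring function, sketched in Python against the hypothetical statistics format used in
the skeleton above (toks would come from something like the tokens() helper there). The
decay function shows one simple way to meet the adaptation requirement; the decay
factor shown is illustrative, not prescribed.

    import math

    def nb_spam_probability(stats, toks):
        # Multinomial Naive Bayes with Laplace (add-one) smoothing, computed
        # in log space for numerical stability. `stats` holds per-class token
        # counts and message counts as in the sketch above.
        nspam = max(stats["nspam"], 1)
        nham = max(stats["nham"], 1)
        total_spam = sum(stats["spam"].values())
        total_ham = sum(stats["ham"].values())
        vocab_size = max(len(set(stats["spam"]) | set(stats["ham"])), 1)

        log_odds = math.log(nspam / nham)  # prior log-odds of spam
        for t in toks:
            p_t_spam = (stats["spam"].get(t, 0) + 1) / (total_spam + vocab_size)
            p_t_ham = (stats["ham"].get(t, 0) + 1) / (total_ham + vocab_size)
            log_odds += math.log(p_t_spam / p_t_ham)

        log_odds = max(min(log_odds, 30.0), -30.0)  # clamp to avoid overflow
        return 1.0 / (1.0 + math.exp(-log_odds))    # P(spam | message)

    def decay_counts(stats, factor=0.99):
        # One simple way to weight recent examples more heavily: multiply
        # every stored count by a factor slightly below 1 before each
        # training update, so old examples fade away exponentially.
        for bucket in ("spam", "ham"):
            stats[bucket] = {t: c * factor for t, c in stats[bucket].items()}
        stats["nspam"] *= factor
        stats["nham"] *= factor

In prediction mode you would compare nb_spam_probability(stats, toks) against a
user-adjustable cutoff; raising the cutoff from 0.5 toward 0.99 trades more false
negatives for fewer false positives, which is how you would tune toward the
zero-false-positive criterion described above.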
Using Existing Sources
- You must implement all of your code yourself. You may not copy verbatim code from
other packages.
- You may borrow ideas from other packages available on the web. You must cite any
ideas that you incorporate, giving full details (URL, authors' names).
Grading
The following criteria will be considered when assigning the project grade.
- Functionality (40 points). Does the program execute successfully and filter email?
- Code Quality and Documentation (20 points). Is the program well-documented
(internally and externally)?
- Accuracy (20 points). How accurate is the program? The TA will execute your
spam filter on a training set and a test set of email messages. Your submission can
include a "pre-trained" filter which will then be further trained by the TA as part of the
grading process.
- Novelty (20 points). Does the program include some innovations beyond the
basic Naive Bayes classifier and beyond publicly available open-source spam filters?
Alternative Projects
If you wish, you may propose an alternative project on a topic of your choice. Projects
should be either programming-oriented or research-oriented. For a programming project, I
am looking for software that incorporates significant reasoning or learning (or both).
For a research project, I am looking for a project that would involve reading at least 10
scientific papers and writing a paper (of at least 10 pages) that summarizes and
critically analyzes those papers.
Here are some examples of possible projects:
- Write a reinforcement learning program for playing Tetris. There are several
web-published programs that learn to play Tetris. You could implement and compare two
methods for this.
- Minesweeper. The textbook suggests a project to implement a Minesweeper agent.
This is quite similar to the Wumpus world and would require significant logical
reasoning. There is code available with the textbook that implements some forms of
first-order logical inference.
- Prediction Suffix Trees. In her PhD thesis, Dana Ron developed a data structure for
language modeling called the Prediction Suffix Tree (PST). It is a kind of variable-depth
Markov model that adaptively decides how far back in time to consider previous words in
order to predict the next word. It generally gives much better predictive accuracy than
simple n-gram language models. You could implement PSTs and train and test them on our
novels from Programming Assignment 2. (A toy sketch of the variable-memory idea appears
at the end of this page.) See The Power of Amnesia: Learning Probabilistic Automata
with Variable Memory Length by Dana Ron, Yoram Singer, and Naftali Tishby (Machine
Learning, 1996).
- Relational Probabilistic Models (RPMs/PRMs). In her PhD thesis, Lise Getoor
developed an extension of Bayesian networks that can handle relational data. This is a
step towards first-order probabilistic logics. I am currently teaching a class on this
subject, CS 539: Probabilistic Relational Models. The class web page has pointers to a
wide variety of papers on RPMs. You could read a subset of these and write a paper that
critically evaluates them (e.g., by showing examples of cases that they can and cannot
handle).
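For a flavor of the variable-memory idea behind PSTs (mentioned above), here is a toy
Python sketch. It is not Ron et al.'s learning algorithm, which additionally prunes
contexts using a statistical test; it simply counts next-word frequencies for contexts
of increasing length and, at prediction time, backs off to shorter contexts when the
longer ones have too little data. The function names and the min_count cutoff are
illustrative.

    from collections import defaultdict

    def train_vmm(words, max_depth=3):
        # Count next-word frequencies for every context of length 0..max_depth.
        counts = defaultdict(lambda: defaultdict(int))
        for i, w in enumerate(words):
            for d in range(0, max_depth + 1):
                if i - d < 0:
                    break
                counts[tuple(words[i - d:i])][w] += 1
        return counts

    def predict(counts, history, max_depth=3, min_count=5):
        # Use the longest suffix of the history that was seen often enough;
        # otherwise back off to a shorter context (down to the empty context).
        for d in range(min(max_depth, len(history)), -1, -1):
            nexts = counts.get(tuple(history[len(history) - d:]))
            if nexts and sum(nexts.values()) >= min_count:
                return max(nexts, key=nexts.get)
        return None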