CS434 Final Project

This project gives you an opportunity to work on a "real-world" machine learning problem and to explore the design choices involved in machine learning applications and research. It is intended to be more open-ended than your previous experimental assignments and to give you more hands-on experience with real-world machine learning problems.

Key dates:

Team and problem selection: Wednesday, Nov 5th 11:59PM. Form teams and select learning problems. Email me the names of your team members and a short (no more than one page) description of the proposed project.
Final report due: Friday, Dec 5th 11:59PM. No late submissions will be accepted.

What you need to do

Find a partner to work on this project. (Two person teams are encouraged, though you may work alone.)
Choose your application domain and your learning problem within it.
As a guideline, you will need to go through the following questions and make your decisions on each one.

How to formulate the learning problem. Is this a classification problem, a clustering problem, a frequent pattern mining problem, or a reinforcement learning problem? Are the data iid (independent and identically distributed) or are they sequential?
Feature design. How should the "raw" data be transformed into proper features (inputs) so that the data is suitable for machine learning? Should the data be aggregated in some way? Should the data be transformed so that it has a zero mean and unit variance? Can we apply dimension reduction / feature subset selection to improve learning performance? If so, how can we go about it?
Algorithm choice. What learning algorithms would be appropriate for this problem? Factors to consider: data set size, noise level, continuous versus discrete features, missing values, supervised versus unsupervised.
Algorithm tuning. If the algorithm has user-set parameters, what strategy should be used for setting them?
Overfitting Avoidance. Is there a risk of overfitting? If so, what overfitting avoidance methods should be applied? How should they be tuned?
Performance criterion. How should performance be measured? Error rate? Expected misclassification cost? Cross-validated likelihood?
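
To make a few of these questions concrete, here is a minimal sketch (not a required approach) of how zero-mean/unit-variance scaling, cross-validation, error rate, and expected misclassification cost can fit together. Everything in it is an assumption made for illustration: the function names are hypothetical, the nearest-centroid classifier is only a stand-in for whatever algorithm you choose, and the cost table and toy data are made up. It uses only the Python standard library.

```python
import random
import statistics

def fit_standardizer(rows):
    """Return per-feature (means, stds) computed from the given rows."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # A constant feature has std 0; use 1.0 so the division is a no-op.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds

def standardize(rows, means, stds):
    """Rescale each feature to zero mean and unit variance."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def nearest_centroid_fit(rows, labels):
    """Placeholder learner: one mean vector (centroid) per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*members)]
    return centroids

def nearest_centroid_predict(centroids, row):
    """Predict the class whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def cross_validate(rows, labels, k=5, cost=None, seed=0):
    """Return (error rate, mean misclassification cost) over k folds.

    cost maps (true_label, predicted_label) to a penalty; if it is
    omitted, every error costs 1 and the two numbers coincide.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors, total_cost = 0, 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Fit the scaler on the training folds only, so no information
        # from the held-out fold leaks into preprocessing.
        means, stds = fit_standardizer([rows[i] for i in train])
        x_train = standardize([rows[i] for i in train], means, stds)
        model = nearest_centroid_fit(x_train, [labels[i] for i in train])
        for i in fold:
            x = standardize([rows[i]], means, stds)[0]
            pred = nearest_centroid_predict(model, x)
            if pred != labels[i]:
                errors += 1
                total_cost += cost[(labels[i], pred)] if cost else 1.0
    n = len(rows)
    return errors / n, total_cost / n

# Toy usage with made-up data: two classes and two features on very
# different scales. The cost table is hypothetical.
rng = random.Random(1)
rows = [[rng.gauss(0, 1), rng.gauss(100, 10)] for _ in range(20)]
rows += [[rng.gauss(8, 1), rng.gauss(100, 10)] for _ in range(20)]
labels = ["a"] * 20 + ["b"] * 20
costs = {("a", "b"): 5.0, ("b", "a"): 1.0}
print(cross_validate(rows, labels, k=5, cost=costs))
```

The design point worth noting is that the scaling parameters are re-estimated inside each fold. Computing them once on the full data set would let the held-out examples influence preprocessing and can make the estimate optimistic.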

Perform the work, run the experiments!
Turn in a short report (no more than 5 pages). Each team should turn in a single report; please email it to me before the deadline. Your report should precisely describe the following:

The application domain
The formulation of your learning task(s)
The data collection process
A precise description of your approach and the design choices that you made. For example, what preprocessing steps are involved? What features did you use? What algorithm did you choose and why? What software package was used and what did you program yourself? NOTE: there are no restrictions on using existing software packages, and no restrictions on the programming language you use if you decide you need to write your own code.

Your experiments, evaluation results, and any conclusions you draw from them.

The clarity and content of the report will have a primary impact on your grade. The report must not be more than 6 pages in 10-point font, including figures and tables.

Grading and determining when you have done enough

A project that does a solid job of building a base learning system and carefully evaluating and describing it might get 75–80% credit. A solid base learning system requires an appropriate learning task formulation, preprocessing, and the application and evaluation of one or more existing learning algorithms. If you have trouble determining whether you have enough for a base learning system, contact the instructor to clarify your specific case.

A project that also pursues interesting extensions or alternatives, investigates important issues (such as different feature extraction and selection methods, how to handle overfitting, noise tolerance, etc.), or achieves impressive results might get 90–100% credit. Weight will also be given to how interesting and novel the learning task is.

Be creative! Exploring your own interesting ideas and comparing them with baseline approaches will receive credit whether or not they beat the baseline.

Some Example Learning Problems

Text Classification and Clustering
- Spam Prediction
- Email Folder or Tag Prediction
- Newsgroup document classifier
- Sentiment analysis
- Author classifier (e.g. take LaTeX files from different authors and try to classify them by author)
Computer Vision Recognition Tasks
- Optical Character Recognition
- Face recognition tasks
- Scene recognition (e.g. house vs. no house)
- Object recognition in aerial/satellite images
- Image segmentation
- etc
Audio Recognition Tasks
- Speaker identity
- Speaker sentiment
- Music genre
- Recognition of bird species based on bird songs
- etc
Prediction Tasks from Games
- Move predictor for various board/card games (e.g. chess, go, checkers, solitaire)
- Action predictor for video games (e.g. learning to predict pac-man movements based on demonstrations)
- Fold predictor for poker
Predict Outcomes of Sports Games
Predict Movie Ratings
Predict Stock Values
Predict Selling Prices in Auctions