Find a partner to work on this project. (Two-person teams are encouraged, though you may work alone.)
Choose your application domain and your learning problem within it.
As a guideline, you will need to go through the following questions and make a decision on each one.
Feature design. How should the "raw" data be transformed into proper features (inputs) so that the data is suitable for machine learning? Should the data be aggregated in some way? Should the data be transformed so that it has a zero mean and unit variance? Can we apply dimension reduction / feature subset selection to improve learning performance? If so, how can we go about it?
Algorithm choice. What learning algorithms would be appropriate for this problem? Factors to consider: data set size, noise level, continuous versus discrete features, missing values, supervised versus unsupervised learning.
Overfitting avoidance. Is there a risk of overfitting? If so, what overfitting avoidance methods should be applied? How should they be tuned?
Performance criterion. How should performance be measured? Error rate? Expected misclassification cost? Cross-validated likelihood?
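To make a few of the questions above concrete, here is a minimal sketch in plain Python of three of the ideas they mention: rescaling features to zero mean and unit variance, splitting data into folds for cross-validation, and computing an error rate or an expected misclassification cost. The function names and the cost-table layout are assumptions made for illustration; they are not a required interface, and any existing software package may be used instead.

```python
import math


def standardize(rows):
    """Rescale each feature column to zero mean and unit variance.

    Uses the population standard deviation (divide by n).
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / n)
        for c, m in zip(cols, means)
    ]
    # A constant column has std 0; map it to 0.0 instead of dividing by zero.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Folds are contiguous and unshuffled (an assumption of this sketch);
    shuffle the data first if its order is not random.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop


def error_rate(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true one."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def expected_cost(y_true, y_pred, cost):
    """Average misclassification cost per example.

    `cost` maps (true_label, predicted_label) to a cost; pairs that are
    missing from the table, including correct predictions, cost 0.
    """
    total = sum(cost.get((t, p), 0.0) for t, p in zip(y_true, y_pred))
    return total / len(y_true)
```

In an actual experiment, the means and standard deviations should be computed on the training fold only and then applied to the test fold; otherwise information from the test data leaks into training. The expected cost is worth reporting alongside the error rate when different kinds of mistakes are not equally serious.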
Perform the work, run the experiments!
Turn in a short report (no more than 5 pages).
Each team should turn in a single report; please email it to me before the deadline. Your report should precisely describe the following:
The application domain
The formulation of your learning task(s)
A precise description of your approach and the design choices that you made. For example, what preprocessing steps are involved? What features did you use? What algorithm did you choose and why? What software package was used and what was programmed by you? NOTE: there are no restrictions on using existing software packages, and no restrictions on which programming language you use if you decide you need to code your own.
Be creative! Exploring your own interesting ideas and comparing them with the baseline approaches will receive credit whether or not they beat the baseline.