CS534 Homework 8: Due Friday June 3, 9:00am

This assignment will take more time than the previous homework assignments. The purpose of this assignment is to give you experience with a "raw" data set. With a raw data set, you must consider the following:

How to formulate the learning problem. Is it a classification problem or a regression problem? Are the data iid (independent and identically distributed) or are they sequential, spatial, or relational? Are all of the data labeled, or are some unlabeled?
Performance criterion. How should performance be measured? Error rate? Precision and Recall? AUC? Expected misclassification cost? What loss function should be used for training the learning algorithm?
Feature design. How should the raw features be transformed so that the data are suitable for machine learning? Should the data be aggregated in some way? Should the data be transformed so that they have a more Gaussian distribution?
Algorithm choice. Which learning algorithms would be best for this problem? Factors to consider: data set size, noise level, continuous versus discrete features, missing values, and whether unlabeled data are available (semi-supervised learning).
Algorithm tuning. If the algorithm has user-set parameters, what strategy should be used for setting them?
Overfitting Avoidance. Is there a risk of overfitting? If so, what overfitting avoidance methods should be applied? How should they be tuned?

Data Sets

For the assignment, you can choose one of the following three data sets:

Hyphenation Data. The problem is to construct a program for dividing English words into syllables by hyphenation. Such programs are used by word-processing programs to decide where to insert hyphens into words.
BodyMedia Data. The problem is to predict when a person is performing a particular activity based on 9 physiological readings taken from a wearable sensor.

TaskTracer Data. The problem is to predict which task a user is executing at the desktop.

If you would like to study some other data set, please send me email immediately and we can discuss it. I hope to have a third data set from the TaskTracer project which I will make available later this week.

Hyphenation Problem

I have placed two data files in /usr/local/classes/eecs/spring2005/cs534/weka/data: hyphen.train and hyphen.eval. These are not in ARFF format. Each file has one word per line. Each word contains a hyphen at every point where it is legal to insert a hyphen. Your task is to create a program that can take a file of new words and output those words (one per line) with hyphens inserted in all of the legal places.
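As a rough illustration of the file format described above, the following Python sketch converts one hyphenated word into its letters plus a 0/1 label for each gap between adjacent letters, and converts predictions back into the hyphenated output format. The function names and the per-gap label encoding are assumptions made for this example, and the sample word is not taken from the data files.

```python
def parse_hyphenated(word):
    """Return (letters, labels) for one line of a hyphenated word file.

    labels has one entry per gap between adjacent letters:
    labels[i] == 1 means a hyphen is legal between letters[i] and letters[i + 1].
    """
    letters = []
    labels = []
    for ch in word.strip():
        if ch == "-":
            if labels:
                labels[-1] = 1  # mark the gap after the previous letter
        else:
            letters.append(ch)
            labels.append(0)  # tentatively "no hyphen" after this letter
    if labels:
        labels.pop()  # there is no gap after the last letter
    return "".join(letters), labels


def insert_hyphens(letters, labels):
    """Inverse of parse_hyphenated: rebuild the hyphenated word."""
    out = []
    for i, ch in enumerate(letters):
        out.append(ch)
        if i < len(labels) and labels[i] == 1:
            out.append("-")
    return "".join(out)
```

With this encoding, a learned classifier only has to predict one binary label per gap, and insert_hyphens turns those predictions into a line in the same format as the training and evaluation files.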

I constructed this data set myself, so it may contain errors. I have checked most of the words against a dictionary, but I may have missed some, and I may also have made typing errors. If you find training examples that seem to be wrong, please let me know.

On Wednesday, June 1, I will make available a test data set that does not have any hyphens in it. You should run your program on this data to predict the hyphen locations and submit a file to me that contains your predictions (in exactly the same format as the training and evaluation files).

BodyMedia Problem

You can read more about the BodyMedia data by visiting the web site http://www.cs.utexas.edu/users/sherstov/pdmc. This describes a competition using this data and a workshop that will be held at the International Conference on Machine Learning in July.

The BodyMedia data were collected from several users. Each user wore the BodyMedia device, which measures a set of 9 variables every minute. The users also kept a log in which they indicated what activity they were performing at each point in time. However, they often forgot to assign an activity. For the competition, two of these activities have been chosen for prediction. BodyMedia is not telling us what these activities are, so we will call them Context 1 and Context 2. I have provided separate data files for each context. The data files have the following naming pattern: context-usernn-type.arff, where

context is either c1 or c2, indicating which of the two prediction problems the file relates to.
usernn is the identifier for the particular user, where nn is one of 01, 05, 06, 15, and 25. I chose these five users because they have the most data available. You may want to try training classifiers for individual users or classifiers that are supposed to generalize across users.
type is one of "train" or "eval", which indicates the training data set or the evaluation (holdout) data set. On June 2, I will make a separate test set available to you.

The class label on these examples is one of

0: The user was definitely not performing the target activity during this minute
1: The user was performing the target activity
?: It is unknown whether the user was performing the target activity or not.
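Because some minutes carry the label "?", one early decision is whether to discard those examples or to use them as unlabeled data. The Python sketch below separates the two groups. It assumes that the class label is the last comma-separated field of each row in the @data section of the ARFF file and that the files use the dense (non-sparse) ARFF row format; both are assumptions to check against the actual files, and the function name is hypothetical.

```python
def split_by_label(arff_lines):
    """Split ARFF data rows into labeled and unlabeled examples.

    Assumes the class label (0, 1, or ?) is the last field of each row.
    Returns (labeled, unlabeled), where labeled is a list of
    (feature_strings, int_label) pairs and unlabeled is a list of
    feature_strings lists.
    """
    labeled, unlabeled = [], []
    in_data = False
    for line in arff_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            if line.lower().startswith("@data"):
                in_data = True  # everything after this is a data row
            continue
        fields = [f.strip() for f in line.split(",")]
        features, label = fields[:-1], fields[-1]
        if label == "?":
            unlabeled.append(features)
        else:
            labeled.append((features, int(label)))
    return labeled, unlabeled
```

The unlabeled rows could then be dropped, used by a semi-supervised method, or kept in place so that the time ordering of the minutes is preserved for a sequential model.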

TaskTracer Problem

TaskTracer is a project led by Jon Herlocker and me with the goal of supporting multi-tasking knowledge workers under Microsoft Windows. The central hypothesis of TaskTracer is that at each point in time, the user is working on one "task" or "activity". The user defines a set of activities and provides training data telling the system which activity he/she is currently working on at all times. TaskTracer then attempts to learn a classifier that can predict the current task given information about the window that is currently in focus.

The training data consists of a long sequence of examples. Each training example corresponds to a period of time in which one window was "in focus" and was open on one document. Each feature corresponds to a word that appears in the title of the window or in the path name of the file.
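To make the feature representation concrete, here is a minimal Python sketch of how a window title and a file path might be turned into one 0/1 feature per vocabulary word. The tokenization rule (lowercase runs of letters and digits), the function name, and the example title, path, and vocabulary are all assumptions for illustration; the provided ARFF files already contain the features, so this is only meant to show what they represent.

```python
import re


def unary_word_features(window_title, file_path, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the title or path.

    Words are taken to be lowercase runs of letters and digits (an assumption).
    """
    text = (window_title + " " + file_path).lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    return [1 if w in words else 0 for w in vocabulary]
```

For example, with the hypothetical vocabulary ["cs534", "hw8", "budget", "word"], a window titled "CS534 Homework 8 - Microsoft Word" open on a file named hw8.doc in a cs534 folder would yield [1, 1, 0, 1].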

The data are available in /usr/local/classes/eecs/spring2005/cs534/weka/data:

Training set: UnaryTrain983.arff
Development set: UnaryDevelop983.arff
Test set (to be released later): UnaryTest983.arff

The intended application of this classifier is to signal to the user when the user has forgotten to update the current task. This means that incorrect predictions are very costly, whereas rejections (cases where the classifier declines to make a prediction) are probably not too serious.
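One common way to trade wrong predictions for rejections is to predict only when the classifier is sufficiently confident. The Python sketch below assumes the classifier outputs a probability for each task; the threshold value, the function names, and the use of None to represent a rejection are assumptions for this example, and the threshold would need to be tuned on the development set.

```python
def predict_with_reject(class_probs, threshold=0.9):
    """Return the most probable task, or None (reject) if it is not confident enough.

    class_probs maps each task name to its predicted probability.
    """
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None


def coverage_and_precision(predictions, truths):
    """Summarize a list of predictions in which None means 'rejected'.

    coverage  = fraction of examples on which a prediction was made
    precision = fraction of those predictions that were correct
                (None if every example was rejected)
    """
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    coverage = len(answered) / len(predictions) if predictions else 0.0
    if not answered:
        return coverage, None
    correct = sum(1 for p, t in answered if p == t)
    return coverage, correct / len(answered)
```

Raising the threshold lowers coverage but should raise precision, so plotting one against the other for several thresholds is a reasonable way to choose an operating point.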

Tools To Consider

In addition to writing your own programs, you may wish to consider the following software packages:

Mirage. This is a data visualization package developed by Tin Kam Ho at Bell Labs. It is not a machine learning system, but it may be very helpful for visualizing the BodyMedia data and developing features.
WEKA. For hyphenation, you could convert each word into a set of sliding-window examples, write these examples in ARFF format, and then apply standard supervised learning algorithms.
WEKA Recurrent Sliding Windows (RSW) package. This was written by my former student Saket Joshi, and it provides support for sliding windows and recurrent sliding windows (which we will discuss in class).
MALLET. This is a Java package for fitting Conditional Random Fields to sequential data. CRFs should work well for the hyphenation data.
BNT. This is the Bayes Net Toolkit for Matlab. It supports Hidden Markov Models, which might be a very good approach to the BodyMedia data.
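The sliding-window conversion mentioned under WEKA above could be sketched in Python as follows. Each gap between two adjacent letters becomes one example whose features are the letters on either side of the gap. The window width, the padding character, the function name, and the per-gap label encoding (one 0/1 label per gap, 1 meaning a hyphen is legal there) are assumptions for this example, not part of any of the packages listed.

```python
def sliding_windows(letters, labels, half_width=3, pad="_"):
    """Turn one word into one fixed-width example per gap between letters.

    letters: the word without hyphens
    labels:  labels[i] is 1 if a hyphen is legal between letters[i] and letters[i + 1]
    Each example is (window, label), where window holds the half_width letters
    to the left of the gap followed by the half_width letters to its right,
    padded with `pad` beyond the ends of the word.
    """
    padded = pad * (half_width - 1) + letters + pad * (half_width - 1)
    examples = []
    for i, label in enumerate(labels):
        # In `padded`, the left context of gap i starts at index i.
        window = padded[i:i + 2 * half_width]
        examples.append((list(window), label))
    return examples
```

Each window position can then be written out as one nominal attribute, with the label as the class attribute. Note that examples from the same word are not independent, which is one reason to compare this approach with the sequential methods listed above.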

What to Turn In

On or before the due date, you should send me an email message with the following:

Tell me which data set you chose to study.
A description of your approach: How did you formulate the problem? What features did you use? What loss function? What algorithm did you choose and why?
A brief description of any other methods that you tried and why you decided not to pursue them.
If you chose the hyphenation problem, send me the hyphenated test data. You should also send me your predicted false positive and false negative rates and 95% confidence intervals on them. I will score your predictions and let you know how you did.
If you chose the BodyMedia problem, report how well you did training and testing on each of the five users for each of the two problems. Report 95% confidence intervals for false positive and false negative rates.
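For the confidence intervals requested above, one simple option is the normal approximation to the binomial distribution, applied to counts measured on held-out data. The Python sketch below uses that approximation; it is only one possible choice (it is unreliable when the counts are small or the rate is near 0 or 1, and it treats examples as independent), and the function names are hypothetical.

```python
import math


def rate_with_ci(errors, total, z=1.96):
    """Return (rate, lower, upper) using the normal approximation.

    z = 1.96 corresponds to a two-sided 95% confidence interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def fp_fn_rates(y_true, y_pred):
    """Return ((fp_rate, lo, hi), (fn_rate, lo, hi)) for 0/1 labels.

    False positive rate = FP / (FP + TN); false negative rate = FN / (FN + TP).
    """
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return rate_with_ci(fp, fp + tn), rate_with_ci(fn, fn + tp)
```

For instance, 10 false positives among 100 true negatives-or-false-positives gives a rate of 0.10 with an interval of roughly 0.04 to 0.16. For the hyphenation problem, y_true and y_pred would hold one entry per gap between letters; for BodyMedia, one entry per labeled minute.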