CS534 Homework 8: Due Friday June 3, 9:00am
This assignment will take more time than the previous homework
assignments. The purpose of this assignment is to give you experience
with a "raw" data set. With a raw data set, you must consider the
following issues:
- How to formulate the learning problem. Is it a classification
problem or a regression problem? Are the data iid (independent and
identically-distributed) or are they sequential, spatial, or
relational? Is all of the data labeled, or is some of it unlabeled?
- Performance criterion. How should performance be
measured? Error rate? Precision and Recall? AUC? Expected
misclassification cost? What loss function should be used for
training the learning algorithm?
- Feature design. How should the raw features be transformed so
that the data is suitable for machine learning? Should the data be
aggregated in some way? Should the data be transformed so that it has
a more Gaussian distribution?
- Algorithm choice. Which learning algorithms would be best for
this problem? Factors to consider: data set size, noise level,
continuous versus discrete features, missing values, semi-supervised.
- Algorithm tuning. If the algorithm has user-set
parameters, what strategy should be used for setting them?
- Overfitting Avoidance. Is there a risk of overfitting?
If so, what overfitting avoidance methods should be applied? How
should they be tuned?
For the assignment, you can choose one of the following three data sets:
- TaskTracer Data. The problem is to predict which task
a user is executing at the desktop.
- Hyphenation Data. The problem is to construct a program
for dividing English words into syllables by hyphenation. Such
programs are used by word-processing programs to decide where to
insert hyphens into words.
- BodyMedia Data. The problem is to predict when a
person is performing a particular activity based on 9 physiological
readings taken from a wearable sensor.
If you would like to study some other data set, please send me email
immediately and we can discuss it. I hope to have a third data set
from the TaskTracer project which I will make available later this
Hyphenation Data
I have placed two data files in
hyphen.eval. These are not
in arff format. Each file has one word per line. The word contains a
hyphen at each point where it is legal to insert a hyphen. Your task
is to create a program that can take a file of new words and output
those words (one per line) with hyphens inserted in all of the legal
places.
I constructed this dataset myself, so it may contain errors. I have
checked most of the words against a dictionary, but I may have missed
some and I may have made typing errors also. If you find training
examples that seem to be wrong, please let me know.
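One way to work with this format is to convert each line into a per-letter labeling and back. The Python sketch below assumes that a hyphen never appears at the start or end of a word. The function names and the example word are illustrative only and are not taken from the data files.

```python
def split_hyphenated(entry):
    """Turn 'hy-phen-ate' into ('hyphenate', labels).

    labels[i] is 1 if a hyphen is legal immediately after letter i, else 0.
    """
    word = entry.replace("-", "")
    labels = []
    for ch in entry:
        if ch == "-":
            if labels:              # ignore a stray leading hyphen
                labels[-1] = 1
        else:
            labels.append(0)
    return word, labels


def insert_hyphens(word, labels):
    """Inverse of split_hyphenated: rebuild the hyphenated line."""
    pieces = []
    for ch, flag in zip(word, labels):
        pieces.append(ch)
        if flag:
            pieces.append("-")
    return "".join(pieces)


if __name__ == "__main__":
    word, labels = split_hyphenated("hy-phen-ate")
    print(word, labels)
    print(insert_hyphens(word, labels))
```

With this representation, predicting hyphen locations becomes a per-letter binary labeling problem, and the second function writes predictions back out in the same one-word-per-line format.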
On Wednesday, June 1, I will make available a test data set that does
not have any hyphens in it. You should run your program on this data
to predict the hyphen locations and submit a file to me that contains
your predictions (in exactly the same format as the training and
evaluation files).
BodyMedia Data
You can read more about the BodyMedia data by visiting the web site http://www.cs.utexas.edu/users/sherstov/pdmc.
This describes a competition using this data and a workshop that will
be held at the International Conference on Machine Learning in July.
The BodyMedia data consist of data collected from several users.
Each user wore the BodyMedia device that measures a set of 9 variables
every minute. The users also kept a log where they indicated what
activity they were performing at each point in time. However, often
they forgot to assign an activity. For the competition, two of these
activities have been chosen for prediction. BodyMedia is not telling
us what these activities are, so we will call them Context 1 and
Context 2. I have provided separate data files for each context.
The data files have the following naming pattern: context-usernn-type.arff
- context is a prefix such as c2- indicating which of the two
prediction problems the file relates to.
- usernn is the identifier for the particular user,
where nn is one of 01, 05, 06, 15, and 25. I chose these five users
because they have the most data available. You may want to try
training classifiers for individual users or classifiers that are
supposed to generalize across users.
- type is one of "train" or "eval", which indicates
the training data set or the evaluation (holdout) data set. On June
2, I will make a separate test set available to you.
The class label on these examples is one of
- 0: The user was definitely not performing the target activity
during this minute.
- 1: The user was performing the target activity.
- ?: It is unknown whether the user was performing the target
activity or not.
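Because some minutes carry the label "?", one early decision is how to separate labeled from unlabeled examples. The Python sketch below assumes each example has already been read into a (features, label) pair, with the label kept as one of the strings "0", "1", or "?". The function name and the tiny example rows are hypothetical.

```python
def split_by_label(examples):
    """Separate examples with a known class from those labeled '?'.

    examples: iterable of (features, label) pairs, label in {"0", "1", "?"}.
    Returns (labeled, unlabeled), where labeled holds (features, int_label)
    pairs and unlabeled holds only the feature vectors.
    """
    labeled, unlabeled = [], []
    for features, label in examples:
        if label == "?":
            unlabeled.append(features)
        else:
            labeled.append((features, int(label)))
    return labeled, unlabeled


if __name__ == "__main__":
    rows = [([0.1, 0.2], "0"), ([0.3, 0.4], "?"), ([0.5, 0.6], "1")]
    labeled, unlabeled = split_by_label(rows)
    print(len(labeled), "labeled;", len(unlabeled), "unlabeled")
```

The unlabeled portion can simply be set aside for a purely supervised approach, or kept for a semi-supervised method.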
TaskTracer Data
TaskTracer is a project led by Jon Herlocker and myself with the goal
of supporting multi-tasking knowledge workers under Microsoft
Windows. The central hypothesis of TaskTracer is that at each point
in time, the user is working on one "task" or "activity". The user
defines a set of activities and provides training data telling the
system which activity he/she is currently working on at all times.
TaskTracer then attempts to learn a classifier that can predict the
current task given information about the window that is currently in
focus.
The training data consists of a long sequence of examples. Each
training example corresponds to a period of time in which one window
was "in focus" and was open on one document. Each feature corresponds
to a word that appears in the title of the window or in the path name
of the file.
The data are available in
- Training set:
- Development set:
- Test set (to be released later):
The intended application of this classifier will be to signal to the
user when the user has forgotten to update the current task. This
means that false predictions are very bad, whereas rejections are
probably not too serious.
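One simple way to trade false predictions for rejections is to predict only when the classifier is confident. The Python sketch below assumes the classifier outputs a probability for each task; the threshold value, function names, and task names are hypothetical.

```python
def predict_or_reject(class_probs, threshold=0.9):
    """Return the most probable task, or None (a rejection) if the
    classifier is not confident enough.

    class_probs: dict mapping task name -> estimated probability.
    """
    best_task = max(class_probs, key=class_probs.get)
    if class_probs[best_task] >= threshold:
        return best_task
    return None


def coverage_and_precision(predictions, truths):
    """Coverage = fraction of examples not rejected.
    Precision = fraction of non-rejected predictions that are correct."""
    accepted = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    coverage = len(accepted) / len(predictions) if predictions else 0.0
    if not accepted:
        return coverage, None
    correct = sum(1 for p, t in accepted if p == t)
    return coverage, correct / len(accepted)


if __name__ == "__main__":
    probs = [{"email": 0.95, "report": 0.05}, {"email": 0.6, "report": 0.4}]
    preds = [predict_or_reject(p) for p in probs]
    print(preds, coverage_and_precision(preds, ["email", "report"]))
```

Raising the threshold lowers coverage but should reduce false predictions, so the threshold is a natural parameter to tune on the development set.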
Tools To Consider
In addition to writing your own programs, you may wish to consider the
following software packages:
- A data visualization package developed by Tin Kam Ho at Bell
Labs. It is not a machine learning system, but it may be very helpful
for visualizing the BodyMedia data and developing features.
- WEKA. For hyphenation, you could just convert this into some
form of sliding window data, convert to ARFF format, and then apply
standard supervised learning algorithms.
- Recurrent Sliding Windows (RSW) package. This was written by my
former student Saket Joshi, and it provides support for sliding
windows and recurrent sliding windows (which we will discuss in
class).
- MALLET. This is a
Java package for fitting Conditional Random Fields to sequential
data. CRFs should work well for the hyphenation data.
- Bayes Net Toolkit for Matlab. It supports Hidden Markov
Models, which might be a very good approach to the BodyMedia data.
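The sliding-window conversion suggested for the hyphenation data in the WEKA item can be sketched in Python as follows. The window half-width, the padding character, and the function name are assumptions chosen for illustration; each returned row could then be written out as one ARFF instance with nominal letter attributes.

```python
def sliding_windows(word, labels, half_width=2, pad="_"):
    """Build one training row per letter of the word.

    Each row is (window_letters, label): window_letters holds the letter
    itself plus half_width letters of context on each side, and label is
    1 if a hyphen is legal immediately after the center letter.
    Positions beyond the ends of the word are filled with the pad character.
    """
    padded = pad * half_width + word + pad * half_width
    width = 2 * half_width + 1
    rows = []
    for i, label in enumerate(labels):
        window = padded[i:i + width]   # centered on word[i]
        rows.append((list(window), label))
    return rows


if __name__ == "__main__":
    for letters, label in sliding_windows("cat", [0, 0, 0], half_width=1):
        print(",".join(letters), label)
```

A wider window gives the learner more context at the cost of more features, so the half-width is itself a parameter worth tuning.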
What to Turn In
On or before the due date, you should send me an email message with
the following:
- The data set you chose to study.
- A description of your approach: How did you formulate the
problem? What features did you use? What loss function?
What algorithm did you choose and why?
- A brief description of any other methods that you tried and why
you decided not to pursue them.
- If you chose the hyphenation problem, send me the hyphenated test
data. You should also send me your predicted false positive and false
negative rates and 95% confidence intervals on them. I will score your
predictions and let you know how you did.
- If you chose the BodyMedia problem, report how well you did
training and testing on each of the five users for each of the two
problems. Report 95% confidence intervals for false positive and
false negative rates.
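For the 95% confidence intervals requested above, one reasonable choice is the Wilson score interval for a binomial proportion. The Python sketch below is one option under that assumption, not a required method; it treats the examples as independent, which may be optimistic for sequential data. The function names are illustrative.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a proportion errors/n.

    z = 1.96 corresponds to a 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def error_rate_intervals(tp, fp, tn, fn):
    """False positive rate = fp / (fp + tn);
    false negative rate = fn / (fn + tp)."""
    return {
        "false_positive": wilson_interval(fp, fp + tn),
        "false_negative": wilson_interval(fn, fn + tp),
    }


if __name__ == "__main__":
    print(error_rate_intervals(tp=90, fp=5, tn=95, fn=10))
```

The counts come from comparing predictions with the true labels on held-out data; for hyphenation, a "positive" would be a position where a hyphen is predicted.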