TaskTracer Data. The problem is to predict which task
a user is executing at the desktop.
If you would like to study some other data set, please send me email
immediately and we can discuss it. I hope to have a third data set
from the TaskTracer project which I will make available later this
week.
Hyphenation Problem
I have placed two data files in /usr/local/classes/eecs/spring2005/cs534/weka/data:
hyphen.train and hyphen.eval. These are not
in ARFF format. Each file has one word per line. The word contains a
hyphen at each point where it is legal to insert a hyphen. Your task
is to create a program that can take a file of new words and output
those words (one per line) with hyphens inserted in all of the legal
places.
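One way to turn this into a supervised learning problem is to label each letter with whether a hyphen may follow it. The sketch below is only an illustration, not a required approach: the function names are hypothetical, and the word "hy-phen-ate" is a made-up example rather than a line taken from the data files.

```python
def split_hyphenated(word):
    """Strip hyphens; labels[i] is 1 if a hyphen may follow letters[i]."""
    letters = []
    labels = []
    for ch in word:
        if ch == "-":
            if labels:              # ignore a stray leading hyphen
                labels[-1] = 1
        else:
            letters.append(ch)
            labels.append(0)
    return "".join(letters), labels


def insert_hyphens(letters, labels):
    """Inverse of split_hyphenated: rebuild the hyphenated word."""
    out = []
    for ch, label in zip(letters, labels):
        out.append(ch)
        if label == 1:
            out.append("-")
    return "".join(out)


letters, labels = split_hyphenated("hy-phen-ate")
print(letters, labels)
print(insert_hyphens(letters, labels))
```

Writing predictions back out with insert_hyphens keeps the output in the same one-word-per-line format as the training and evaluation files.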
I constructed this dataset myself, so it may contain errors. I have
checked most of the words against a dictionary, but I may have missed
some and I may have made typing errors also. If you find training
examples that seem to be wrong, please let me know.
On Wednesday, June 1, I will make available a test data set that does
not have any hyphens in it. You should run your program on this data
to predict the hyphen locations and submit a file to me that contains
your predictions (in exactly the same format as the training and
evaluation files).
BodyMedia Problem
You can read more about the BodyMedia data by visiting the web site http://www.cs.utexas.edu/users/sherstov/pdmc.
This describes a competition using this data and a workshop that will
be held at the International Conference on Machine Learning in July.
The BodyMedia data consist of data collected from several users.
Each user wore the BodyMedia device that measures a set of 9 variables
every minute. The users also kept a log where they indicated what
activity they were performing at each point in time. However, often
they forgot to assign an activity. For the competition, two of these
activities have been chosen for prediction. BodyMedia is not telling
us what these activities are, so we will call them Context 1 and
Context 2. I have provided separate data files for each context.
The data files have the following naming pattern: context-usernn-type.arff
where
- context is either c1- or c2-, indicating which of the two prediction problems the
file relates to.
- usernn is the identifier for the particular user,
where nn is one of 01, 05, 06, 15, and 25. I chose these five users
because they have the most data available. You may want to try
training classifiers for individual users or classifiers that are
supposed to generalize across users.
- type is one of "train" or "eval", which indicates
the training data set or the evaluation (holdout) data set. On June
2, I will make a separate test set available to you.
The class label on these examples is one of:
- 0: The user was definitely not performing the target activity
during this minute
- 1: The user was performing the target activity
- ?: It is unknown whether the user was performing the target
activity or not.
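Because some minutes carry the label ?, you will need to decide what to do with them, for example setting them aside before training a standard supervised learner. The sketch below shows one way to do that. It assumes each example has already been read into a list of strings with the class label as the last value, which is an assumption about the file layout that you should check against the ARFF header. The function name and the sample rows are hypothetical.

```python
def partition_by_label(rows):
    """Split rows into labeled (features, int label) pairs and unlabeled rows.

    Assumes the class label is the last value in each row and is one of
    "0", "1", or "?".
    """
    labeled = []
    unlabeled = []
    for row in rows:
        label = row[-1].strip()
        if label == "?":
            unlabeled.append(row[:-1])
        else:
            labeled.append((row[:-1], int(label)))
    return labeled, unlabeled


# Made-up rows with two feature values each, for illustration only.
rows = [["0.1", "0.2", "0"], ["0.3", "0.4", "?"], ["0.5", "0.6", "1"]]
labeled, unlabeled = partition_by_label(rows)
print(len(labeled), len(unlabeled))
```

Keeping the unlabeled rows rather than discarding them leaves the door open for sequence models, which can use the unlabeled minutes as context.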
TaskTracer Problem
TaskTracer is a project led by Jon Herlocker and myself with the goal
of supporting multi-tasking knowledge workers under Microsoft
Windows. The central hypothesis of TaskTracer is that at each point
in time, the user is working on one "task" or "activity". The user
defines a set of activities and provides training data telling the
system which activity he/she is currently working on at all times.
TaskTracer then attempts to learn a classifier that can predict the
current task given information about the window that is currently in
focus.
The training data consists of a long sequence of examples. Each
training example corresponds to a period of time in which one window
was "in focus" and was open on one document. Each feature corresponds
to a word that appears in the title of the window or in the path name
of the file.
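To make the feature representation concrete, here is a rough sketch of how a set of word features could be derived from a window title and a path name. The tokenization rule (lowercase runs of letters and digits) is my assumption for illustration and is not necessarily how the provided ARFF files were built. The title and path in the example are invented.

```python
import re


def word_features(window_title, path_name):
    """Return the set of lowercase word tokens in the title and path.

    Each token would correspond to one binary ("unary") feature that is
    present for this example.
    """
    text = f"{window_title} {path_name}".lower()
    return set(re.findall(r"[a-z0-9]+", text))


# Invented example values, not taken from the data.
print(sorted(word_features("Report.doc - Microsoft Word", r"C:\cs534\Report.doc")))
```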
The data are available in /usr/local/classes/eecs/spring2005/cs534/weka/data:
- Training set: UnaryTrain983.arff
- Development set: UnaryDevelop983.arff
- Test set (to be released later): UnaryTest983.arff
The intended application of this classifier will be to signal to the
user when the user has forgotten to update the current task. This
means that false predictions are very bad, whereas rejections are
probably not too serious.
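One simple way to trade wrong predictions for rejections is to predict only when the classifier is sufficiently confident. The sketch below assumes your classifier can produce a probability for each task. The function name, the threshold value of 0.9, and the task names are all hypothetical, and the threshold should be tuned on the development set.

```python
def predict_with_reject(class_probs, threshold=0.9):
    """Return the most probable task, or None to reject (make no prediction).

    class_probs maps each task name to its predicted probability.
    """
    best_task = max(class_probs, key=class_probs.get)
    if class_probs[best_task] >= threshold:
        return best_task
    return None


print(predict_with_reject({"grading": 0.95, "email": 0.05}))
print(predict_with_reject({"grading": 0.60, "email": 0.40}))
```

Raising the threshold lowers the number of wrong predictions at the cost of more rejections, so it is worth reporting both quantities together.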
Tools To Consider
In addition to writing your own programs, you may wish to consider the
following software packages:
- Mirage. This is a data visualization package developed by Tin Kam Ho at Bell
Labs. It is not a machine learning system, but it may be very helpful
for visualizing the BodyMedia data and developing features.
- WEKA. For hyphenation, you could just convert this into some
form of sliding window data, convert to ARFF format, and then apply
standard supervised learning algorithms.
- WEKA Recurrent Sliding Windows (RSW) package. This was written by my
former student Saket Joshi, and it provides support for sliding
windows and recurrent sliding windows (which we will discuss in
class).
- MALLET. This is a Java package for fitting Conditional Random Fields to sequential
data. CRFs should work well for the hyphenation data.
- BNT. This is the Bayes Net Toolkit for Matlab. It supports Hidden Markov
Models, which might be a very good approach to the BodyMedia data.
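The sliding window conversion mentioned under WEKA above can be sketched as follows. This is one possible formulation, not a prescribed one: the function name, the window half-width of 2, and the "_" padding character are assumptions chosen for illustration. Each letter position becomes one example whose features are the surrounding letters and whose label says whether a hyphen may follow that letter.

```python
def sliding_windows(letters, labels, half_width=2, pad="_"):
    """Build one (window, label) example per letter position.

    letters: the word with hyphens removed.
    labels:  labels[i] is 1 if a hyphen may follow letters[i], else 0.
    The window holds half_width letters on each side of position i,
    padded with `pad` beyond the ends of the word.
    """
    padded = pad * half_width + letters + pad * half_width
    width = 2 * half_width + 1
    rows = []
    for i, label in enumerate(labels):
        window = list(padded[i:i + width])
        rows.append((window, label))
    return rows


# Made-up word for illustration only.
for window, label in sliding_windows("cat", [0, 0, 0], half_width=1):
    print(window, label)
```

Each window position can then be written out as one nominal attribute in an ARFF file, with the label as the class attribute.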
What to Turn In
On or before the due date, you should send me an email message with
the following:
- Tell me which data set you chose to study.
- A description of your approach: How did you formulate the
problem? What features did you use? What loss function?
What algorithm did you choose and why?
- A brief description of any other methods that you tried and why
you decided not to pursue them.
- If you chose the hyphenation problem, send me the hyphenated test
data. You should also send me your predicted false positive and false
negative rates and 95% confidence intervals on them. I will score your
predictions and let you know how you did.
- If you chose the BodyMedia problem, report how well you did
training and testing on each of the five users for each of the two
problems. Report 95% confidence intervals for false positive and
false negative rates.
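For the confidence intervals, one reasonable choice is the Wilson score interval for a binomial proportion, which behaves better than the plain normal approximation when the error rate is close to 0. The sketch below is only one option and you are free to use another method. It assumes binary 0/1 labels, treats the false positive rate as false positives divided by true negatives-plus-false-positives (all actual negatives), and uses z = 1.96 for a 95% interval. The function names are hypothetical.

```python
import math


def error_counts(y_true, y_pred):
    """Return ((false_pos, num_negatives), (false_neg, num_positives))."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return (fp, negatives), (fn, positives)


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for the proportion errors / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Made-up labels and predictions for illustration only.
(fp, neg), (fn, pos) = error_counts([0, 0, 1, 1], [0, 1, 0, 1])
print("FP rate interval:", wilson_interval(fp, neg))
print("FN rate interval:", wilson_interval(fn, pos))
```

Note that this interval assumes independent examples. Consecutive minutes or adjacent letter positions are correlated, so the true uncertainty is likely somewhat larger than the interval suggests.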