CS534 Homework 8: Due Friday June 3, 9:00am

This assignment will take more time than the previous homework assignments. The purpose of this assignment is to give you experience with a "raw" data set. With a raw data set, you must consider the following:

Data Sets

For the assignment, you can choose one of the following three data sets:

  • TaskTracer Data. The problem is to predict which task a user is executing at the desktop.

    If you would like to study a different data set, please email me immediately so that we can discuss it. I hope to have a third data set, from the TaskTracer project, which I will make available later this week.

    Hyphenation Problem

    I have placed two data files in /usr/local/classes/eecs/spring2005/cs534/weka/data: hyphen.train and hyphen.eval. These files are not in ARFF format. Each file has one word per line, and each word contains a hyphen at every point where it is legal to insert one. Your task is to create a program that takes a file of new words and outputs those words (one per line) with hyphens inserted in all of the legal places.
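One way to turn this format into a learning problem is to treat each gap between adjacent letters as a yes/no example. The sketch below is only an illustration of that encoding, not a required approach: the function names are hypothetical, and the word "hy-phen-ate" is a made-up example rather than a line taken from hyphen.train.

```python
def split_hyphenated(word):
    """Split a hyphenated word into its letters and one label per gap.

    labels[i] is True when a hyphen is legal between letters[i] and
    letters[i + 1], so len(labels) == len(letters) - 1.
    """
    letters = []
    labels = []
    for ch in word.strip():
        if ch == "-":
            if labels:
                labels[-1] = True   # mark the gap after the previous letter
        else:
            letters.append(ch)
            labels.append(False)    # assume no hyphen until we see one
    return "".join(letters), labels[:-1]  # no gap follows the last letter


def insert_hyphens(letters, labels):
    """Inverse of split_hyphenated: rebuild the one-word-per-line format."""
    out = []
    for i, ch in enumerate(letters):
        out.append(ch)
        if i < len(labels) and labels[i]:
            out.append("-")
    return "".join(out)


letters, labels = split_hyphenated("hy-phen-ate")
print(letters, labels)
print(insert_hyphens(letters, labels))
```

With this encoding, a classifier predicts one label per gap (for example, from the letters surrounding the gap), and insert_hyphens converts the predictions back into the file format.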

    I constructed this data set myself, so it may contain errors. I have checked most of the words against a dictionary, but I may have missed some, and I may also have made typing errors. If you find training examples that seem to be wrong, please let me know.

    On Wednesday, June 1, I will make available a test data set that does not have any hyphens in it. You should run your program on this data to predict the hyphen locations and submit a file to me that contains your predictions (in exactly the same format as the training and evaluation files).

    BodyMedia Problem

    You can read more about the BodyMedia data at http://www.cs.utexas.edu/users/sherstov/pdmc. The site describes a competition using this data and a workshop that will be held at the International Conference on Machine Learning in July.

    The BodyMedia data were collected from several users. Each user wore a BodyMedia device that measures a set of 9 variables every minute. The users also kept a log in which they indicated what activity they were performing at each point in time. However, they often forgot to record an activity. For the competition, two of these activities have been chosen for prediction. BodyMedia is not telling us what these activities are, so we will call them Context 1 and Context 2. I have provided separate data files for each context. The data files have the following naming pattern: context-usernn-type.arff where

    The class label on these examples is one of

    TaskTracer Problem

    TaskTracer is a project led by Jon Herlocker and myself with the goal of supporting multi-tasking knowledge workers under Microsoft Windows. The central hypothesis of TaskTracer is that at each point in time, the user is working on one "task" or "activity". The user defines a set of activities and provides training data telling the system which activity he/she is currently working on at all times. TaskTracer then attempts to learn a classifier that can predict the current task given information about the window that is currently in focus.

    The training data consists of a long sequence of examples. Each training example corresponds to a period of time in which one window was "in focus" and was open on one document. Each feature corresponds to a word that appears in the title of the window or in the path name of the file.
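To make the feature representation concrete, here is a minimal sketch of deriving word features from a window title and a file path. The tokenization rule (lowercased runs of letters and digits), the function name, and the example title and path are all assumptions made for illustration; the provided data files may have been built differently.

```python
import re


def window_features(title, path):
    """Return the set of words appearing in the window title or file path.

    Assumption: a "word" is a maximal run of letters or digits, lowercased.
    """
    text = f"{title} {path}".lower()
    return set(re.findall(r"[a-z0-9]+", text))


# Hypothetical window title and path, not taken from the data.
features = window_features("hw8.doc - Microsoft Word",
                           r"C:\classes\cs534\hw8.doc")
print(sorted(features))
```

Each distinct word then becomes one binary feature that is on for a training example when the word appears in that example's title or path.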

    The data are available in /usr/local/classes/eecs/spring2005/cs534/weka/data:

    The intended application of this classifier is to signal the user when he/she has forgotten to update the current task. This means that incorrect predictions are very bad, whereas rejections (cases where the classifier declines to make a prediction) are probably not too serious.
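One simple way to trade predictions for rejections is to predict only when the classifier's estimated probability for its top class exceeds a threshold. The sketch below assumes your classifier can output a probability for each task; the function names, the 0.9 threshold, and the task names are illustrative assumptions, not part of the assignment.

```python
def predict_with_reject(class_probs, threshold=0.9):
    """Return the most probable task, or None to reject the example.

    class_probs maps each task name to an estimated probability.
    """
    task, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return task if prob >= threshold else None


def coverage_and_precision(predictions, truths):
    """Return (fraction of examples predicted, accuracy on those examples).

    Precision is None when every example was rejected.
    """
    accepted = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    coverage = len(accepted) / len(predictions) if predictions else 0.0
    if not accepted:
        return coverage, None
    correct = sum(1 for p, t in accepted if p == t)
    return coverage, correct / len(accepted)


# Hypothetical probability estimates for three windows.
probs = [{"grading": 0.95, "email": 0.05},
         {"grading": 0.60, "email": 0.40},
         {"grading": 0.08, "email": 0.92}]
preds = [predict_with_reject(p) for p in probs]
print(preds)
print(coverage_and_precision(preds, ["grading", "email", "grading"]))
```

Raising the threshold lowers coverage but should raise precision, so reporting both numbers at several thresholds shows how well a method fits the intended application.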

    Tools To Consider

    In addition to writing your own programs, you may wish to consider the following software packages:

    What to Turn In

    On or before the due date, you should send me an email message with the following: