hw2-1, hw2-2, and br. These data sets are in the data subdirectory of the Weka directory (/usr/local/classes/eecs/spring2005/cs534/weka/data).
This folder is available on COE windows machines under drive W:.
I have also made the data available on the ENGR web server:
http://classes.engr.oregonstate.edu/eecs/spring2005/cs534/data/.
Each data set has one or more training data files and one test data file:
br data files:
  br-test.arff       br test data file
  br-train.arff      br training data file

hw2-1 data files:
  hw2-1-10.arff      10 training examples
  hw2-1-20.arff      20 training examples
  hw2-1-50.arff      50 training examples
  hw2-1-100.arff     100 training examples
  hw2-1-200.arff     200 training examples
  hw2-1-test.arff    test data file

hw2-2 data files:
  hw2-2-25.arff      25 training examples
  hw2-2-50.arff      50 training examples
  hw2-2-100.arff     100 training examples
  hw2-2-200.arff     200 training examples
  hw2-2-600.arff     600 training examples
  hw2-2-test.arff    test data file
You will run the three learning algorithms on each training data file and evaluate the results on the corresponding test data files.
For hw2-1, hw2-2, and br, you should turn in the following:
hw2-1:
  N     Perceptron   NaiveBayesSimple   LogisticRegression
  10    xxx          yyy                zzz
  20    xxx          yyy                zzz
  50    xxx          yyy                zzz
  100   xxx          yyy                zzz
  200   xxx          yyy                zzz

hw2-2:
  N     Perceptron   NaiveBayesSimple   LogisticRegression
  25    xxx          yyy                zzz
  50    xxx          yyy                zzz
  100   xxx          yyy                zzz
  200   xxx          yyy                zzz
  600   xxx          yyy                zzz

br:
  N     Perceptron   NaiveBayesSimple   LogisticRegression
  614   xxx          yyy                zzz

where xxx gives the error rate of the perceptron, yyy gives the error rate of NaiveBayesSimple, and zzz gives the error rate of LogisticRegression. We will measure error rates on separate files of test points.
For hw2-1 and hw2-2, you should also turn in graphs plotting the performance of the three algorithms as a function of the size of the training data set (known as a "learning curve"). I recommend using gnuplot or Excel for constructing the graphs -- I don't think WEKA provides an easy way to do this.
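If you would rather use Python than gnuplot or Excel, the following is a minimal sketch of how a learning curve could be drawn with matplotlib. The error rates in it are made-up placeholders; substitute the numbers you actually measure with WEKA.

# Sketch: plotting a learning curve with matplotlib.
# The error rates below are placeholders, not real results.
import matplotlib.pyplot as plt

sizes = [10, 20, 50, 100, 200]                 # hw2-1 training-set sizes
perceptron  = [0.40, 0.35, 0.33, 0.31, 0.30]   # hypothetical error rates
naive_bayes = [0.38, 0.34, 0.32, 0.30, 0.29]   # hypothetical error rates
logistic    = [0.39, 0.34, 0.31, 0.30, 0.29]   # hypothetical error rates

plt.plot(sizes, perceptron, marker="o", label="VotedPerceptron")
plt.plot(sizes, naive_bayes, marker="s", label="NaiveBayesSimple")
plt.plot(sizes, logistic, marker="^", label="Logistic")
plt.xlabel("Number of training examples (N)")
plt.ylabel("Test-set error rate")
plt.title("Learning curve for hw2-1")
plt.legend()
plt.savefig("hw2-1-learning-curve.png")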
You should also turn in a plot of the hw2-1-20 data with lines showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm. (Computing the boundary for Naive Bayes is more difficult, so you do not have to do that. Computing the boundary for the Voted Perceptron is even more difficult.) I recommend gnuplot for this, since it can plot equations as well as data points.
To compute the decision boundary for Logistic Regression, recall that
the logistic regression model has the form
log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2

WEKA produces a table that looks like this:

  Variable     Coeff.
  1            w1
  2            w2
  Intercept    w0
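The decision boundary is the set of points where the two classes are equally probable, i.e., where the log odds equals zero: w0 + w1*x1 + w2*x2 = 0, which you can solve as x2 = -(w0 + w1*x1)/w2. If you prefer Python to gnuplot, here is a minimal sketch. The coefficients w0, w1, w2 below are made-up stand-ins for the values WEKA reports, and the tiny ARFF reader simply assumes the two features come before the class label in each data row.

# Sketch: plotting the hw2-1-20 points and the logistic-regression boundary.
# w0, w1, w2 are placeholders; copy the Intercept and variable coefficients
# from WEKA's output table.
import matplotlib.pyplot as plt

w0, w1, w2 = -0.3, 1.2, -1.1   # hypothetical coefficients from WEKA

# Minimal ARFF reader: keep the first two columns as (x1, x2) and the last
# column as the class label (assumes a simple comma-separated @data section).
points = []
with open("hw2-1-20.arff") as f:
    in_data = False
    for line in f:
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        if line.lower().startswith("@data"):
            in_data = True
            continue
        if in_data:
            fields = line.split(",")
            points.append((float(fields[0]), float(fields[1]), fields[-1]))

# Plot the data points; adjust the label test to match the class names
# that actually appear in the file.
for x1, x2, label in points:
    plt.plot(x1, x2, "bo" if label == "1" else "rx")

# Decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w0 + w1*x1) / w2
xs = [min(p[0] for p in points), max(p[0] for p in points)]
plt.plot(xs, [-(w0 + w1 * x) / w2 for x in xs], "k-", label="LR boundary")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.savefig("hw2-1-20-boundary.png")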
You should also turn in a plot of the hw2-2-50 data with a line showing the learned decision boundary for Logistic Regression.
Finally, you should report the best possible error rate for the hw2-1 and hw2-2 data sets.
The best possible error rate is sometimes called the Bayes Rate. We can only know the
Bayes Rate for artificial data sets for which we know the procedure that generated the
data. The data set hw2-1
is generated from two
gaussian distributions. One is centered as (1,0) and the other at (0,1). Both have the
same co-variance matrix:
[ 2 0 ] [ 0 1 ]
The data set hw2-2 is generated as follows. The x coordinate is generated from an exponential distribution with parameter 1.0. The y coordinate is generated from a uniform random variable in the interval [0,1]. The class is assigned as follows: if x > 0.5, the example belongs to the positive class; otherwise it belongs to the negative class.
However, this class label is flipped with probability 0.1 (so-called
"10% label noise").
To understand how to compute the Bayes rate, consider a simpler problem where there is only one feature x and two equally-likely classes. Suppose data points for class 1 are drawn from a one-dimensional gaussian distribution with mean 1 and variance 1, while data points for class 0 are drawn from a one-dimensional gaussian distribution with mean -1 and variance 1. The optimal decision boundary will be at x = 0. Points where x > 0 will be classified as class 1, and points where x <= 0 will be classified as class 0. What will be the error rate of this optimal classifier? Let's consider class 1 first. The data points from class 1 have a gaussian distribution, so some of them will end up at x < 0 and be misclassified. What is the probability that a data point belonging to class 1 has x < 0? It is precisely the area under the tail of the standard normal distribution from -infinity up to -1 (because the threshold (0) minus the true mean (1) is -1).
You can look in any statistics book to find that this is 0.1587. You can also compute this using the R statistical package (which is installed on the research Suns, type R at a shell prompt; you can also download and install it from The R Project). Suppose we want the area under the standard normal curve from -infinity up to "a". You just enter
pnorm(a, 0, 1)

where the 0 is the mean and the 1 is the standard deviation.
In Matlab, you can type

0.5*erfc(-a/sqrt(2))

So this tells us that the probability of misclassifying a data point from class 1 is 0.1587. By symmetry, the same is true for data points from class 0. Hence, the probability of error is
P(y=1) * 0.1587 + P(y=0) * 0.1587 = [P(y=1) + P(y=0)] * 0.1587 = 1 * 0.1587 = 0.1587

Now let's return to the 2-dimensional gaussians of data set hw2-1. The way to solve this problem is to convert it into a one-dimensional problem and then use the method I've just presented. The idea is to take the "optimal projection" view of LDA. LDA computes a decision boundary. If we project the two gaussians on a line perpendicular to that decision boundary (and integrate away the dimension in the direction of the decision boundary), we will obtain our one-dimensional problem.
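If you would rather check the one-dimensional calculation in Python than in R or Matlab, scipy's normal CDF plays the role of pnorm. This sketch only reproduces the 0.1587 figure from the one-dimensional example above; computing the Bayes rate for hw2-1 itself is left to you.

# Sketch: the one-dimensional Bayes-rate calculation above, done with scipy.
# norm.cdf(a, loc, scale) is the analogue of R's pnorm(a, mean, sd).
from scipy.stats import norm

# P(x < 0) for a gaussian with mean 1 and variance 1: the threshold (0) minus
# the mean (1) is -1, so this is the standard normal CDF evaluated at -1.
p_misclassify_class1 = norm.cdf(-1.0)       # ~0.1587
p_misclassify_class0 = 1.0 - norm.cdf(1.0)  # the same value, by symmetry

# With equally likely classes, the Bayes rate is just that common value.
bayes_rate = 0.5 * p_misclassify_class1 + 0.5 * p_misclassify_class0
print(round(bayes_rate, 4))                 # 0.1587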
A few hints:
You can obtain WEKA by visiting the
WEKA Project Webpage and
clicking on the appropriate link for your operating system.
Alternatively, if you are on one of the CS systems, you can
access WEKA by connecting to /usr/local/classes/eecs/spring2005/cs534/weka
and executing the command run-weka or run-weka.bat. I have verified that this works from COE windows machines (under drive W).
These instructions will describe how to apply the learning algorithms to the BR data set. The others can be processed in exactly the same way, of course. When you start up Weka, you will first see the WEKA GUI Chooser, which has a picture of a bird (a weka) and four buttons. You should click on the Explorer button. This opens a large panel with several tabs, and the Preprocess tab will already be selected.
Click on "Open file...", then click on the "data" folder, and then select the "br-train.arff" file. The "Current relation" window should now show "Relation" as BR with 614 instances and 17 attributes. The table and bar plot on the right-hand side of the window will show 316 examples in class 0 and 298 in class 1.
Now click on the "Classify" tab of the Explorer window and examine the "Test options" panel. First we will load in the test data. Click on the radio button "Supplied test set". Then click on the "Set..." button. A small "Test Instances" pop-up window should appear. Click on "Open file...", navigate to the "data" folder, and select "br-test.arff". The Test Instances window should now show the relation "BR" with 613 instances and 17 attributes. You may close this window at this point.
Now we will tell Weka which of the 17 attributes is the class variable. Below the Test options panel, there is a drop down menu with the entry "(Num) x16" selected. Click on this and choose "(Nom) class" instead. [Num means numeric; Nom means nominal, i.e., discrete]
Now we need to select the learning algorithm to apply. Go to the "Classifier" panel (near the top) which initially shows two buttons: "Choose" and "ZeroR". ZeroR is a very simple rule-learning algorithm (which we do not want). The general idea of this user interface is that if you click on "Choose" you can choose a different algorithm. If you click on "ZeroR" (or whatever algorithm name is displayed there), you can set the parameters for the algorithm.
Click on "Choose", and you will see a hierarchical display whose top level is "weka", whose second level is "classifiers", and whose third level contains seven general kinds of classifiers: "bayes", "functions", "lazy", "meta", "trees", "rules". To choose NaiveBayesSimple, click on the "bayes" indicator and then select "NaiveBayesSimple". To select Logistic Regression, choose "functions" and then "Logistic". To select the Perceptron algorithm, choose "functions" and then "VotedPerceptron".
Once we have chosen an algorithm, we are ready to run it. Click on the "Start" button, and the Classifier Output window will show the output from the classifier. For Naive Bayes, this output consists of several sections: