br. These data sets are in the data subdirectory of the Weka directory (/usr/local/classes/eecs/spring2005/cs534/weka/data). This folder is available on COE Windows machines under drive W:. I have also made the data available on the ENGR web server: http://classes.engr.oregonstate.edu/eecs/spring2005/cs534/data/. Each data set has one or more training data files and one test data file:
br data files:
  br-train.arff     br training data file
  br-test.arff      br test data file

hw2-1 data files:
  hw2-1-10.arff     10 training examples
  hw2-1-20.arff     20 training examples
  hw2-1-50.arff     50 training examples
  hw2-1-100.arff    100 training examples
  hw2-1-200.arff    200 training examples
  hw2-1-test.arff   test data file

hw2-2 data files:
  hw2-2-25.arff     25 training examples
  hw2-2-50.arff     50 training examples
  hw2-2-100.arff    100 training examples
  hw2-2-200.arff    200 training examples
  hw2-2-600.arff    600 training examples
  hw2-2-test.arff   test data file
You will run the three learning algorithms on each training data file and evaluate the results on the corresponding test data files.
For each data set, you should turn in the following:
hw2-1:
  N    Perceptron  NaiveBayesSimple  LogisticRegression
  10   xxx         yyy               zzz
  20   xxx         yyy               zzz
  50   xxx         yyy               zzz
  100  xxx         yyy               zzz
  200  xxx         yyy               zzz

hw2-2:
  N    Perceptron  NaiveBayesSimple  LogisticRegression
  25   xxx         yyy               zzz
  50   xxx         yyy               zzz
  100  xxx         yyy               zzz
  200  xxx         yyy               zzz
  600  xxx         yyy               zzz

br:
  N    Perceptron  NaiveBayesSimple  LogisticRegression
  614  xxx         yyy               zzz

where xxx gives the error rate of the Perceptron, yyy gives the error rate of NaiveBayesSimple, and zzz gives the error rate of LogisticRegression. We will measure error rates on separate files of test points.
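To be concrete about what these table entries mean, here is a minimal sketch of how an error rate is computed: the fraction of test examples whose predicted label differs from the true label. The label lists below are hypothetical placeholders, not real WEKA output.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of examples whose predicted label differs from the true label."""
    assert len(true_labels) == len(predicted_labels)
    mistakes = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return mistakes / len(true_labels)

# Hypothetical labels for a five-example test set:
true_y = [0, 0, 1, 1, 1]
pred_y = [0, 1, 1, 1, 0]
print(error_rate(true_y, pred_y))  # 2 mistakes out of 5 -> 0.4
```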
For hw2-2, you should also turn in a plot of the performance of the three algorithms as a function of the size of the training data set (known as a "learning curve"). I recommend using gnuplot or Excel for constructing the graphs -- I don't think WEKA provides an easy way to do this.
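If you use gnuplot, one convenient approach is to write your measured error rates into a whitespace-separated data file and plot its columns. The sketch below uses hypothetical placeholder error rates and a hypothetical file name; substitute the values you actually measured.

```python
# Prepare a learning-curve data file for gnuplot.
# All error rates below are placeholders, not real results.
sizes = [10, 20, 50, 100, 200]                 # hw2-1 training-set sizes
perceptron  = [0.30, 0.28, 0.25, 0.22, 0.20]   # placeholder error rates
naive_bayes = [0.27, 0.26, 0.24, 0.23, 0.22]
logistic    = [0.29, 0.26, 0.23, 0.21, 0.20]

with open("hw2-1-curve.dat", "w") as f:
    for row in zip(sizes, perceptron, naive_bayes, logistic):
        f.write("%d %.3f %.3f %.3f\n" % row)

# In gnuplot you could then plot all three curves with:
#   plot "hw2-1-curve.dat" using 1:2 with linespoints title "Perceptron", \
#        "" using 1:3 with linespoints title "NaiveBayesSimple", \
#        "" using 1:4 with linespoints title "LogisticRegression"
```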
For hw2-1-20, turn in a plot with lines showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm. (Computing the boundary for Naive Bayes is more difficult, so you do not have to do that. Computing the boundary for the Voted Perceptron is even more difficult.) I recommend gnuplot for this, since it can plot equations as well as data points. To compute the decision boundary for Logistic Regression, recall that the logistic regression model has the form
log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2

WEKA produces a table of the fitted coefficients that looks like

  Variable   Coeff.
  1          w1
  2          w2
  Intercept  w0

The decision boundary is the set of points where the log-odds is zero, that is, where w0 + w1*x1 + w2*x2 = 0; solving for x2 gives a line you can plot as a function of x1.
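As a small sketch, here is how WEKA's fitted coefficients turn into a plottable boundary line. The coefficient values are hypothetical placeholders; read the real w0, w1, w2 from WEKA's coefficient table.

```python
# Setting the log-odds to zero, w0 + w1*x1 + w2*x2 = 0, and solving
# for x2 gives x2 = -(w0 + w1*x1) / w2 (assuming w2 != 0).
w0, w1, w2 = -1.5, 2.0, 4.0   # Intercept and coefficients (placeholders)

def boundary_x2(x1):
    """Return the x2 value on the decision boundary at a given x1."""
    return -(w0 + w1 * x1) / w2

# In gnuplot you could plot this line directly as an equation, e.g.:
#   plot "points.dat", -(-1.5 + 2.0*x)/4.0
print(boundary_x2(0.0))   # -(-1.5)/4 = 0.375
```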
For hw2-2-50, turn in a plot with a line showing the learned decision boundary for Logistic Regression.
Compute the best possible error rate for the hw2-1 and hw2-2 data sets. The best possible error rate is sometimes called the Bayes Rate. We can only know the Bayes Rate for artificial data sets for which we know the procedure that generated the data.
The data set hw2-1 is generated from two Gaussian distributions. One is centered at (1,0) and the other at (0,1). Both have the same covariance matrix:

  [ 2  0 ]
  [ 0  1 ]
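For intuition, this generating procedure can be sketched directly. Because the covariance matrix is diagonal, the two coordinates are independent, so each can be sampled with a one-dimensional Gaussian. (Illustrative only; the actual course data files were generated separately.)

```python
import math
import random

def sample(label, rng):
    """Draw one hw2-1-style point: mean (1,0) for class 1, (0,1) for class 0,
    shared diagonal covariance diag(2, 1)."""
    mean = (1.0, 0.0) if label == 1 else (0.0, 1.0)
    x = rng.gauss(mean[0], math.sqrt(2.0))  # variance 2 -> std sqrt(2)
    y = rng.gauss(mean[1], 1.0)             # variance 1 -> std 1
    return x, y, label

rng = random.Random(0)
points = [sample(rng.randrange(2), rng) for _ in range(5)]
```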
The data set hw2-2 is generated as follows. The x coordinate is generated from an exponential distribution with parameter 1.0. The y coordinate is generated from a uniform random variable in the interval [0,1]. The class is assigned as follows: if x > 0.5, the example belongs to the positive class; otherwise, to the negative class. However, this class label is flipped with probability 0.1 (so-called "10% label noise").
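The steps above can be sketched as follows. (Again illustrative only; the course data files were generated separately.)

```python
import random

def generate(n, rng):
    """Draw n hw2-2-style points: x ~ Exponential(1.0), y ~ Uniform[0,1],
    label positive iff x > 0.5, then flipped with probability 0.1."""
    data = []
    for _ in range(n):
        x = rng.expovariate(1.0)
        y = rng.uniform(0.0, 1.0)
        label = 1 if x > 0.5 else 0
        if rng.random() < 0.1:      # 10% label noise
            label = 1 - label
        data.append((x, y, label))
    return data
```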
To understand how to compute the Bayes rate, consider a simpler problem where there is only one feature x and two equally-likely classes. Suppose data points for class 1 are drawn from a one-dimensional gaussian distribution with mean 1 and variance 1, while data points for class 0 are drawn from a one-dimensional gaussian distribution with mean -1 and variance 1. The optimal decision boundary will be at x = 0. Points where x > 0 will be classified as class 1, and points where x <= 0 will be classified as class 0. What will be the error rate of this optimal classifier? Let's consider class 1 first. The data points from class 1 have a gaussian distribution, so some of them will end up at x < 0 and be misclassified. What is the probability that a data point belonging to class 1 has x < 0? It is precisely the area under the tail of the standard normal distribution from -infinity up to -1 (because the threshold (0) minus the true mean (1) is -1).
You can look in any statistics book to find that this is 0.1587. You can also compute this using the R statistical package (which is installed on the research Suns; type R at a shell prompt; you can also download and install it from The R Project). Suppose we want the area under the standard normal curve from -infinity up to "a". You just enter

  pnorm(a, 0, 1)

where the 0 is the mean and the 1 is the standard deviation.
In Matlab, you can type

  0.5*erfc(-a/sqrt(2))

So this tells us that the probability of misclassifying a data point from class 1 is 0.1587. By symmetry, the same is true for data points from class 0. Hence, the probability of error is
  P(y=1) * 0.1587 + P(y=0) * 0.1587 = [P(y=1) + P(y=0)] * 0.1587 = 1 * 0.1587 = 0.1587

Now let's return to the 2-dimensional gaussians of data set hw2-1. The way to solve this problem is to convert it into a one-dimensional problem and then use the method I've just presented. The idea is to take the "optimal projection" view of LDA. LDA computes a decision boundary. If we project the two gaussians onto a line perpendicular to that decision boundary (and integrate away the dimension along the decision boundary), we will obtain our one-dimensional problem.
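As a check on the one-dimensional calculation, the standard normal CDF (what pnorm(a, 0, 1) computes in R) can be written in Python with math.erfc, exactly as in the Matlab expression above.

```python
import math

def pnorm(a):
    """Area under the standard normal curve from -infinity to a,
    matching R's pnorm(a, 0, 1) and Matlab's 0.5*erfc(-a/sqrt(2))."""
    return 0.5 * math.erfc(-a / math.sqrt(2.0))

# Probability that a class-1 point (mean 1, variance 1) falls below the
# threshold 0, i.e. the tail up to (0 - 1) = -1:
print(round(pnorm(-1.0), 4))  # 0.1587
```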
A few hints:
You can obtain WEKA by visiting the WEKA Project Webpage and clicking on the appropriate link for your operating system. Alternatively, if you are on one of the CS systems, you can access WEKA by connecting to the Weka directory and executing the command run-weka.bat. I have verified that this works from COE Windows machines (under drive W:).
These instructions will describe how to apply the learning algorithms to the BR data set. The others can be processed in exactly the same way, of course. When you start up Weka, you will first see the WEKA GUI Chooser, which has a picture of a bird (a weka) and four buttons. You should click on the Explorer button. This opens a large panel with several tabs, and the Preprocess tab will already be selected.
Click on "Open file...", then click on the "data" folder, and then select the "br-train.arff" file. The "Current relation" window should now show "Relation" as BR with 614 instances and 17 attributes. The table and bar plot on the right-hand side of the window will show 316 examples in class 0 and 298 in class 1.
Now click on the "Classify" tab of the Explorer window and examine the "Test options" panel. First we will load in the test data. Click on the radio button "Supplied test set". Then click on the "Set..." button. A small "Test Instances" pop-up window should appear. Click on "Open file...", navigate to the "data" folder, and select "br-test.arff". The Test Instances window should now show the relation "BR" with 613 instances and 17 attributes. You may close this window at this point.
Now we will tell Weka which of the 17 attributes is the class variable. Below the Test options panel, there is a drop-down menu with the entry "(Num) x16" selected. Click on this and choose "(Nom) class" instead. [Num means numeric; Nom means nominal, i.e., discrete.]
Now we need to select the learning algorithm to apply. Go to the "Classifier" panel (near the top) which initially shows two buttons: "Choose" and "ZeroR". ZeroR is a very simple rule-learning algorithm (which we do not want). The general idea of this user interface is that if you click on "Choose" you can choose a different algorithm. If you click on "ZeroR" (or whatever algorithm name is displayed there), you can set the parameters for the algorithm.
Click on "Choose", and you will see a hierarchical display whose top level is "weka", whose second level is "classifiers", and whose third level contains the general kinds of classifiers, including "bayes", "functions", "lazy", "meta", "trees", and "rules". To choose NaiveBayesSimple, click on the "bayes" entry and then select "NaiveBayesSimple". To select Logistic Regression, choose "functions" and then "Logistic". To select the Perceptron algorithm, choose "functions" and then "VotedPerceptron".
Once we have chosen an algorithm, we are ready to run it. Click on the "Start" button, and the Classifier Output window will show the output from the classifier. For Naive Bayes, this output consists of several sections: