hw2-1, hw2-2, and br. These data sets are in the data subdirectory of the Weka directory (/usr/local/classes/eecs/spring2005/cs534/weka/data).
This folder is available on COE windows machines under drive W:.
I have also made the data available on the ENGR web server:
http://classes.engr.oregonstate.edu/eecs/spring2005/cs534/data/.
Each data set has one or more training data files and one test data file:
br data files:
  br-test.arff       br test data file
  br-train.arff      br training data file

hw2-1 data files:
  hw2-1-10.arff      10 training examples
  hw2-1-20.arff      20 training examples
  hw2-1-50.arff      50 training examples
  hw2-1-100.arff     100 training examples
  hw2-1-200.arff     200 training examples
  hw2-1-test.arff    test data file

hw2-2 data files:
  hw2-2-25.arff      25 training examples
  hw2-2-50.arff      50 training examples
  hw2-2-100.arff     100 training examples
  hw2-2-200.arff     200 training examples
  hw2-2-600.arff     600 training examples
  hw2-2-test.arff    test data file
You will run the three learning algorithms on each training data file and evaluate the results on the corresponding test data files.
For hw2-1, hw2-2, and br, you should turn in the following:
hw2-1:
  N     Perceptron   NaiveBayesSimple   LogisticRegression
  10    xxx          yyy                zzz
  20    xxx          yyy                zzz
  50    xxx          yyy                zzz
  100   xxx          yyy                zzz
  200   xxx          yyy                zzz

hw2-2:
  N     Perceptron   NaiveBayesSimple   LogisticRegression
  25    xxx          yyy                zzz
  50    xxx          yyy                zzz
  100   xxx          yyy                zzz
  200   xxx          yyy                zzz
  600   xxx          yyy                zzz

br:
  N     Perceptron   NaiveBayesSimple   LogisticRegression
  614   xxx          yyy                zzz

where xxx gives the error rate of the perceptron, yyy gives the error rate of NaiveBayesSimple, and zzz gives the error rate of LogisticRegression. We will measure error rates on separate files of test points.
For hw2-1 and hw2-2, you should also turn in graphs plotting the performance of the three algorithms as a function of the size of the training data set (known as a "learning curve"). I recommend using gnuplot or Excel for constructing the graphs -- I don't think WEKA provides an easy way to do this.
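If you would rather use Python than gnuplot or Excel, the following is a minimal sketch of how a learning curve could be drawn with matplotlib. The error rates in it are made-up placeholders; substitute the numbers you actually measure with WEKA.

# Sketch: plotting a learning curve with matplotlib.
# The error rates below are placeholders, not real results.
import matplotlib.pyplot as plt

sizes = [10, 20, 50, 100, 200]                 # hw2-1 training-set sizes
perceptron  = [0.40, 0.35, 0.33, 0.31, 0.30]   # hypothetical error rates
naive_bayes = [0.38, 0.34, 0.32, 0.30, 0.29]   # hypothetical error rates
logistic    = [0.39, 0.34, 0.31, 0.30, 0.29]   # hypothetical error rates

plt.plot(sizes, perceptron, marker="o", label="VotedPerceptron")
plt.plot(sizes, naive_bayes, marker="s", label="NaiveBayesSimple")
plt.plot(sizes, logistic, marker="^", label="Logistic")
plt.xlabel("Number of training examples (N)")
plt.ylabel("Test-set error rate")
plt.title("Learning curve for hw2-1")
plt.legend()
plt.savefig("hw2-1-learning-curve.png")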
You should also turn in a plot of the hw2-1-20 data with lines showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm. (Computing the boundary for Naive Bayes is more difficult, so you do not have to do that. Computing the boundary for the Voted Perceptron is even more difficult.) I recommend gnuplot for this, since it can plot equations as well as data points.
To compute the decision boundary for Logistic Regression, recall that
the logistic regression model has the form
log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2

WEKA produces a table that looks like this:

  Variable     Coeff.
  1            w1
  2            w2
  Intercept    w0
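The decision boundary is the set of points where the two classes are equally probable, i.e., where the log odds equals zero: w0 + w1*x1 + w2*x2 = 0, which you can solve as x2 = -(w0 + w1*x1)/w2. If you prefer Python to gnuplot, here is a minimal sketch. The coefficients w0, w1, w2 below are made-up stand-ins for the values WEKA reports, and the tiny ARFF reader simply assumes the two features come before the class label in each data row.

# Sketch: plotting the hw2-1-20 points and the logistic-regression boundary.
# w0, w1, w2 are placeholders; copy the Intercept and variable coefficients
# from WEKA's output table.
import matplotlib.pyplot as plt

w0, w1, w2 = -0.3, 1.2, -1.1   # hypothetical coefficients from WEKA

# Minimal ARFF reader: keep the first two columns as (x1, x2) and the last
# column as the class label (assumes a simple comma-separated @data section).
points = []
with open("hw2-1-20.arff") as f:
    in_data = False
    for line in f:
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        if line.lower().startswith("@data"):
            in_data = True
            continue
        if in_data:
            fields = line.split(",")
            points.append((float(fields[0]), float(fields[1]), fields[-1]))

# Plot the data points; adjust the label test to match the class names
# that actually appear in the file.
for x1, x2, label in points:
    plt.plot(x1, x2, "bo" if label == "1" else "rx")

# Decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w0 + w1*x1) / w2
xs = [min(p[0] for p in points), max(p[0] for p in points)]
plt.plot(xs, [-(w0 + w1 * x) / w2 for x in xs], "k-", label="LR boundary")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.savefig("hw2-1-20-boundary.png")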
You should also turn in a plot of the hw2-2-50 data with a line showing the learned decision boundary for Logistic Regression.
Finally, you should report the best possible error rate for the hw2-1 and hw2-2 data sets.
The best possible error rate is sometimes called the Bayes Rate. We can only know the
Bayes Rate for artificial data sets for which we know the procedure that generated the
data. The data set hw2-1
is generated from two
gaussian distributions. One is centered as (1,0) and the other at (0,1). Both have the
same co-variance matrix:
[ 2 0 ] [ 0 1 ]
The data set hw2-2 is generated as follows. The x coordinate is generated from an exponential distribution with parameter 1.0. The y coordinate is generated from a uniform random variable in the interval [0,1]. The class is assigned as follows: if x > 0.5, the example belongs to the positive class; otherwise it belongs to the negative class.
However, this class label is flipped with probability 0.1 (so-called
"10% label noise").
To understand how to compute the Bayes rate, consider a simpler problem where there is only one feature x and two equally-likely classes. Suppose data points for class 1 are drawn from a one-dimensional gaussian distribution with mean 1 and variance 1, while data points for class 0 are drawn from a one-dimensional gaussian distribution with mean -1 and variance 1. The optimal decision boundary will be at x = 0. Points where x > 0 will be classified as class 1, and points where x <= 0 will be classified as class 0. What will be the error rate of this optimal classifier? Let's consider class 1 first. The data points from class 1 have a gaussian distribution, so some of them will end up at x < 0 and be misclassified. What is the probability that a data point belonging to class 1 has x < 0? It is precisely the area under the tail of the standard normal distribution from -infinity up to -1 (because the threshold (0) minus the true mean (1) is -1).
You can look in any statistics book to find that this is 0.1587. You can also compute this using the R statistical package (which is installed on the research Suns, type R at a shell prompt; you can also download and install it from The R Project). Suppose we want the area under the standard normal curve from -infinity up to "a". You just enter
pnorm(a, 0, 1)

where the 0 is the mean and the 1 is the standard deviation.
In Matlab, you can type

0.5*erfc(-a/sqrt(2))

So this tells us that the probability of misclassifying a data point from class 1 is 0.1587. By symmetry, the same is true for data points from class 0. Hence, the probability of error is
P(y=1) * 0.1587 + P(y=0) * 0.1587 = [P(y=1) + P(y=0)] * 0.1587 = 1 * 0.1587 = 0.1587

Now let's return to the 2-dimensional gaussians of data set hw2-1. The way to solve this problem is to convert it into a one-dimensional problem and then use the method I've just presented. The idea is to take the "optimal projection" view of LDA. LDA computes a decision boundary. If we project the two gaussians on a line perpendicular to that decision boundary (and integrate away the dimension in the direction of the decision boundary), we will obtain our one-dimensional problem.
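If you would rather check the one-dimensional calculation in Python than in R or Matlab, scipy's normal CDF plays the role of pnorm. This sketch only reproduces the 0.1587 figure from the one-dimensional example above; computing the Bayes rate for hw2-1 itself is left to you.

# Sketch: the one-dimensional Bayes-rate calculation above, done with scipy.
# norm.cdf(a, loc, scale) is the analogue of R's pnorm(a, mean, sd).
from scipy.stats import norm

# P(x < 0) for a gaussian with mean 1 and variance 1: the threshold (0) minus
# the mean (1) is -1, so this is the standard normal CDF evaluated at -1.
p_misclassify_class1 = norm.cdf(-1.0)       # ~0.1587
p_misclassify_class0 = 1.0 - norm.cdf(1.0)  # the same value, by symmetry

# With equally likely classes, the Bayes rate is just that common value.
bayes_rate = 0.5 * p_misclassify_class1 + 0.5 * p_misclassify_class0
print(round(bayes_rate, 4))                 # 0.1587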
A few hints:
You can obtain WEKA by visiting the
WEKA Project Webpage and
clicking on the appropriate link for your operating system.
Alternatively, if you are on one of the CS systems, you can
access WEKA by connecting to /usr/local/classes/eecs/spring2005/cs534/weka
and executing the command run-weka or run-weka.bat. I have verified that this works from COE windows machines (under drive W).
These instructions will describe how to apply the learning algorithms to the BR data set. The others can be processed in exactly the same way, of course. When you start up Weka, you will first see the WEKA GUI Chooser, which has a picture of a bird (a weka) and four buttons. You should click on the Explorer button. This opens a large panel with several tabs, and the Preprocess tab will already be selected.
Click on "Open file...", then click on the "data" folder, and then select the "br-train.arff" file. The "Current relation" window should now show "Relation" as BR with 614 instances and 17 attributes. The table and bar plot on the right-hand side of the window will show 316 examples in class 0 and 298 in class 1.
Now click on the "Classify" tab of the Explorer window and examine the "Test options" panel. First we will load in the test data. Click on the radio button "Supplied test set". Then click on the "Set..." button. A small "Test Instances" pop-up window should appear. Click on "Open file...", navigate to the "data" folder, and select "br-test.arff". The Test Instances window should now show the relation "BR" with 613 instances and 17 attributes. You may close this window at this point.
Now we will tell Weka which of the 17 attributes is the class variable. Below the Test options panel, there is a drop down menu with the entry "(Num) x16" selected. Click on this and choose "(Nom) class" instead. [Num means numeric; Nom means nominal, i.e., discrete]
Now we need to select the learning algorithm to apply. Go to the "Classifier" panel (near the top) which initially shows two buttons: "Choose" and "ZeroR". ZeroR is a very simple rule-learning algorithm (which we do not want). The general idea of this user interface is that if you click on "Choose" you can choose a different algorithm. If you click on "ZeroR" (or whatever algorithm name is displayed there), you can set the parameters for the algorithm.
Click on "Choose", and you will see a hierarchical display whose top level is "weka", whose second level is "classifiers", and whose third level contains seven general kinds of classifiers: "bayes", "functions", "lazy", "meta", "trees", "rules". To choose NaiveBayesSimple, click on the "bayes" indicator and then select "NaiveBayesSimple". To select Logistic Regression, choose "functions" and then "Logistic". To select the Perceptron algorithm, choose "functions" and then "VotedPerceptron".
Once we have chosen an algorithm, we are ready to run it. Click on the "Start" button, and the Classifier Output window will show the output from the classifier. For Naive Bayes, this output consists of several sections: