**Learning Algorithms**. We will compare the Voted Perceptron, Naive Bayes Simple, and Logistic Regression.

**Data Sets**. We will apply these three algorithms to the data sets `hw2-1`, `hw2-2`, and `br`. These data sets are in the `data` subdirectory of the Weka directory (`/usr/local/classes/eecs/spring2005/cs534/weka/data`). This folder is available on COE windows machines under drive W:. I have also made the data available on the ENGR web server: http://classes.engr.oregonstate.edu/eecs/spring2005/cs534/data/. Each data set has one or more training data files and one test data file:

    br data files:
      br-test.arff     br test data file
      br-train.arff    br training data file

    hw2-1 data files:
      hw2-1-10.arff    10 training examples
      hw2-1-20.arff    20 training examples
      hw2-1-50.arff    50 training examples
      hw2-1-100.arff   100 training examples
      hw2-1-200.arff   200 training examples
      hw2-1-test.arff  test data file

    hw2-2 data files:
      hw2-2-25.arff    25 training examples
      hw2-2-50.arff    50 training examples
      hw2-2-100.arff   100 training examples
      hw2-2-200.arff   200 training examples
      hw2-2-600.arff   600 training examples
      hw2-2-test.arff  test data file

You will run the three learning algorithms on each training data file and evaluate the results on the corresponding test data files.

**Results**. For `hw2-1` and `hw2-2` you should turn in the following:

- A table in the following format:

      hw2-1:
        N    Perceptron  NaiveBayesSimple  LogisticRegression
        10   xxx         yyy               zzz
        20   xxx         yyy               zzz
        50   xxx         yyy               zzz
        100  xxx         yyy               zzz
        200  xxx         yyy               zzz

      hw2-2:
        N    Perceptron  NaiveBayesSimple  LogisticRegression
        25   xxx         yyy               zzz
        50   xxx         yyy               zzz
        100  xxx         yyy               zzz
        200  xxx         yyy               zzz
        600  xxx         yyy               zzz

      br:
        N    Perceptron  NaiveBayesSimple  LogisticRegression
        614  xxx         yyy               zzz

  where `xxx` gives the error rate of the Voted Perceptron, `yyy` gives the error rate of NaiveBayesSimple, and `zzz` gives the error rate of LogisticRegression. We will measure error rates on separate files of test points.

- Graphs of the results for `hw2-1` and `hw2-2` plotting the performance of the three algorithms as a function of the size of the training data set (known as a "learning curve"). I recommend using gnuplot or Excel for constructing the graphs -- I don't think WEKA provides an easy way to do this.

- Plot of the data points for `hw2-1-20` with a line showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm. (Computing the boundary for Naive Bayes is more difficult, so you do not have to do that. Computing the boundary for the Voted Perceptron is even more difficult.) I recommend gnuplot for this, since it can plot equations as well as data points. To compute the decision boundary for Logistic Regression, recall that the logistic regression model has the form

      log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2

WEKA produces a table of coefficients that looks like

      Variable    Coeff.
      1           w1
      2           w2
      Intercept   w0
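Given those coefficients, the decision boundary is the set of points where the log-odds equals zero, i.e. w0 + w1\*x1 + w2\*x2 = 0, which you can solve for x2. A minimal sketch of this computation (the coefficient values below are made-up placeholders, not actual WEKA output; substitute the values reported for your trained model):

```python
# Decision boundary of a two-feature logistic regression model:
# the boundary is where the log-odds is zero, i.e.
#   w0 + w1*x1 + w2*x2 = 0   =>   x2 = -(w0 + w1*x1) / w2
# The coefficients below are hypothetical placeholders.
w0, w1, w2 = -0.5, 1.2, -0.8

def boundary_x2(x1):
    """Return the x2 value on the decision boundary for a given x1."""
    return -(w0 + w1 * x1) / w2

# Two points are enough to draw the line (e.g. in gnuplot or Excel):
print(boundary_x2(0.0), boundary_x2(1.0))
```

In gnuplot you can skip the intermediate points entirely and plot the equation `-(w0 + w1*x) / w2` directly on top of the data.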

- Plot of the data points for `hw2-2-50` with a line showing the learned decision boundary for Logistic Regression.

- Compute the best possible error rate for the `hw2-1` and `hw2-2` data sets. The best possible error rate is sometimes called the Bayes Rate. We can only know the Bayes Rate for artificial data sets for which we know the procedure that generated the data. The data set `hw2-1` is generated from two gaussian distributions. One is centered at (1,0) and the other at (0,1). Both have the same covariance matrix:

      [ 2 0 ]
      [ 0 1 ]

  `hw2-2` is generated as follows. The x coordinate is generated from an exponential distribution with parameter 1.0. The y coordinate is generated from a uniform random variable in the interval [0,1]. The class is assigned as follows: if x > 0.5, the example belongs to the positive class; otherwise it belongs to the negative class. However, this class label is flipped with probability 0.1 (so-called "10% label noise").

To understand how to compute the Bayes rate, consider a simpler problem where there is only one feature x and two equally-likely classes. Suppose data points for class 1 are drawn from a one-dimensional gaussian distribution with mean 1 and variance 1, while data points for class 0 are drawn from a one-dimensional gaussian distribution with mean -1 and variance 1. The optimal decision boundary will be at x = 0: points where x > 0 will be classified as class 1, and points where x <= 0 will be classified as class 0. What will be the error rate of this optimal classifier? Let's consider class 1 first. The data points from class 1 have a gaussian distribution, so some of them will end up at x < 0 and be misclassified. What is the probability that a data point belonging to class 1 has x < 0? It is precisely the area under the tail of the standard normal distribution from -infinity up to -1 (because the threshold (0) minus the true mean (1) is -1).

You can look in any statistics book to find that this is 0.1587. You can also compute this using the R statistical package (which is installed on the research Suns; type R at a shell prompt; you can also download and install it from The R Project). Suppose we want the area under the standard normal curve from -infinity up to "a". You just enter

      pnorm(a, 0, 1)

where the 0 is the mean and the 1 is the standard deviation. In Matlab, you can type

      0.5*erfc(-a/sqrt(2))
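If you prefer Python, the same normal tail area can be computed with the standard library's `erfc`; this is just a Python transcription of the Matlab expression above, not part of the assignment tooling:

```python
from math import erfc, sqrt

def pnorm(a, mean=0.0, sd=1.0):
    """Area under the normal(mean, sd) curve from -infinity up to a,
    mirroring the Matlab expression 0.5*erfc(-a/sqrt(2))."""
    z = (a - mean) / sd
    return 0.5 * erfc(-z / sqrt(2))

# Tail of the standard normal below -1 (threshold 0 minus true mean 1):
print(round(pnorm(-1), 4))  # 0.1587
```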

So this tells us that the probability of misclassifying a data point from class 1 is 0.1587. By symmetry, the same is true for data points from class 0. Hence, the probability of error is

      P(y=1) * 0.1587 + P(y=0) * 0.1587 = [P(y=1) + P(y=0)] * 0.1587 = 1 * 0.1587 = 0.1587
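As a sanity check (not required for the assignment), this error rate can be verified by simulation: draw points from the two equally-likely class-conditional gaussians, classify with the threshold at x = 0, and measure the empirical error.

```python
import random

random.seed(0)
n = 200_000
errors = 0
for _ in range(n):
    # pick a class with equal probability, then draw x from that
    # class's gaussian (mean +1 for class 1, mean -1 for class 0)
    y = random.randint(0, 1)
    x = random.gauss(1.0 if y == 1 else -1.0, 1.0)
    # optimal classifier: predict class 1 iff x > 0
    y_hat = 1 if x > 0 else 0
    errors += (y_hat != y)

print(errors / n)  # close to 0.1587
```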

Now let's return to the 2-dimensional gaussians of data set hw2-1. The way to solve this problem is to convert it into a one-dimensional problem and then use the method I've just presented. The idea is to take the "optimal projection" view of LDA. LDA computes a decision boundary. If we project the two gaussians onto a line perpendicular to that decision boundary (and integrate away the dimension in the direction of the decision boundary), we obtain our one-dimensional problem.

A few hints:

- Hint 1: If the equation of a line is w * x = c, then w is a vector
  perpendicular to the line.
- Hint 2: If u is a unit vector and x is an arbitrary vector, then u * x
  (the dot product) gives the length of x projected onto u.
- Hint 3: If Sigma is the covariance matrix of a multivariate gaussian distribution and u is a unit vector, then sigma_u^2 = u^T Sigma u is the variance of the one-dimensional gaussian obtained by projecting onto u (where u^T indicates the transpose of u).
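These hints combine into a few lines of code. The sketch below projects the two hw2-1 class means and the shared covariance matrix onto a unit vector `u`; the particular `u` shown is just an arbitrary example direction, not the optimal projection, which you still need to work out:

```python
from math import sqrt

# shared covariance matrix and class centers of hw2-1 (from the assignment)
Sigma = [[2.0, 0.0],
         [0.0, 1.0]]
mean1 = (1.0, 0.0)
mean2 = (0.0, 1.0)

def unit(v):
    """Normalize v to unit length (Hints 2 and 3 require a unit vector)."""
    norm = sqrt(v[0] ** 2 + v[1] ** 2)
    return (v[0] / norm, v[1] / norm)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def projected_variance(u, S):
    """Hint 3: the variance of the projected gaussian is u^T Sigma u."""
    Su = (S[0][0] * u[0] + S[0][1] * u[1],
          S[1][0] * u[0] + S[1][1] * u[1])
    return dot(u, Su)

# example direction only -- NOT claimed to be the optimal one:
u = unit((1.0, -1.0))
m1 = dot(u, mean1)                # projected mean of class 1 (Hint 2)
m2 = dot(u, mean2)                # projected mean of class 2 (Hint 2)
v = projected_variance(u, Sigma)  # projected variance (Hint 3)
print(m1, m2, v)
```

Once you have the two projected means and the shared projected variance, the problem has the same shape as the one-dimensional example above.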


You can obtain WEKA by visiting the WEKA Project Webpage and clicking on the appropriate link for your operating system. Alternatively, if you are on one of the CS systems, you can access WEKA by connecting to `/usr/local/classes/eecs/spring2005/cs534/weka` and executing the command `run-weka` or `run-weka.bat`. I have verified that this works from COE windows machines (under drive W).

These instructions will describe how to apply the learning algorithms to the BR data set. The others can be processed in exactly the same way, of course. When you start up Weka, you will first see the WEKA GUI Chooser, which has a picture of a bird (a weka) and four buttons. You should click on the Explorer button. This opens a large panel with several tabs, and the Preprocess tab will already be selected.

Click on "Open file...", then click on the "data" folder, and then select the "br-train.arff" file. The "Current relation" window should now show "Relation" as BR with 614 instances and 17 attributes. The table and bar plot on the right-hand side of the window will show 316 examples in class 0 and 298 in class 1.

Now click on the "Classify" tab of the Explorer window and examine the "Test options" panel. First we will load in the test data. Click on the radio button "Supplied test set". Then click on the "Set..." button. A small "Test Instances" pop-up window should appear. Click on "Open file...", navigate to the "data" folder, and select "br-test.arff". The Test Instances window should now show the relation "BR" with 613 instances and 17 attributes. You may close this window at this point.

Now we will tell Weka which of the 17 attributes is the class variable. Below the Test options panel, there is a drop down menu with the entry "(Num) x16" selected. Click on this and choose "(Nom) class" instead. [Num means numeric; Nom means nominal, i.e., discrete]

Now we need to select the learning algorithm to apply. Go to the "Classifier" panel (near the top) which initially shows two buttons: "Choose" and "ZeroR". ZeroR is a very simple rule-learning algorithm (which we do not want). The general idea of this user interface is that if you click on "Choose" you can choose a different algorithm. If you click on "ZeroR" (or whatever algorithm name is displayed there), you can set the parameters for the algorithm.

Click on "Choose", and you will see a hierarchical display whose top level is "weka", whose second level is "classifiers", and whose third level contains several general kinds of classifiers: "bayes", "functions", "lazy", "meta", "trees", and "rules". To choose NaiveBayesSimple, click on the "bayes" indicator and then select "NaiveBayesSimple". To select Logistic Regression, choose "functions" and then "Logistic". To select the Perceptron algorithm, choose "functions" and then "VotedPerceptron".

Once we have chosen an algorithm, we are ready to run it. Click on the "Start" button, and the Classifier Output window will show the output from the classifier. For Naive Bayes, this output consists of several sections:

- Run Information: Details of the data set
- Classifier model: The learned model. For Naive Bayes, each Num attribute is modeled by its own gaussian distribution. The output shows the mean and standard deviation of that gaussian along with the probabilities of the two classes.
- Evaluation on test set: This gives various statistics. The key item is "Incorrectly Classified Instances", which is expressed as both a count and a percentage; you should report the percentages in your answer. One other item of interest comes at the very end: the Confusion Matrix, which shows how many false positive and false negative errors were made.