CS534 Homework 3 Due Monday April 18

In this assignment, you will use the WEKA system to analyze the same data sets as in the previous assignment, but with three new algorithms.

Learning Algorithms. We will compare decision trees (J48), neural networks, and k-nearest neighbors (IBk). You should use the defaults for these algorithms with the following exceptions:
- trees>J48. Set unpruned to True.
- functions>MultilayerPerceptron. Set hiddenLayers to 5 and trainingTime to 1000. (We will experiment with other settings below.)
- lazy>IBk. Set KNN to 1 (which is the default; we will experiment with other values below).

Data Sets. We will apply these three algorithms to the same data sets as for HW2: hw2-1, hw2-2, and br. The files are available at http://classes.engr.oregonstate.edu/eecs/spring2005/cs534/data/.

br data files:
      br-test.arff         br test data file
      br-train.arff        br training data file

hw2-1 data files:
      hw2-1-10.arff        10 training examples
      hw2-1-20.arff        20 training examples
      hw2-1-50.arff        50 training examples
      hw2-1-100.arff       100 training examples
      hw2-1-200.arff       200 training examples
      hw2-1-400.arff       400 training examples
      hw2-1-test.arff      test data file

hw2-2 data files:
      hw2-2-25.arff        25 training examples
      hw2-2-50.arff        50 training examples
      hw2-2-100.arff       100 training examples
      hw2-2-200.arff       200 training examples
      hw2-2-600.arff       600 training examples
      hw2-2-test.arff      test data file

You will run the three learning algorithms on each training data file and evaluate the results on the corresponding test data files.

Results. You should turn in the following:
1. A table in the following format:
```
hw2-1:
N            J48        NeuralNet           kNN
10           xxx        yyy                 zzz        
20           xxx        yyy                 zzz
50           xxx        yyy                 zzz
100          xxx        yyy                 zzz
200          xxx        yyy                 zzz
400          xxx        yyy                 zzz

hw2-2:
N            J48        NeuralNet           kNN
25           xxx        yyy                 zzz
50           xxx        yyy                 zzz
100          xxx        yyy                 zzz
200          xxx        yyy                 zzz
600          xxx        yyy                 zzz

br:
N            J48        NeuralNet           kNN
614          xxx        yyy                 zzz
```
  Here xxx is the error rate of J48, yyy is the error rate of the neural network, and zzz is the error rate of IBk. Each error rate is measured on the corresponding separate file of test points, not on the training data.
2. Graphs of the results for hw2-1 and hw2-2 plotting the performance of the three algorithms as a function of the size of the training data set (known as a "learning curve").
3. Plot of the data points for hw2-1-200 and hw2-2-200 with lines showing the decision boundary learned by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:
```
x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)
```
  The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points of which 17 were misclassified. Indentation indicates child nodes. The vertical bars are intended to make it easier to see the indentations.
  Note: You should only plot line segments that separate the two classes (not all separating lines chosen by J48). You should also plot the optimal decision boundaries as determined on HW2.
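  To illustrate the note, the example tree shown above can be written as nested tests. This is a minimal sketch for that illustrative tree only, not for a tree learned from the homework data; the function names `predict` and `separates_classes` are hypothetical. It checks whether the split at x1 = 1.0 actually separates the two classes at two different heights.
```python
def predict(x1, x2):
    """Classify a point using the example tree printed above."""
    if x1 <= 1.0:
        return "positive"
    # x1 > 1.0: the second split is on x2
    if x2 <= 5.0:
        return "negative"
    return "positive"


def separates_classes(p, q):
    """True if the tree assigns different labels to points p and q."""
    return predict(*p) != predict(*q)


# Probe the split x1 = 1.0 just left and just right of the threshold.
eps = 1e-6
below = separates_classes((1.0 - eps, 0.0), (1.0 + eps, 0.0))  # at x2 = 0 (<= 5.0)
above = separates_classes((1.0 - eps, 6.0), (1.0 + eps, 6.0))  # at x2 = 6 (> 5.0)
print(below, above)  # True False
```
  For this example tree, the line x1 = 1.0 is part of the decision boundary only where x2 <= 5.0, because above x2 = 5.0 both sides are labeled positive. The segment of x2 = 5.0 with x1 > 1.0 is also part of the boundary.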
4. Plot of the data points for hw2-1-200 and hw2-2-200 with a curve showing the decision boundary computed by the neural network code. To assist you with this, I have provided an additional file, grid.arff, which contains 10201 points on a 0.1 grid for x in [-5,5] and y in [-5,5]. To compute the decision boundary for neural networks, select this file as your "Supplied test set" in WEKA. After the neural network training is complete, right-click on the last entry in the Result list and select "Visualize classifier errors". You can visualize the decision boundary by selecting "X: x (Num)" and "Y: y (Num)". All of the points in grid.arff are labeled Positive, so WEKA plots incorrectly classified points (that is, those not predicted Positive) as blue squares and correctly classified points as blue x's. This will allow you to see the boundary.
  To determine the points on the boundary, click the "Save" button and choose a file name in which to save the outputs. If you examine this file, you will see that it contains five comma-separated values per line. The second and third values give the X and Y coordinates of the point. The fourth value is the predicted class and the fifth value is the correct class. You should write a program (or Perl script) to find pairs of consecutive lines where the predicted class changes from one line to the next and the X coordinate does not change. These points give an approximation to the decision boundary. You should also plot the optimal decision boundaries as determined on HW2.
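  One possible shape for such a program is sketched here in Python. It makes several assumptions that you should check against your own saved file: that header, comment, and blank lines can be skipped, that points sharing an X coordinate appear on consecutive lines, and that the midpoint between the two Y values is an acceptable estimate of the boundary point. The function names and the sample rows are made up for illustration.
```python
def parse_rows(lines):
    """Yield (x, y, predicted) from lines holding five comma-separated values.

    Assumption: header, comment, and blank lines (for example lines that
    start with '@' or '%') are not data and are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("@", "%")):
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5:
            continue
        try:
            x, y = float(fields[1]), float(fields[2])
        except ValueError:
            continue  # not a numeric data row
        yield x, y, fields[3]


def boundary_points(lines):
    """Return (x, y) midpoints between consecutive rows where the
    predicted class changes while the X coordinate stays the same."""
    points = []
    prev = None
    for row in parse_rows(lines):
        if prev is not None:
            same_x = abs(row[0] - prev[0]) < 1e-9
            if same_x and row[2] != prev[2]:
                points.append((row[0], (row[1] + prev[1]) / 2.0))
        prev = row
    return points


# Made-up rows in the five-value layout described above.
sample = [
    "@data",
    "0,-5.0,-5.0,Positive,Positive",
    "1,-5.0,-4.9,Negative,Positive",
    "2,-4.9,-5.0,Positive,Positive",
]
print(boundary_points(sample))
```
  In the sample, only the first pair of rows yields a boundary point. The second pair also changes predicted class, but the X coordinate changes too, so it is skipped. To process a real file, pass an open file object to `boundary_points`.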
5. Plot of the data points for hw2-1-200 and hw2-2-200 with a curve showing the decision boundary computed by the IBk (first nearest neighbor) rule. As with neural networks, you will need to use grid.arff to determine the decision boundary. Again, you should plot the optimal decision boundaries as determined on HW2.
6. The results of additional experiments with neural networks. Specifically, use the data set hw2-1-50 and repeat the neural network training with three different random seeds of your own choosing. You can set the random seed on the parameter panel for the neural network algorithm. Report the error rate on the test set for each of these three random seeds.
  Now on the same data set, change the number of hidden units to 40 by setting "hiddenLayers" to 40. Train this network with four different random seeds and report the error rate from each.
  An interesting thing to do is to visualize the misclassification errors of each network and compare them. You may also want to visualize the decision boundaries of each network using grid.arff inside Weka. (You do not need to turn in any additional graphs of these boundaries.) Another interesting thing to do is to train the network for a longer period of time by increasing the trainingTime. What happens to the weight values of the network as you train longer?
7. The results of additional experiments with IBk. Specifically, use the data set hw2-1-50 and repeat IBk training with the KNN parameter set to 3, 5, and 9. Report the error rate on the test set from each of these settings.
  Again, an interesting thing to do is to visualize the decision boundaries of these different KNN settings. You should see that the boundary becomes smoother as you increase K.
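  To see why a larger K smooths the boundary, here is a minimal sketch of a k-nearest-neighbor majority vote. It is not WEKA's IBk implementation (tie-breaking and other details may differ); the function name `knn_predict` and the toy training points are made up for illustration.
```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query.

    train is a list of ((x, y), label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: one isolated positive point surrounded by negatives.
train = [
    ((0.0, 0.0), "positive"),
    ((1.0, 0.0), "negative"),
    ((0.0, 1.0), "negative"),
    ((-1.0, 0.0), "negative"),
    ((0.0, -1.0), "negative"),
]
query = (0.1, 0.1)
print(knn_predict(train, query, 1), knn_predict(train, query, 3))  # positive negative
```
  With K = 1 the isolated positive point carves out its own small region. With K = 3 it is outvoted by its neighbors, so that region disappears and the boundary is smoother.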