Data files:

br data files:
  br-test.arff      br test data file
  br-train.arff     br training data file

hw2-1 data files:
  hw2-1-10.arff     10 training examples
  hw2-1-20.arff     20 training examples
  hw2-1-50.arff     50 training examples
  hw2-1-100.arff    100 training examples
  hw2-1-200.arff    200 training examples
  hw2-1-400.arff    400 training examples
  hw2-1-test.arff   test data file

hw2-2 data files:
  hw2-2-25.arff     25 training examples
  hw2-2-50.arff     50 training examples
  hw2-2-100.arff    100 training examples
  hw2-2-200.arff    200 training examples
  hw2-2-600.arff    600 training examples
  hw2-2-test.arff   test data file
You will run the three learning algorithms on each training data file and evaluate the results on the corresponding test data files.
Report your results in tables of the following form:

hw2-1:
  N     J48    NeuralNet  kNN
  10    xxx    yyy        zzz
  20    xxx    yyy        zzz
  50    xxx    yyy        zzz
  100   xxx    yyy        zzz
  200   xxx    yyy        zzz
  400   xxx    yyy        zzz

hw2-2:
  N     J48    NeuralNet  kNN
  25    xxx    yyy        zzz
  50    xxx    yyy        zzz
  100   xxx    yyy        zzz
  200   xxx    yyy        zzz
  600   xxx    yyy        zzz

br:
  N     J48    NeuralNet  kNN
  614   xxx    yyy        zzz

where xxx gives the error rate of J48, yyy gives the error rate of NeuralNetwork, and zzz gives the error rate of IBk. We will measure error rates on separate files of test points.
For hw2-2, turn in a graph plotting the performance of the three algorithms as a function of the size of the training data set (known as a "learning curve").
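A learning curve of this kind can be produced with a few lines of matplotlib. The following is a minimal sketch; the error-rate values below are placeholders, not real results — substitute the values from your own table.

```python
# Minimal learning-curve sketch. The error rates here are placeholders;
# replace them with the values from your results table.
import matplotlib
matplotlib.use("Agg")  # render to a file without a display
import matplotlib.pyplot as plt

sizes = [25, 50, 100, 200, 600]           # hw2-2 training-set sizes
j48_err = [0.30, 0.25, 0.20, 0.18, 0.15]  # placeholder error rates
nn_err  = [0.28, 0.24, 0.21, 0.17, 0.14]  # placeholder error rates
knn_err = [0.32, 0.27, 0.22, 0.19, 0.16]  # placeholder error rates

plt.plot(sizes, j48_err, marker="o", label="J48")
plt.plot(sizes, nn_err, marker="s", label="NeuralNet")
plt.plot(sizes, knn_err, marker="^", label="IBk (kNN)")
plt.xlabel("Number of training examples")
plt.ylabel("Test-set error rate")
plt.legend()
plt.savefig("hw2-2-learning-curve.png")
```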
Turn in a plot of the data in hw2-2-200 with lines showing the decision boundary learned by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:
x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)

The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points, of which 17 were misclassified. Indentation indicates child nodes. The vertical bars are intended to make it easier to see the indentations.
Note: You should only plot line segments that separate the two classes (not all separating lines chosen by J48). You should also plot the optimal decision boundaries as determined on HW2.
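To see which splits actually separate the classes, it can help to write the example tree above out as a function and check the label on each side of every split. This is a minimal sketch (not WEKA code) based on the example tree printed earlier:

```python
# Minimal sketch of the example J48 tree printed above (not WEKA code).
def classify(x1, x2):
    """Classify a point with the example tree."""
    if x1 <= 1.0:
        return "positive"
    if x2 <= 5.0:
        return "negative"
    return "positive"

# Only splits whose two sides receive different labels separate the classes.
# Here the line x1 = 1.0 separates positive from negative only where
# x2 <= 5.0, and the line x2 = 5.0 separates the classes only where x1 > 1.0,
# so only those two segments would be plotted.
boundary_segments = [
    ("x1 = 1.0", "for x2 <= 5.0"),
    ("x2 = 5.0", "for x1 > 1.0"),
]
```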
Turn in a plot of the data in hw2-2-200 with a curve showing the decision boundary computed by the neural network code. To assist you with this, I have provided an additional file, grid.arff. This file contains 10201 points on a 0.1 grid for x in [-5,5] and y in [-5,5]. To compute the decision boundary for neural networks, select this file as your "Supplied test set" in WEKA. Then, after the neural network training is complete, right-click on the last entry in the Result list and select "Visualize classifier errors". You can visualize the decision boundary by selecting "X: x (Num)" and "Y: y (Num)". All of the points in grid.arff are labeled Positive. Incorrectly classified points are plotted by WEKA as blue squares; correctly classified points are plotted as blue x's. This will allow you to see the boundary. However, to determine the points on the boundary, click the "Save" button and choose a file name in which to save the outputs. If you examine this file, you will see that it contains five comma-separated values per line. The second and third values give the X and Y coordinates of the point. The fourth value is the predicted class and the fifth value is the correct class. You should write a program (or Perl script) to find pairs of consecutive lines where the predicted class changes from one line to the next and where the X coordinate does not change. These points will give an approximation to the decision boundary. You should also plot the optimal decision boundaries as determined on HW2.
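The boundary-finding program described above can be sketched in a few lines. This version is in Python rather than Perl, and it assumes the field layout described in the text: five comma-separated values per line, with X in the second field, Y in the third, and the predicted class in the fourth.

```python
# Sketch of the boundary-finding program described above. Assumes the
# field layout from the text: X in field 2, Y in field 3, predicted
# class in field 4 (1-based).
def boundary_points(lines):
    """Return midpoints between consecutive grid points whose predicted
    class differs while the X coordinate stays the same."""
    points = []
    prev = None
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        x, y, predicted = float(fields[1]), float(fields[2]), fields[3]
        if prev is not None:
            px, py, ppred = prev
            if px == x and ppred != predicted:
                # The boundary crosses somewhere between py and y.
                points.append((x, (py + y) / 2.0))
        prev = (x, y, predicted)
    return points

# Hypothetical excerpt of a saved WEKA output file:
sample = [
    "1,-1.0,4.9,negative,positive",
    "2,-1.0,5.0,negative,positive",
    "3,-1.0,5.1,positive,positive",
    "4,-0.9,-5.0,negative,positive",
]
print(boundary_points(sample))  # one crossing at x = -1.0, between y = 5.0 and 5.1
```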
Turn in a plot of the data in hw2-2-200 with a curve showing the decision boundary computed by the IBk (first nearest neighbor) rule. As with neural networks, you will need to use grid.arff to determine the decision boundary. Again, you should plot the optimal decision boundaries as determined on HW2.
Use hw2-1-50 and repeat the neural network training with 3 different random seeds of your own choosing. You can set the random seed on the parameter panel for the neural network algorithm. Report the error rate on the test set from each of these three random seeds.
Now on the same data set, change the number of hidden units to 40 by setting "hiddenLayers" to 40. Train this network with four different random seeds and report the error rate from each.
An interesting thing to do is to visualize the misclassification errors of each network
and compare them. You may also want to visualize the decision boundaries of each network
using grid.arff inside WEKA. (You do not need to turn in any additional
graphs of these boundaries.) Another interesting thing to do is to
train the network for a longer period of time by increasing the
trainingTime. What happens to the weight values of the network as you
train it longer?
Use hw2-1-50 and repeat IBk training with the KNN parameter set to 3, 5, and 9. Report the error rate on the test set from each of these settings.
Again, an interesting thing to do is to visualize the decision boundaries of these different KNN settings. You should see that the boundary becomes smoother as you increase K.
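The smoothing effect of a larger K can be seen even in a tiny example. The following is a minimal sketch of the k-nearest-neighbor rule (not WEKA's IBk), with a hypothetical training set containing one isolated point of the opposite class:

```python
# Minimal k-nearest-neighbor sketch (not WEKA's IBk). The training set
# below is hypothetical: one isolated "negative" surrounded by positives.
from collections import Counter

def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    by_dist = sorted(train,
                     key=lambda ex: (ex[0] - point[0])**2 + (ex[1] - point[1])**2)
    votes = Counter(label for _, _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [(0.0, 0.0, "negative"),
         (1.0, 0.0, "positive"), (-1.0, 0.0, "positive"),
         (0.0, 1.0, "positive"), (0.0, -1.0, "positive")]

# With K = 1 a query near the origin picks up the isolated negative point;
# with K = 3 the majority vote smooths it away.
print(knn_predict(train, (0.1, 0.1), 1))  # negative
print(knn_predict(train, (0.1, 0.1), 3))  # positive
```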