There are three data sets: hw2-1, hw2-2, and br. The data files are available at http://classes.engr.oregonstate.edu/eecs/spring2005/cs534/data/.

br data files:
br-test.arff     br test data file
br-train.arff    br training data file

hw2-1 data files:
hw2-1-10.arff    10 training examples
hw2-1-20.arff    20 training examples
hw2-1-50.arff    50 training examples
hw2-1-100.arff   100 training examples
hw2-1-200.arff   200 training examples
hw2-1-400.arff   400 training examples
hw2-1-test.arff  test data file

hw2-2 data files:
hw2-2-25.arff    25 training examples
hw2-2-50.arff    50 training examples
hw2-2-100.arff   100 training examples
hw2-2-200.arff   200 training examples
hw2-2-600.arff   600 training examples
hw2-2-test.arff  test data file
You will run the three learning algorithms on each training data file and evaluate the results on the corresponding test data files.
Report your results in tables of the following form:

hw2-1:
N     J48   NeuralNet   kNN
10    xxx   yyy         zzz
20    xxx   yyy         zzz
50    xxx   yyy         zzz
100   xxx   yyy         zzz
200   xxx   yyy         zzz
400   xxx   yyy         zzz

hw2-2:
N     J48   NeuralNet   kNN
25    xxx   yyy         zzz
50    xxx   yyy         zzz
100   xxx   yyy         zzz
200   xxx   yyy         zzz
600   xxx   yyy         zzz

br:
N     J48   NeuralNet   kNN
614   xxx   yyy         zzz

where xxx gives the error rate of J48, yyy gives the error rate of NeuralNetwork, and zzz gives the error rate of IBk. We will measure error rates on separate files of test points.
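If you save each WEKA output buffer to a text file, a short script can collect the error rates for these tables. This is purely a convenience sketch; nothing about it (the file names, the exact spacing of WEKA's summary line) is required by the assignment.

import re
import sys

# Convenience sketch: pull the test-set error rate out of WEKA result
# buffers that have been saved to text files.  It assumes each file
# contains a summary line of the form
#   Incorrectly Classified Instances        34               17      %
# and takes the last such line if there are several.  Exact spacing may
# vary between WEKA versions, so treat this as a starting point.
def error_rate(path):
    rate = None
    with open(path) as f:
        for line in f:
            m = re.search(r"Incorrectly Classified Instances\s+\d+\s+([\d.]+)\s*%", line)
            if m:
                rate = float(m.group(1))
    return rate

if __name__ == "__main__":
    # Usage: python collect_errors.py j48-hw2-1-10.txt nn-hw2-1-10.txt ...
    for path in sys.argv[1:]:
        print(path, error_rate(path))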
Produce two graphs, one for hw2-1 and one for hw2-2, plotting the performance of the three algorithms as a function of the size of the training data set (known as a "learning curve").
Produce plots of the hw2-1-200 and hw2-2-200 data sets with lines showing the decision boundary learned by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:
x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)

The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points, of which 17 were misclassified. Indentation indicates child nodes, and the vertical bars are intended to make it easier to see the indentation.
Note: You should only plot line segments that separate the two classes (not all separating lines chosen by J48). You should also plot the optimal decision boundaries as determined on HW2.
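To make this concrete, here is a rough matplotlib sketch of the boundary segments implied by the sample tree above. The axis limits are placeholders (shown as [-5, 5], matching grid.arff); for your own trees you will of course read off different splits, and you would overlay the 200 training points on the same axes.

import matplotlib.pyplot as plt

# Boundary segments implied by the sample tree above.  Left of x1 = 1
# everything is positive; right of it, x2 = 5 separates negative (below)
# from positive (above).  The part of the line x1 = 1 that lies above
# x2 = 5 has positive on both sides, so it is not drawn.
x_min, x_max, y_min, y_max = -5, 5, -5, 5   # placeholders: adjust to your data's range

plt.plot([1.0, 1.0], [y_min, 5.0], "k-")    # x1 = 1, for x2 <= 5
plt.plot([1.0, x_max], [5.0, 5.0], "k-")    # x2 = 5, for x1 >= 1
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xlabel("x1")
plt.ylabel("x2")
plt.savefig("j48-boundary-example.png")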
Produce plots of the hw2-1-200 and hw2-2-200 data sets with a curve showing the decision boundary computed by the neural network code. To assist you with this, I have provided an additional file, grid.arff. This file contains 10201 points on a 0.1 grid for x in [-5,5] and y in [-5,5].
To compute the decision boundary for neural networks, select this as
your "Supplied test set" in WEKA. Then after the neural network
training is complete, you can right-click on the last entry in the
Result list and select "Visualize classifier errors". You can
visualize the decision boundary by selecting "X: x (Num)" and "Y: y
(Num)". All of the points in grid.arff
are labeled
Positive. Incorrectly classified points are plotted by WEKA as blue squares, while correctly classified points are plotted as blue x's. This
will allow you to see the boundary. However, to determine the points
on the boundary, click the "Save" button and choose a file name in
which to save the outputs. If you examine this file, you will see
that it contains five comma-separated values per line. The second and
third values give the X and Y coordinates of the points. The fourth
value is the predicted class and the fifth value is the correct class.
You should write a program (or Perl script) to find pairs of lines
where the predicted class changes from one line to the next and where
the X coordinate does not change. These points will give an
approximation to the decision boundary. You should also plot the
optimal decision boundaries as determined on HW2.
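A minimal sketch of such a script in Python, assuming the data lines of the saved file have exactly the five-value layout described above (any header or @attribute lines are skipped); adjust the column indices if your file differs.

import csv
import sys

# Find approximate decision-boundary points in the file saved from WEKA's
# "Visualize classifier errors" window.  Per the description above, each
# data line has five comma-separated values; the 2nd and 3rd are the X and
# Y coordinates and the 4th is the predicted class.
def boundary_points(path):
    points = []
    prev = None
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) < 5:
                continue                     # skip header / @attribute lines
            try:
                x, y = float(row[1]), float(row[2])
            except ValueError:
                prev = None                  # non-numeric line: reset and skip
                continue
            pred = row[3]
            if prev is not None:
                px, py, ppred = prev
                # Same X coordinate but the predicted class flipped:
                # the boundary crosses between these two grid points.
                if px == x and ppred != pred:
                    points.append((x, (py + y) / 2.0))
            prev = (x, y, pred)
    return points

if __name__ == "__main__":
    for x, y in boundary_points(sys.argv[1]):
        print(x, y)

Plotting the resulting (x, y) pairs on top of the 200 training points gives the neural network's boundary curve.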
Produce plots of the hw2-1-200 and hw2-2-200 data sets with a curve showing the decision boundary
computed by the IBk (first nearest neighbor) rule. As with neural
networks, you will need to use grid.arff
to determine the
decision boundary. Again, you should plot the optimal decision
boundaries as determined on HW2.
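If you would rather trace this boundary outside of WEKA, here is a rough NumPy sketch of the nearest-neighbor rule applied to the same grid. None of the names below come from the assignment: train_x and train_y stand for the training points and labels, exported from the .arff file however you like, and the optional K parameter will be useful again in the last part of this homework.

import numpy as np

# First-nearest-neighbor (IBk with KNN = 1) prediction over a grid.
# train_x is an (N, 2) array of training points and train_y an array of
# their class labels; both come from your own export of the training data.
def nn_predict(train_x, train_y, queries, k=1):
    preds = []
    for q in queries:
        d = np.sum((train_x - q) ** 2, axis=1)   # squared Euclidean distances
        nearest = np.argsort(d)[:k]              # indices of the k closest points
        votes = train_y[nearest]
        # Majority vote (ties broken arbitrarily); for k = 1 this is just
        # the label of the single nearest neighbor.
        values, counts = np.unique(votes, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

# The same 0.1 grid as grid.arff: x and y both range over [-5, 5].
xs = np.arange(-5, 5.0001, 0.1)
grid = np.array([(x, y) for x in xs for y in xs])
# labels = nn_predict(train_x, train_y, grid, k=1)
# The boundary can then be traced exactly as in the script above: look for
# adjacent grid points (same x) whose predicted label differs.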
Load the hw2-1-50 training set and repeat the
neural network training with 3 different random seeds of your own
choosing. You can set the random seed on the parameter panel for the
neural network algorithm. Report the error rate on the test set from
each of these three random seeds.
Now on the same data set, change the number of hidden units to 40 by setting "hiddenLayers" to 40. Train this network with four different random seeds and report the error rate from each.
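If you want a quick numerical summary of how much the random seed matters, a couple of lines of Python will do; the lists are placeholders for the error rates you actually measure.

import statistics

# Placeholders: fill in the test-set error rates (%) measured for each seed.
err_default = [0.0, 0.0, 0.0]        # default network, 3 seeds
err_hidden40 = [0.0, 0.0, 0.0, 0.0]  # hiddenLayers = 40, 4 seeds

for name, errs in [("default network", err_default), ("40 hidden units", err_hidden40)]:
    print(name, "mean =", statistics.mean(errs), "stdev =", statistics.stdev(errs))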
An interesting thing to do is to visualize the misclassification errors of each network
and compare them. You may also want to visualize the decision boundaries of each network
using grid.arff
inside Weka. (You do not need to turn in any additional
graphs of these boundaries.) Another interesting thing to do is to
train the network for a longer period of time by increasing the
trainingTime. What happens to the weight values of the network as you
train longer?
Load the hw2-1-50 training set again and repeat IBk training with
the KNN parameter set to 3, 5, and 9. Report the error rate on the
test set from each of these settings.
Again, an interesting thing to do is to visualize the decision boundaries of these different KNN settings. You should see that the boundary becomes smoother as you increase K.
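If you do this experiment outside WEKA, the nn_predict sketch from the IBk section above already takes a K parameter, so you can classify the grid for each K and trace the boundaries in the same way. A rough sketch follows; the calls at the bottom are commented out because train_x, train_y, and grid come from your own data export.

import numpy as np
import matplotlib.pyplot as plt

# Compare IBk boundaries for several K values by classifying the grid
# (reuses nn_predict and grid from the earlier sketch; train_x / train_y
# are your exported hw2-1-50 training points and labels).
def plot_boundary(train_x, train_y, grid, k, style):
    labels = nn_predict(train_x, train_y, grid, k=k)
    # Trace boundary points: adjacent grid points (same x, consecutive y)
    # whose predicted label differs.
    pts = []
    n = int(round(np.sqrt(len(grid))))          # the grid is n x n
    for i in range(len(grid) - 1):
        if (i + 1) % n != 0 and labels[i] != labels[i + 1]:
            pts.append((grid[i, 0], (grid[i, 1] + grid[i + 1, 1]) / 2.0))
    pts = np.array(pts)
    plt.plot(pts[:, 0], pts[:, 1], style, label="K = %d" % k)

# for k, style in zip((1, 3, 5, 9), ("k.", "b.", "g.", "r.")):
#     plot_boundary(train_x, train_y, grid, k, style)
# plt.legend(); plt.show()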