CS434 Assignment 3 Due Wed Nov 12th in class
Part I: Experiments with Support Vector Machines
- Learning Algorithm. We will experiment with SMO (sequential minimal
optimization, Weka's support vector machine implementation). We will
use an internal validation set to decide on the best settings of the
SVM parameters and choices. You can do this by selecting the Percentage
Split option in the Classify panel of the Weka Explorer to use 66% of
the data for training and 34% as the internal validation set. Then
you must perform a series of runs in which you vary the different
parameters and choices to find a setting that minimizes the
error on the internal validation set. Once you have chosen the best
parameter settings based on the internal validation set, choose the
"Supplied test set" option and use the br-test.arff data. Train your
algorithm on the entire training set using your chosen parameter
values, and evaluate on the BR test set.
For SMO, there are
three parameters that we must consider:
- the value of C which controls the tradeoff between fitting the
training data (large values) and maximizing the separating margin
(small values). Values of C outside the range from 0.01 to 100
usually don't work well.
- the choice of the kernel: polynomial (the default) or RBF
(gaussian, chosen by setting "useRBF" to true).
- the kernel parameters. For the polynomial kernel, the only
parameter is "exponent", which controls the degree of the polynomial.
Set to 1 for the linear kernel (i.e., no kernel at all, just a dot
product). Set to 2 for the quadratic kernel and 3 for the cubic
kernel. With polynomial kernels you can include low order terms by
setting "lowOrderTerms" to true. By default, the kernel is computed
as (x * y)^exponent. If you include low order terms, you get the
kernels we discussed in class, which are computed as (x * y +
1)^exponent.
For the RBF kernel, the parameter is "gamma", which controls
the width
of the RBF kernel. Values in the range from 1 to 10 usually work
well, but sometimes values as small as 0.1 or as large as 50 give good
results.
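As a quick intuition check, both kernels are easy to compute by hand. Here is a pure-Python sketch of what the two kernel choices and their parameters compute (the helper names and example vectors are illustrative, not Weka code):

```python
import math

def poly_kernel(x, y, exponent, low_order_terms=False):
    """Polynomial kernel: (x.y)^exponent, or (x.y + 1)^exponent
    when low-order terms are included."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + 1) ** exponent if low_order_terms else dot ** exponent

def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2).
    Larger gamma = narrower kernel (values fall off faster with distance)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [3.0, 0.5]
print(poly_kernel(x, y, 2))                        # (1*3 + 2*0.5)^2 = 16.0
print(poly_kernel(x, y, 2, low_order_terms=True))  # (4 + 1)^2 = 25.0
# Two points at distance 1: a larger gamma drives the kernel value
# toward 0 much faster.
print(rbf_kernel([0.0, 0.0], [1.0, 0.0], 0.1))     # ~0.905
print(rbf_kernel([0.0, 0.0], [1.0, 0.0], 10.0))    # ~4.5e-05
```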
Note that SMO will run out of memory with the default Java
parameters. I use the command "java -Xmx200m -jar weka.jar"
to request 200 megabytes of memory for the Java VM.
- You need to design a set of experiments using the internal
validation set to choose the most appropriate parameter settings for
the BR data set.
- In your report, you need to
describe:
- The experimental procedure that you used for model selection.
- The validation error of the different models
that you have investigated in the following table format:
C      kernel        kernel-params    Validation error
ccc    kkk           ppp              eee
where kkk is "polynomial" or "rbf" and ppp is the parameter value of
the kernel (exponent for polynomial and gamma for rbf). Include one
line for each combination of C, kernel,
and kernel parameters that you tried.
- Your
chosen parameters and the test set error when training on the entire
training set.
- A discussion about the sensitivity of SVM performance to the
choice of these parameters.
Part II: Experiments with Bagging and Boosting
- Learning Algorithms. Bagging and AdaBoostM1 are available
under the "Meta"
category in Weka. Please use the following settings:
- Bagging: set numIterations to 30. You will run experiments
with the classifier set to trees.J48, functions.Logistic, and
bayes.NaiveBayesSimple.
- AdaboostM1: set maxIterations to 30. Set weightThreshold to
1000. You will run experiments with the classifier set to the same
three algorithms as for Bagging.
For J48, set the "unpruned" option to True. You can use the
default
settings for all
other parameters of J48, NaiveBayesSimple, and Logistic Regression.
Optional: Rerun the
experiments with pruning turned on and see if it makes any difference.
In addition to running Bagging and AdaBoostM1, you should rerun
a single decision tree, a
single Naive Bayes, and a single logistic regression.
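For intuition about what Bagging is doing under the hood, here is a minimal sketch of its two ingredients, bootstrap resampling and majority voting (pure Python for illustration, not Weka's actual implementation):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples *with replacement* from the training set;
    each of Bagging's base classifiers is trained on one such sample."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Bagging's final prediction: the most common base-classifier vote."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
# A bootstrap sample typically repeats some examples and omits others,
# which is what decorrelates the 30 base classifiers.
print(sorted(sample))
print(majority_vote(["yes", "no", "yes"]))  # -> "yes"
```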
- Data Sets. We will apply these three algorithms to the same data
sets that we have been using before: hw2-1, hw2-2, and br.
However, we will not construct learning curves this time. Instead, you
should just train
on the following three files:
Domain Training Data File Test Data File
BR br-train.arff br-test.arff
hw2-1 hw2-1-200.arff hw2-1-test.arff
hw2-2 hw2-2-200.arff hw2-2-test.arff
- In your report, you should report the following:
- The results in the following
format:
- A discussion about the following
questions:
- Which algorithms+data sets are improved by
Bagging? Can you explain these results in terms of the bias and
variance of the
learning algorithms applied to these domains?
- Which algorithms+data sets are improved by Boosting?
Can
you provide possible explanations for why boosting can sometimes lead
to worsened performance?
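For the boosting discussion, it may help to look at AdaBoost's reweighting step in one common formulation (a sketch, not Weka's AdaBoostM1 source): misclassified examples get their weights multiplied up each round, so a few noisy or mislabeled examples can come to dominate the training distribution.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: up-weight misclassified examples,
    down-weight correct ones, then renormalize.
    weights: current example weights; correct: one bool per example."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)  # base classifier's vote weight
    new = [w * math.exp(alpha if not c else -alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new], alpha

w = [0.25] * 4
w, alpha = adaboost_reweight(w, [True, True, True, False])
print(w)  # the one misclassified example now carries half the total weight
```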