# CS534 Homework 7 Due Monday May 16

In this homework, you will experiment with overfitting avoidance methods on the BR data set.

• Learning Algorithms. We will experiment with J48, Logistic Regression, and SMO (support vector machines). In each case, we will use an internal holdout set to decide on the best setting of the overfitting parameters. You can do this by selecting the Percentage Split option in the Classify panel of the Weka Explorer, which uses 66% of the data for training and holds out 34% as the internal validation set. Then perform a series of runs in which you vary the overfitting parameters to find the values that minimize the error on the internal validation set. Once you have chosen the values of the overfitting parameters, choose the "Supplied test set" option and use the `br-test.arff` data. Train your algorithm on the entire training set using your chosen parameter values, and evaluate it on the BR test set.
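The selection loop described above can be sketched in Python. This is only an illustration of the procedure, not Weka's API: `percentage_split`, `select_parameter`, and the `train_and_error` callback are invented names standing in for the runs you perform in the Explorer.

```python
# Hypothetical sketch of the holdout-based parameter selection described
# above. train_and_error(train, valid, value) is a stand-in for one Weka
# run: train with the given parameter value, return validation error.
import random

def percentage_split(data, train_frac=0.66, seed=1):
    """Shuffle and split the rows, like Weka's Percentage Split option."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def select_parameter(data, candidates, train_and_error):
    """Return the candidate value with the lowest validation error."""
    train, valid = percentage_split(data)
    return min(candidates, key=lambda v: train_and_error(train, valid, v))
```

After `select_parameter` picks a value, the final step (as in the assignment) is to retrain on all of the training data with that value and evaluate once on the supplied test set.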

• Overfitting Parameters for Each Algorithm
• For J48, we control the amount of overfitting with the "confidenceFactor" parameter. Values in the range from 0.01 to 0.50 are sensible.

• For Logistic Regression, overfitting is controlled by the "ridge" parameter, which controls the size of a squared penalty on the weights. Values in the range from 0.01 to 100.0 are sensible.

• For SMO, there are three parameters that we must consider:
• the value of C, which controls the tradeoff between fitting the training data (large values) and maximizing the separating margin (small values). Values of C in the range from 0.01 to 100 are usually worth checking; values outside this range usually don't work well.
• the choice of the kernel: polynomial (the default) or RBF (gaussian, chosen by setting "useRBF" to true).
• the kernel parameters. For the polynomial kernel, the only parameter is "exponent", which controls the degree of the polynomial. Set to 1 for the linear kernel (i.e., no kernel at all, just a dot product). Set to 2 for the quadratic kernel and 3 for the cubic kernel. With polynomial kernels you can include low order terms by setting "lowOrderTerms" to true. By default, the kernel is computed as (x * y)^exponent. If you include low order terms, you get the kernels we discussed in class, which are computed as (x * y + 1)^exponent.

For the RBF kernel, the parameter is "gamma", which controls the width of the RBF kernel. Values in the range from 1 to 10 usually work well, but sometimes values as small as 0.1 or as large as 50 give good results.
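The kernel formulas above can be written out directly. This is an illustrative Python sketch, not Weka code; the RBF form exp(-gamma * ||x - y||^2) is the usual convention and is assumed here.

```python
# Illustrative computation of the two kernels described above.
import math

def poly_kernel(x, y, exponent, low_order_terms=False):
    """(x . y)^exponent, or (x . y + 1)^exponent with low-order terms."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1) ** exponent if low_order_terms else dot ** exponent

def rbf_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2); larger gamma gives a narrower kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

With exponent 1 and no low-order terms, `poly_kernel` reduces to the plain dot product, i.e., the linear kernel.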

• Results. You should turn in the following:
1. For J48, please construct a table of the form
```
confidence level     tree size       validation error
ppp                  sss             eee
```
with one row for each confidence level that you tried. The tree size is the total number of nodes, and it is reported by the algorithm. Finally, report your chosen confidence level, the resulting tree size (when trained on the entire training set), and the test set error.

2. For Logistic, please construct a table of the form
```
ridge parameter     sum of abs(coef)    validation error
ppp                 sss                 eee
```
with one row for each ridge value that you tried. The second column is the sum of the absolute values of the coefficients (not including the intercept term). You will need to compute this from the output produced by the algorithm. Finally, report your chosen ridge parameter, the resulting sum of abs(coef), and the test set error when training on the entire training set.

3. For SMO, please construct a table of the form
```
C          kernel          kernel-params      validation error
ccc        kkk             ppp                eee
```
where `kkk` is "polynomial" or "rbf" and `ppp` is the parameter value of the kernel (exponent for polynomial and gamma for rbf). Include one line for each combination of C, kernel, and kernel parameters that you tried. Finally, of course, report your chosen parameters and the test set error when training on the entire training set.
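For the second column of the Logistic table, the sum of abs(coef) can be computed with a small helper like the following. This is a hypothetical Python snippet, not part of Weka; where the intercept appears in Weka's printed coefficient list depends on the version, so its position is passed in explicitly here.

```python
# Hypothetical helper: sum |w_i| over the coefficients printed by
# Logistic, skipping the intercept term (position given by the caller).
def sum_abs_coefficients(coefs, intercept_index=-1):
    """Sum of absolute values of coefficients, excluding the intercept."""
    intercept_index %= len(coefs)
    return sum(abs(c) for i, c in enumerate(coefs) if i != intercept_index)
```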

• Note that SMO will run out of memory with the default Java parameters. I use the command `java -Xmx200m -jar weka.jar` to request 200 megabytes of memory for the Java VM.