The learning problem that you will try to solve is to train an optical character recognition classifier. The training data is drawn from 20 different fonts; each letter within these 20 fonts was distorted to produce a file of 20,000 unique training examples. To keep the training times reasonable, I have randomly selected 8,000 of these 20,000 examples. Each image was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15 (rescaled to the range -1.0 to +1.0 for the neural network training).
The following programs are available in the directory /nfs/stak/u1/t/tgd/cs533 on the engineering HP network:
c4.5: A decision tree learning program.

c4.5rules: A program for converting decision trees into rules.

opti: A conjugate-gradient training program for neural networks.

optesti: A program for evaluating test data using the networks learned by opti.
To check which kind of machine you are logged into, give the command uname to the shell. The response is "HP-UX" on HP machines and "SunOS" on Sun machines.
c4.5

c4.5 requires three files to run. The name of each file must begin with a file stem, which I will denote as stem. The files have different extensions:
stem.data is the training data file, with one line for each training example. Each line has the form

    f1, f2, f3, ..., f16, class

where f1, f2, ... are the features describing the example, and class is the correct class.

stem.test is the test data file. It has the same format as the training data file.

stem.names gives the name and legal values for each feature. It also gives the names of the various classes.
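For concreteness, here is a sketch of what these two text files might look like. The class names and feature declarations shown are hypothetical; consult the actual .names file in the course directory for the real contents. (C4.5 treats everything after a "|" as a comment.)

    | stem.names (hypothetical fragment)
    A, B, C, D.            | the class names, ending with a period
    f1: continuous.        | one entry per feature
    f2: continuous.        | (entries for f3 through f15 omitted here)
    f16: continuous.

    | stem.data (one hypothetical training example)
    2, 8, 3, 5, 1, 8, 13, 0, 6, 6, 10, 8, 0, 8, 0, 8, A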
To run c4.5, you give the following unix command:

    c4.5 -f stem -u > stem.log

This will first read the stem.names file and then read in all of the training examples in stem.data. It will then analyze all of these examples and construct (and then prune) a decision tree. Finally, it will test the resulting tree on the examples stored in stem.test. All output will be written on the file stem.log. It will also create two files, stem.trees and stem.unpruned, which contain the pruned and unpruned decision trees in binary format.
C4.5 summarizes its results in a table of the following form:
    Evaluation on training data (4000 items):

         Before Pruning            After Pruning
        ----------------   ---------------------------
        Size      Errors   Size      Errors   Estimate

        1085  496(12.4%)    873  546(13.7%)    (26.9%)   <<

    Evaluation on test data (4000 items):

         Before Pruning            After Pruning
        ----------------   ---------------------------
        Size      Errors   Size      Errors   Estimate

        1085 1232(30.8%)    873 1206(30.1%)    (26.9%)   <<

Most of this should be self-explanatory. The "Size" column gives the number of nodes in the decision tree. The "Errors" column gives the number (and percentage) of examples that are misclassified. The "Estimate" column gives the predicted error rate for new examples (this is the so-called "pessimistic" estimate, and it is computed internally by the tree algorithm). In this case, we see that the unpruned decision tree had 1,085 nodes and made 496 errors on the training data and 1,232 errors (or 30.8%) on the test data. Pruning made the tree significantly smaller (only 873 nodes) and, while it hurt performance on the training data, it slightly improved performance on the test data. The pessimistic estimate (26.9%) was actually a bit optimistic, but not too far off the mark (30.1%). You should use the error rate on the test data to plot your learning curves.
C4.5 also prints a confusion matrix that has one row and one column for every class. The number shown in row i, column j is the number of examples that were classified into class i but whose true class was j. A perfect classifier would produce a confusion matrix with nonzero entries only along the diagonal.
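For example, a hypothetical confusion matrix for a three-class problem (not actual output from these data) might look like the following, where the 4 in the first row says that 4 examples were classified as (a) but really belonged to class (b):

         (a)  (b)  (c)    <- true class
          50    4    1    classified as (a)
           2   61    0    classified as (b)
           0    3   47    classified as (c)

The 158 examples on the diagonal are the correctly classified ones.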
c4.5rules

c4.5rules can be run to convert the decision tree into a set of rules. To execute the program, use the following command line:

    c4.5rules -f stem -u >> stem.log
C4.5rules will read the stem.names, stem.data, and stem.unpruned files and append its output to the file stem.log. It will evaluate its rules on the examples in stem.test. This program can be quite slow. C4.5rules displays all of the rules and then summarizes the rule performance in the following table:
    Evaluation on training data (548 items):

    Rule  Size  Error  Used  Wrong          Advantage
    ----  ----  -----  ----  -----          ---------
      23     7  16.7%     4      0  (0.0%)    4 (4|0)    S
      12     3  27.8%    16      4 (25.0%)    9 (12|3)   A
       4     3  27.0%    35      9 (25.7%)   20 (26|6)   B
      18     4  19.6%   395     73 (18.5%)    0 (0|0)    X
      76     4  15.4%    11      1  (9.1%)    0 (0|0)    C
      81     5  25.0%    18      4 (22.2%)    0 (0|0)    C
      13     3  14.3%     5      0  (0.0%)    0 (0|0)    D

    Tested 548, errors 133 (24.3%)

    Evaluation on test data (548 items):

    Rule  Size  Error  Used  Wrong           Advantage
    ----  ----  -----  ----  -----           ---------
      23     7  16.7%     3      3 (100.0%)   -2 (0|2)    S
      12     3  27.8%    10      8  (80.0%)   -1 (2|3)    A
       4     3  27.0%    35     16  (45.7%)   12 (19|7)   B
      18     4  19.6%   409    110  (26.9%)    0 (0|0)    X
      76     4  15.4%     7      2  (28.6%)    0 (0|0)    C
      81     5  25.0%    15      4  (26.7%)    0 (0|0)    C
      13     3  14.3%     2      0   (0.0%)    0 (0|0)    D

    Tested 548, errors 174 (31.8%)
The columns have the following meaning. "Rule" is the number that identifies each rule in the output above, and "Size" is the number of conditions in the rule. "Error" is the predicted error rate of the rule. "Used" is the number of examples to which the rule was applied, and "Wrong" is the number (and percentage) of those examples that it misclassified. "Advantage" measures the net benefit of retaining the rule: it is the difference of the two numbers shown in parentheses, which count the examples the rule gets right and wrong compared with omitting it. The final letter is the class that the rule predicts.
In the same directory given above, I have placed the following data files for C4.5. The files have stems of the form fonts-nnnn, where nnnn is the number of training examples. There are files for 125, 250, 500, 1000, 2000, and 4000 examples. In each case, there is a file with extension .data for the training data, .test for the test data, and .names to define the features and classes. For all of these files, the .names and .test files are identical. In particular, the file fonts-4000.test is the final test set.
I have also created two files to allow you to experiment with validation set training. I took the 4000 examples in fonts-4000.data and randomly split them into 3000 examples for training (in file fonts-3000.data) and 1000 examples for validation (in file fonts-3000.test).
In this assignment, you should run C4.5 and C4.5rules on each of these files (except fonts-3000.data) and plot a learning curve showing the percentage of correct classifications (on the test set) as a function of the number of examples in the training set. Plot a curve for the unpruned trees and a curve for the pruned trees.
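If you want to automate these runs, a small shell script along the following lines should work. This is just a sketch: it assumes c4.5 and c4.5rules are on your path and that the fonts-nnnn files are in the current directory.

    #!/bin/sh
    # Run c4.5 and c4.5rules on every training-set size,
    # collecting the output for each stem in one log file.
    for n in 125 250 500 1000 2000 4000
    do
        c4.5 -f fonts-$n -u > fonts-$n.log
        c4.5rules -f fonts-$n -u >> fonts-$n.log
    done

Remember that c4.5rules can be quite slow, so you may want to start with the smaller training sets.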
In a real application, you would need to decide whether to use the pruned tree, the unpruned tree, or the production rules generated by C4.5rules to classify new examples. To make this decision, train each of these on fonts-3000.data and test on fonts-3000.test. Indicate which classifier should be used.
Turn in your graphs along with the confusion matrix for the fonts-4000.test file for your chosen classification method.
opt

The opt program is actually two programs:

opti: the program for training neural networks, and

optesti: a program for testing the resulting networks on new data.
To run, opti needs two files: a configuration file and a file of training "vectors". In the directory, I have provided these files:

opt.con is the configuration file. It has the format

    nlayers nepochs interval enderror seed ninput nhidden noutput

These have the following meaning: nlayers is the number of layers in the network, nepochs is the maximum number of training epochs, interval is the number of epochs between saves of the network weights, enderror is the value of the squared error at which training stops, seed is the seed for the random number generator, and ninput, nhidden, and noutput are the numbers of input, hidden, and output units. (A hypothetical example appears just after this list.)
fonts-nnnn.data.vec, where nnnn is the number of training examples. Each of these files has the same training examples as fonts-nnnn.data.

fonts-4000.test.vec has the final test examples.

fonts-3000.test.vec has the 1000 validation examples.
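As promised above, here is a hypothetical opt.con for this task. It might contain the line

    3 200 10 0.001 12345 16 40 26

where 16 inputs matches the 16 features and 26 outputs assumes one output unit per letter class; the remaining values (3 layers, 200 epochs, saving weights every 10 epochs, stopping at squared error 0.001, seed 12345, 40 hidden units) are illustrative guesses, not recommended settings. Check the opt.con provided in the course directory for the actual values.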
To run opti, you give a command such as the following:

    opti -n opt.con -d fonts-4000.data.vec -w wts -v > fonts-4000.opt.log

This will read its training data from fonts-4000.data.vec and save the weights into files with the names wts10, wts20, and so on. The program will also report the summed squared error (on the training set) after each epoch and store this information in the file fonts-4000.opt.log.
Specifically, each of the command line arguments has the following meaning:

-n opt.con gives the name of the configuration file.

-d fonts-4000.data.vec gives the name of the training example file.

-w wts gives the prefix of the files in which the weights will be stored.

-v turns on "verbose" mode, so that the squared error after each epoch is printed on the standard output.

> fonts-4000.opt.log sends the standard output to the file named fonts-4000.opt.log.
To determine how well the learned networks are doing, you must run a second program, optesti. You start it using the following command:

    optesti -n opt.con -t fonts-4000.test.vec -w wts -o cfn -v > fonts-4000.opt.test.log

This will read the test data from fonts-4000.test.vec and the weights from the files wts10, wts20, and so on (as specified by the interval field of opt.con). The program reads weight files until opti has finished, which is signalled when opti writes a file named last.epoch.
The optesti program can be run concurrently with opti. It will sleep until it sees that the next weight file has been written. Then it will read in that weight file and use it to classify the examples. It writes a confusion matrix onto a file of the form cfn10, cfn20, and so on.
In summary, the arguments given to optesti have the following meanings:

-n opt.con: The name of the configuration file.

-t fonts-4000.test.vec: The name of the test data file.

-w wts: The prefix of the weight files.

-o cfn: The prefix of the confusion matrix files.

-v: Verbose flag. This causes optesti to report its progress on the standard output, which in this case has been redirected into the log file fonts-4000.opt.test.log.
You should run these two programs first on the subtraining set fonts-3000.data.vec, testing on the validation set fonts-3000.test.vec. Do you see any signs of overfitting? If so, compute the stopping point based on the number of epochs. Then train on the full training set fonts-4000.data.vec and test on fonts-4000.test.vec. How well did this procedure choose the stopping point? Was there actually any overfitting? Please note that the runs of opti will require a few hours of CPU time and also consume a large amount of disk space. Plan ahead, and clean up unwanted files after each run.
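For example, the validation run might be started with a pair of commands like these. The prefixes wts3000 and cfn3000 and the log file names are my own illustrative choices; any prefixes will do, provided different runs use different ones. The trailing "&" runs opti in the background so that optesti can watch for the weight files as they appear:

    opti -n opt.con -d fonts-3000.data.vec -w wts3000 -v > fonts-3000.opt.log &
    optesti -n opt.con -t fonts-3000.test.vec -w wts3000 -o cfn3000 -v > fonts-3000.opt.test.log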
You should turn in two graphs: one showing the performance of the network trained on the subtraining set, measured on fonts-3000.test.vec, and the other showing the performance of the network trained on the full training set, measured on fonts-4000.test.vec.
Finally, you should repeat this process using a different random seed for opti. How does this change your results?
To plot your graphs, you may find gnuplot useful. If you store your data points in a file, for example test.gp, then you can create the plot as follows:

    % gnuplot
    set data style lines
    plot 'test.gp'
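To produce test.gp from an opti log, something like the following awk command may work. This is only a sketch: it assumes each verbose line ends with the squared error for that epoch, so inspect your log file and adjust the field selection to match its actual format.

    # hypothetical: emit "epoch-number squared-error" pairs
    awk '{ print NR, $NF }' fonts-4000.opt.log > test.gp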
You should use fonts-3000.test.vec as the validation set for each of these runs (since it is guaranteed not to contain any of the examples from the training sets of size 125, 250, 500, and 1000). Once you have chosen the stopping point, there is no need to reconstitute the training set. You should just evaluate performance on both the validation test set and the full test set. (Warning: be sure that your -o argument gives a different prefix for the confusion matrix files for the validation tests and the full tests. Otherwise, the two runs of optesti will interfere with each other.)