The learning problem that you will try to solve is to train an optical character recognition classifier. The training data is drawn from 20 different fonts; each letter within these 20 fonts was distorted to produce a file of 20,000 unique training examples. To keep the training times reasonable, I have randomly selected 8,000 of these 20,000 examples. Each image was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15 (rescaled to the range -1.0 to +1.0 for the neural network training).
The following programs are available in the directory /nfs/stak/u1/t/tgd/cs533 on the engineering HP network:
c4.5: A decision tree learning program.

c4.5rules: A program for converting decision trees into rules.

opti: A conjugate-gradient training program for neural networks.

optesti: A program for evaluating test data using the networks learned by opti.
To check which kind of machine you are logged into, give the command uname to the shell. The response is "HP-UX" on HP machines and "SunOS" on Sun machines.
c4.5

c4.5 requires three files to run. The name of each file must begin with a file stem, which I will denote as stem. The files have different extensions:
stem.data is the training data file, with one line for each training example. Each line has the form

    f1, f2, f3, ..., f16, class

where f1, f2, ... are the features describing the example, and class is the correct class.

stem.test is the test data file. It has the same format as the training data file.

stem.names gives the name and legal values for each feature. It also gives the names of the various classes.
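For concreteness, here is a sketch of what these two text files might look like. The class names and feature declarations shown are hypothetical; consult the actual .names file in the course directory for the real contents. (C4.5 treats everything after a "|" as a comment.)

    | stem.names (hypothetical fragment)
    A, B, C, D.            | the class names, ending with a period
    f1: continuous.        | one entry per feature
    f2: continuous.        | (entries for f3 through f15 omitted here)
    f16: continuous.

    | stem.data (one hypothetical training example)
    2, 8, 3, 5, 1, 8, 13, 0, 6, 6, 10, 8, 0, 8, 0, 8, A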
To run c4.5, you give the following unix command:

    c4.5 -f stem -u > stem.log

This will first read the stem.names file and then read in all of the training examples in stem.data. It will then analyze all of these examples and construct (and then prune) a decision tree. Finally, it will test the resulting tree on the examples stored in stem.test. All output will be written on the file stem.log. It will also create two files, stem.trees and stem.unpruned, which contain the pruned and unpruned decision trees in binary format.
C4.5 summarizes its results in a table of the following form:
    Evaluation on training data (4000 items):

         Before Pruning            After Pruning
        ----------------   ---------------------------
        Size      Errors   Size      Errors   Estimate

        1085  496(12.4%)    873  546(13.7%)    (26.9%)   <<

    Evaluation on test data (4000 items):

         Before Pruning            After Pruning
        ----------------   ---------------------------
        Size      Errors   Size      Errors   Estimate

        1085 1232(30.8%)    873 1206(30.1%)    (26.9%)   <<

Most of this should be self-explanatory. The "Size" column gives the number of nodes in the decision tree. The "Errors" column gives the number (and percentage) of examples that are misclassified. The "Estimate" column gives the predicted error rate for new examples (this is the so-called "pessimistic" estimate, and it is computed internally by the tree algorithm). In this case, we see that the unpruned decision tree had 1,085 nodes and made 496 errors on the training data and 1,232 errors (or 30.8%) on the test data. Pruning made the tree significantly smaller (only 873 nodes) and, while it hurt performance on the training data, it slightly improved performance on the test data. The pessimistic estimate (26.9%) was actually a bit optimistic, but not too far off the mark (30.1%). You should use the error rate on the test data to plot your learning curves.
C4.5 also prints a confusion matrix that has one row and one column for every class. The number shown in row i, column j is the number of examples that were classified into class i but whose true class was j. A perfect classifier would produce a confusion matrix with nonzero entries only along the diagonal.
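For example, a hypothetical confusion matrix for a three-class problem (not actual output from these data) might look like the following, where the 4 in the first row says that 4 examples were classified as (a) but really belonged to class (b):

         (a)  (b)  (c)    <- true class
          50    4    1    classified as (a)
           2   61    0    classified as (b)
           0    3   47    classified as (c)

The 158 examples on the diagonal are the correctly classified ones.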
c4.5rules

c4.5rules can be run to convert the decision tree into a set of rules. To execute the program, use the following command line:

    c4.5rules -f stem -u >> stem.log
C4.5rules will read the stem.names, stem.data, and stem.unpruned files and append its output to the file stem.log. It will evaluate its rules on the examples in stem.test. This program can be quite slow. C4.5rules displays all of the rules and then summarizes the rule performance in the following table:
    Evaluation on training data (548 items):

    Rule  Size  Error  Used  Wrong          Advantage
    ----  ----  -----  ----  -----          ---------
      23     7  16.7%     4      0  (0.0%)    4 (4|0)    S
      12     3  27.8%    16      4 (25.0%)    9 (12|3)   A
       4     3  27.0%    35      9 (25.7%)   20 (26|6)   B
      18     4  19.6%   395     73 (18.5%)    0 (0|0)    X
      76     4  15.4%    11      1  (9.1%)    0 (0|0)    C
      81     5  25.0%    18      4 (22.2%)    0 (0|0)    C
      13     3  14.3%     5      0  (0.0%)    0 (0|0)    D

    Tested 548, errors 133 (24.3%)

    Evaluation on test data (548 items):

    Rule  Size  Error  Used  Wrong           Advantage
    ----  ----  -----  ----  -----           ---------
      23     7  16.7%     3      3 (100.0%)   -2 (0|2)    S
      12     3  27.8%    10      8  (80.0%)   -1 (2|3)    A
       4     3  27.0%    35     16  (45.7%)   12 (19|7)   B
      18     4  19.6%   409    110  (26.9%)    0 (0|0)    X
      76     4  15.4%     7      2  (28.6%)    0 (0|0)    C
      81     5  25.0%    15      4  (26.7%)    0 (0|0)    C
      13     3  14.3%     2      0   (0.0%)    0 (0|0)    D

    Tested 548, errors 174 (31.8%)
The columns have the following meaning. "Rule" is the number that identifies each rule in the output above, and "Size" is the number of conditions in the rule. "Error" is the predicted error rate of the rule. "Used" is the number of examples to which the rule was applied, and "Wrong" is the number (and percentage) of those examples that it misclassified. "Advantage" measures the net benefit of retaining the rule: it is the difference of the two numbers shown in parentheses, which count the examples the rule gets right and wrong compared with omitting it. The final letter is the class that the rule predicts.
In the same directory given above, I have placed the following data files for C4.5. The files have stems of the form fonts-nnnn, where nnnn is the number of training examples. There are files for 125, 250, 500, 1000, 2000, and 4000 examples. In each case, there is a file with extension .data for the training data, .test for the test data, and .names to define the features and classes. For all of these files, the .names and .test files are identical. In particular, the file fonts-4000.test is the final test set.
I have also created two files to allow you to experiment with validation set training. I took the 4000 examples in fonts-4000.data and randomly split them into 3000 examples for training (in file fonts-3000.data) and 1000 examples for validation (in file fonts-3000.test).
In this assignment, you should run C4.5 and C4.5rules on each of these files (except fonts-3000.data) and plot a learning curve showing the percentage of correct classifications (on the test set) as a function of the number of examples in the training set. Plot a curve for the unpruned trees and a curve for the pruned trees.
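If you want to automate these runs, a small shell script along the following lines should work. This is just a sketch: it assumes c4.5 and c4.5rules are on your path and that the fonts-nnnn files are in the current directory.

    #!/bin/sh
    # Run c4.5 and c4.5rules on every training-set size,
    # collecting the output for each stem in one log file.
    for n in 125 250 500 1000 2000 4000
    do
        c4.5 -f fonts-$n -u > fonts-$n.log
        c4.5rules -f fonts-$n -u >> fonts-$n.log
    done

Remember that c4.5rules can be quite slow, so you may want to start with the smaller training sets.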
In a real application, you would need to decide whether to use the pruned tree, the unpruned tree, or the production rules generated by C4.5rules to classify new examples. To make this decision, train each of these on fonts-3000.data and test on fonts-3000.test. Indicate which classifier should be used.
Turn in your graphs along with the confusion matrix for the fonts-4000.test file for your chosen classification method.
opt

The opt program is actually two programs:

opti: the program for training neural networks, and

optesti: a program for testing the resulting networks on new data.
To run, opti needs two files: a configuration file and a file of training "vectors". In the directory, I have provided these files:

opt.con is the configuration file. It has the format

    nlayers nepochs interval enderror seed ninput nhidden noutput

These have the following meaning: nlayers is the number of layers in the network, nepochs is the maximum number of training epochs, interval is the number of epochs between saves of the network weights, enderror is the value of the squared error at which training stops, seed is the seed for the random number generator, and ninput, nhidden, and noutput are the numbers of input, hidden, and output units. (A hypothetical example appears just after this list.)
fonts-nnnn.data.vec, where nnnn is the number of training examples. Each of these files has the same training examples as fonts-nnnn.data.

fonts-4000.test.vec has the final test examples.

fonts-3000.test.vec has the 1000 validation examples.
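As promised above, here is a hypothetical opt.con for this task. It might contain the line

    3 200 10 0.001 12345 16 40 26

where 16 inputs matches the 16 features and 26 outputs assumes one output unit per letter class; the remaining values (3 layers, 200 epochs, saving weights every 10 epochs, stopping at squared error 0.001, seed 12345, 40 hidden units) are illustrative guesses, not recommended settings. Check the opt.con provided in the course directory for the actual values.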
To run opti, you give a command such as the following:

    opti -n opt.con -d fonts-4000.data.vec -w wts -v > fonts-4000.opt.log

This will read its training data from fonts-4000.data.vec and save the weights into files with the names wts10, wts20, and so on. The program will also report the summed squared error (on the training set) after each epoch and store this information in the file fonts-4000.opt.log.
Specifically, each of the command line arguments has the following meaning:

-n opt.con gives the name of the configuration file.

-d fonts-4000.data.vec gives the name of the training example file.

-w wts gives the prefix of the files in which the weights will be stored.

-v turns on "verbose" mode, so that the squared error after each epoch is printed on the standard output.

> fonts-4000.opt.log sends the standard output to the file named fonts-4000.opt.log.
To determine how well the learned networks are doing, you must run a second program, optesti. You start it using the following command:

    optesti -n opt.con -t fonts-4000.test.vec -w wts -o cfn -v > fonts-4000.opt.test.log

This will read the test data from fonts-4000.test.vec and the weights from the files wts10, wts20, and so on (as specified by the interval field of opt.con). The program reads weight files until opti has finished, which is signalled when opti writes a file named last.epoch.
The optesti program can be run concurrently with opti. It will sleep until it sees that the next weight file has been written. Then it will read in that weight file and use it to classify the examples. It writes a confusion matrix onto a file of the form cfn10, cfn20, and so on.
In summary, the arguments given to optesti have the following meanings:

-n opt.con: The name of the configuration file.

-t fonts-4000.test.vec: The name of the test data file.

-w wts: The prefix of the weight files.

-o cfn: The prefix of the confusion matrix files.

-v: Verbose flag. This causes optesti to report its progress on the standard output, which in this case has been redirected into the log file fonts-4000.opt.test.log.
You should run these two programs first on the subtraining set fonts-3000.data.vec, testing on the validation set fonts-3000.test.vec. Do you see any signs of overfitting? If so, compute the stopping point based on the number of epochs. Then train on the full training set fonts-4000.data.vec and test on fonts-4000.test.vec. How well did this procedure choose the stopping point? Was there actually any overfitting? Please note that the runs of opti will require a few hours of CPU time and also consume a large amount of disk space. Plan ahead, and clean up unwanted files after each run.
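For example, the validation run might be started with a pair of commands like these. The prefixes wts3000 and cfn3000 and the log file names are my own illustrative choices; any prefixes will do, provided different runs use different ones. The trailing "&" runs opti in the background so that optesti can watch for the weight files as they appear:

    opti -n opt.con -d fonts-3000.data.vec -w wts3000 -v > fonts-3000.opt.log &
    optesti -n opt.con -t fonts-3000.test.vec -w wts3000 -o cfn3000 -v > fonts-3000.opt.test.log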
You should turn in two graphs: one showing the performance of the network trained on the subtraining set, measured on fonts-3000.test.vec, and the other showing the performance of the network trained on the full training set, measured on fonts-4000.test.vec.
Finally, you should repeat this process using a different random seed for opti. How does this change your results?
To plot your graphs, you may find gnuplot useful. If you store your data points in a file, for example test.gp, then you can create the plot as follows:

    % gnuplot
    set data style lines
    plot 'test.gp'
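To produce test.gp from an opti log, something like the following awk command may work. This is only a sketch: it assumes each verbose line ends with the squared error for that epoch, so inspect your log file and adjust the field selection to match its actual format.

    # hypothetical: emit "epoch-number squared-error" pairs
    awk '{ print NR, $NF }' fonts-4000.opt.log > test.gp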
You should use fonts-3000.test.vec as the validation set for each of these runs (since it is guaranteed not to contain any of the examples from the training sets of size 125, 250, 500, and 1000). Once you have chosen the stopping point, there is no need to reconstitute the training set. You should just evaluate performance on both the validation test set and the full test set. (Warning: be sure that your -o argument gives a different prefix for the confusion matrix files for the validation tests and the full tests. Otherwise, the two runs of optesti will interfere with each other.)