Running the Diabetes Experiment
In the Pima Indians Diabetes experiment, the goal is to compare three
approaches to fitting a model:
- The Naive Bayes model
- A model found by a "hill climbing" search of the space of
Bayesian networks
- A knowledge-based model
We would like to evaluate these models on small and large data sets to
see if they give different results.
Obtaining and Preparing the Data
Download the diabetes.arff data file and save it in the weka-3-4/data folder.
These models require that the data be discretized. For our
experiment, we will discretize each input variable into 3 ranges
("low", "medium", "high") using an unsupervised, equal-frequency
binning algorithm. Because it is unsupervised, the algorithm does not
look at the class variable (i.e., whether or not the person has
diabetes) when choosing the cut points. Here is the procedure:
- Start WEKA. You will see the WEKA startup window:
Click on the "Explorer" button. This will bring up the main screen:
- Load the diabetes data by clicking on "Open file...", navigating
to the data folder, and selecting diabetes.arff. The main screen
should now look like this:
- Click on "Choose" in the "Filter" section. This will pop up the
following menu:
Click on the "+" beside "filters". Then click on the "+" beside
"unsupervised". Then click on the "+" beside "attributes", and
finally, click on "Discretize". The main screen should now show
Discretize -B 10 -M -1.0 -R first-last
in the area next
to the "Choose" button.
- Click on the "Discretize" text in this box. The following window
should pop up:
Set bins to be "3" and set "useEqualFrequency" to be "true". Then
click "OK".
- Click the "Apply" button on the right-hand side of the main
window. Your window should now look like this:
The attributes have now been discretized.
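If you prefer to script this preprocessing step rather than clicking through the Explorer, here is a minimal sketch using WEKA's Java API. The file path is an assumption (adjust it to wherever you saved diabetes.arff), and the class names are taken from the WEKA 3.4-era API, so they may need small adjustments in other versions.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeDiabetes {
    public static void main(String[] args) throws Exception {
        // Load the raw data (path is an assumption; adjust as needed).
        Instances data = new Instances(
                new BufferedReader(new FileReader("weka-3-4/data/diabetes.arff")));

        // Same settings as in the GUI: 3 bins, equal-frequency, all attributes.
        Discretize discretize = new Discretize();
        discretize.setBins(3);
        discretize.setUseEqualFrequency(true);
        discretize.setInputFormat(data);

        Instances discretized = Filter.useFilter(data, discretize);
        System.out.println(discretized.toSummaryString());
    }
}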
Fitting Bayesian Networks to the Data
- Now click on the "Classify" tab at the top of the window. This
will switch to the classification page. It should look like this:

- Now click on the "Choose" button in the "Classifier" section.
In the menu that pops up, click on the "+" next to "Bayes" and then
click on "BayesNet". Now the area to the right of the "Choose"
button should contain a long string of text that begins
BayesNet
-S -D -B -Q
.
- Click on this area to bring up the properties menu for the
BayesNet classifier:

What we do at this point depends on which of the three model-fitting
methods we are going to use. Let's start with the Naive Bayes
approach. In this approach, the Bayesian network consists of one edge
from the class node to each of the other variables.
To choose this structure, click on the "Choose" button next to
"searchAlgorithm". Click on the "+" next to "Fixed" and choose "Naive
Bayes". Then click "OK" on the properties menu.
- Now turn your attention to the "Test options" section of the
window. This gives us various options for how to evaluate the quality
of the model after it is fit to the training data. We will use the
"Percentage split" option, so click on the button next to it. The "%"
box should now be "un-grayed-out". By default, it says we will use
66% of the data for training and therefore 34% for evaluation. To
test how well Naive Bayes will work on small samples, we will test two
values for this number: 10% and 80%. So for now, erase the "66" and
type "10" into this box.
- We are ready to run the algorithm. Click the "Start" button.
Two things will happen. First, a line will appear in the "Result list
(right-click for options)" area. Second, there will be a lot of text
in the "Classifier output" window. Here is what my window looked like
at this point:
Note that there were 213 incorrectly classified instances, which gives
an error rate of 30.7803%. At the bottom of the window is the
"Confusion Matrix":
   a   b   <-- classified as
 314 141 |   a = tested_negative
  72 165 |   b = tested_positive
The rows in this matrix correspond to the correct classes (a = does
not have diabetes; b = has diabetes). Hence, there are a total of
314 + 141 = 455 patients without diabetes in the test data, and
72 + 165 = 237 patients with diabetes. The columns correspond to the
predicted classes. Hence, 314 of the 455 negative patients were
correctly classified as negative and 141 of them were incorrectly
classified as positives (called "false positives"). This gives a
false positive rate of 0.31. Conversely, 72 of the 237 positive
patients were falsely classified as negatives (called "false
negatives") and 165 were correctly classified as positives.
- We can visualize the Bayesian network by right-clicking on the
highlighted line in the "Result list" window and then selecting
"Visualize Graph". The resulting graph looks like this:
- Now close this window and go back to the "Percentage split"
box and set it to 80. Click "Start" again. You should see that the
error rate is now substantially lower. Also note that we now have
only 105 negative cases and 49 positive cases in our test data set.
Record the error rate for later plotting.
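If you want to reproduce this percentage-split experiment outside the Explorer, here is one way to sketch it with the Java API. The file path, the 10% split, and the random seed are assumptions, and this manual shuffling will generally not match the Explorer's internal randomization exactly, so expect slightly different numbers.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.search.fixed.NaiveBayes;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class NaiveBayesSplitExperiment {
    public static void main(String[] args) throws Exception {
        // Load and discretize the data as in the Preprocess steps above.
        Instances data = new Instances(
                new BufferedReader(new FileReader("weka-3-4/data/diabetes.arff")));
        Discretize disc = new Discretize();
        disc.setBins(3);
        disc.setUseEqualFrequency(true);
        disc.setInputFormat(data);
        data = Filter.useFilter(data, disc);
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle, then train on the first 10% and test on the remaining 90%,
        // which is roughly what the Explorer's "Percentage split" option does.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.10);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        BayesNet bn = new BayesNet();
        bn.setSearchAlgorithm(new NaiveBayes());
        bn.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(bn, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());   // the confusion matrix
    }
}

Changing 0.10 to 0.80 reproduces the 80% split.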
- Now let's evaluate how well a "knowledge-based" Bayesian
network will work. To do this, download the file diabetes.xml and save it in a convenient
folder. This file describes the structure of the network.
Now click on the "BayesNet -S -D -B" text to the right of the
"Choose" button in the "Classifier" section to bring up the
properties of the BayesNet classifier. Again click on the "Choose"
button for the "searchAlgorithm". This time, select "FromFile". This
will fill in the text "FromFile -B" in the box next to the "Choose"
button. Click on this text, and you should see the following
properties box:
In the "BIFFile" box, type the path name for the "diabetes.xml" file.
For example, mine was "C:/diabetes.xml". Now click "OK" to close this
properties menu and "OK" to close the BayesNet properties menu. Set
the Percentage Split back to 10, and click the "Start" button.
Record the results. Then set the percentage split to 80, click the
"Start" button, and record the results.
- Now let's evaluate how well a hill-climbing search works. Again,
click on the "BayesNet -S -D -B" text next to the "Choose" button in
the "Classifier" section. The properties window for BayesNet will pop
up. Again, click on the "Choose" button for "searchAlgorithm". This
time, click on the "+" next to "local" and select "HillClimber". This
will fill in the text "HillClimber -P 1 -S BAYES" for the
"searchAlgorithm". Click on this text to bring up the properties box
for the HillClimber, which looks like this:
Set "initAsNaiveBayes" (i.e., initialize the network to have a Naive
Bayes structure) to "False", set "maxNrOfParents" (maximum number of
incoming arrows to a node) to 3, and set
"useArcReversal" (consider reversing arcs during the search) to
"True". Then click "OK" to close this menu, and "OK" again to close
the BayesNet menu.
- We are ready to run the HillClimber algorithm. Set the
"Percentage Split" to 10, click "Start", and record the results. Then
set the "Percentage Split" to 80, click "Start" and record the
results.
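Again, for reference, the hill-climbing configuration above corresponds roughly to the following sketch (class and setter names from the WEKA 3.4-era API; they may differ slightly in other versions).

import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.search.local.HillClimber;

public class HillClimberStructure {
    // Build a BayesNet that searches for its structure by hill climbing,
    // using the same settings as the GUI walkthrough above.
    public static BayesNet hillClimbingNet() {
        HillClimber search = new HillClimber();
        search.setInitAsNaiveBayes(false);  // do not start from a Naive Bayes structure
        search.setMaxNrOfParents(3);        // at most 3 incoming arcs per node
        search.setUseArcReversal(true);     // also consider reversing arcs

        BayesNet bn = new BayesNet();
        bn.setSearchAlgorithm(search);
        return bn;
    }
}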
In my experiments, I found that the HillClimber worked best for both
large and small samples, that Naive Bayes was second best, and that
the knowledge-based network structure was the worst. You may want to
repeat the experiments with different percentage splits to see whether
this result holds up across the whole range. I suggest that you
record your data in a spreadsheet and plot learning curves for each of
the three methods.
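One convenient way to generate the data for such learning curves is to loop over training percentages and print CSV lines that can be pasted into a spreadsheet. The sketch below is one possible way to do this; the 10%-90% range, the fixed random seed, and the helper name are illustrative assumptions rather than part of the original exercise.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class LearningCurve {
    // Print one CSV line (training percentage, error rate) per split,
    // for whatever BayesNet configuration is passed in.
    public static void run(Classifier classifier, Instances data) throws Exception {
        System.out.println("trainPercent,errorRate");
        for (int percent = 10; percent <= 90; percent += 10) {
            Instances copy = new Instances(data);
            copy.randomize(new Random(1));
            int trainSize = (int) Math.round(copy.numInstances() * percent / 100.0);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test = new Instances(copy, trainSize, copy.numInstances() - trainSize);

            classifier.buildClassifier(train);  // re-train on this split
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(classifier, test);
            System.out.println(percent + "," + eval.errorRate());
        }
    }
}

Calling run(...) once with each of the three BayesNet configurations (Naive Bayes, FromFile, and HillClimber) on the discretized data produces three learning curves that can be plotted side by side.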