Running the Diabetes Experiment

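In the Pima Indians Diabetes experiment, the goal is to compare three approaches to fitting a Bayesian network model: a knowledge-based network structure, Naive Bayes, and a structure learned by the HillClimber search.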

We would like to evaluate these models on both small and large data sets to see whether they give different results.

Obtaining and Preparing the Data

Download the diabetes.arff data file and save it in the weka-3-4/data folder.

These models require that the data be discretized. For our experiment, we will discretize each input variable into 3 ranges ("low", "medium", "high") by using an automated algorithm. This algorithm does not analyze the class variable (i.e., whether the person has or does not have diabetes). Here is the procedure:
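
To illustrate what a class-blind, three-range discretization does, the following is a minimal Python sketch rather than the Weka procedure itself. It assumes equal-width binning, which is only one possible unsupervised scheme; the text above does not say which scheme the automated algorithm uses. The function name discretize_equal_width and the sample values are hypothetical and are not taken from Weka or from diabetes.arff.

```python
def discretize_equal_width(values, labels=("low", "medium", "high")):
    """Map each number to a label using equal-width bins over the observed range.

    Assumption: equal-width binning. The class variable is never consulted,
    so this is an unsupervised discretization.
    """
    lo, hi = min(values), max(values)
    n_bins = len(labels)
    width = (hi - lo) / n_bins
    result = []
    for v in values:
        if width == 0:
            # All values are identical: put everything in the first bin.
            index = 0
        else:
            # Clamp so that the maximum value falls in the last bin.
            index = min(int((v - lo) / width), n_bins - 1)
        result.append(labels[index])
    return result


# Made-up example values (not taken from diabetes.arff).
ages = [21, 25, 33, 41, 50, 62, 81]
print(discretize_equal_width(ages))
# ['low', 'low', 'low', 'medium', 'medium', 'high', 'high']
```

Here the observed range 21 to 81 is cut into three bins of width 20, so the boundaries fall at 41 and 61. An equal-frequency scheme would instead choose boundaries that put roughly the same number of cases in each range.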

Fitting Bayesian Networks to the Data

In my experiments, I found that the HillClimber worked best for both large and small samples, that Naive Bayes was second best, and that the knowledge-based network structure was the worst. You may want to repeat the experiments with different percentage splits to see whether this result holds up across the whole range. I suggest that you record your data in a spreadsheet and plot learning curves for each of the three methods.
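
If you prefer a script to a spreadsheet, the bookkeeping for the learning curves could be sketched as follows. This is only an illustration: it assumes you record one accuracy value per method and training percentage, and the names build_learning_curves and write_csv are hypothetical helpers. The numbers are 0.0 placeholders, not experimental results; replace them with the values you observe.

```python
import csv
import io
from collections import defaultdict


def build_learning_curves(records):
    """Group (method, train_percent, accuracy) records into one sorted curve per method."""
    curves = defaultdict(list)
    for method, train_percent, accuracy in records:
        curves[method].append((train_percent, accuracy))
    # Sort each curve by training percentage so it can be plotted left to right.
    return {method: sorted(points) for method, points in curves.items()}


def write_csv(records, stream):
    """Write the records as CSV rows that a spreadsheet program can open."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["method", "train_percent", "accuracy"])
    writer.writerows(records)


# Placeholder values only (0.0), NOT real results. Fill in your own measurements.
records = [
    ("HillClimber", 66, 0.0),
    ("HillClimber", 10, 0.0),
    ("NaiveBayes", 10, 0.0),
    ("NaiveBayes", 66, 0.0),
    ("KnowledgeBased", 10, 0.0),
    ("KnowledgeBased", 66, 0.0),
]

curves = build_learning_curves(records)
for method, points in curves.items():
    print(method, points)

buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

Each curve is a list of (training percentage, accuracy) points, so plotting one line per method against the training percentage gives the learning curves.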