CS 552, Spring 1998
General Guidelines Based on Studying Programmer Behavior Experimentally
Adapted from Ruven E. Brooks, 1980, "Studying Programmer Behavior
Experimentally: The Problems of Proper Methodology," Communications
of the ACM, April 1980, pp. 207-213.
(1) Subjects of a programming test should be both representative and uniform.
The subjects should be representative so that any statement made about
their behavior will also be true for some large population. The subjects should be
relatively uniform in regard to their characteristics and abilities.
If one group contains a disproportionate number of high-ability and/or low-ability
subjects, then the results will show this effect and not what we’re trying to measure.
It is difficult to fulfill both requirements simultaneously when the parent
population is heterogeneous. This is the case most of the time, so satisfying one
or the other is sufficient if the experiment can be designed to take this into account.
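One way an experiment might be designed to keep groups uniform is blocked assignment: rank subjects on some ability measure, then spread each block of similarly ranked subjects across the groups at random. The sketch below is only an illustration, not a method taken from Brooks; the function name `blocked_assignment`, the subject ids, and the pretest scores are all hypothetical.

```python
import random

def blocked_assignment(ability, n_groups, seed=0):
    """Assign subjects to groups so each group gets a similar ability mix.

    ability: dict mapping subject id -> score on some pretest
             (a hypothetical ability measure, assumed for this sketch).
    Subjects are ranked by score, cut into blocks of n_groups, and the
    members of each block are shuffled across the groups.
    """
    rng = random.Random(seed)
    ranked = sorted(ability, key=ability.get, reverse=True)
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]
        rng.shuffle(block)  # random placement within the block
        for group, subject in zip(groups, block):
            group.append(subject)
    return groups

# Hypothetical pretest scores, for illustration only.
scores = {"s1": 95, "s2": 90, "s3": 72, "s4": 70, "s5": 41, "s6": 38}
groups = blocked_assignment(scores, n_groups=2)
```

With these scores, each group receives one of the two strongest subjects, one of the two middle subjects, and one of the two weakest, so neither group is dominated by high or low ability subjects.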
(2) Do not use students of programming courses as accurate representatives of advanced programmers.
There are large individual differences both within and between programming
classes. There is no reason to believe that there exists a direct relationship between
beginning programming students and advanced programmers (i.e., no studies have been performed to show this).
In beginning programming courses, there are students majoring in Computer Science and students
taking the course only to fulfill course requirements. Thus, their backgrounds and interests
will differ significantly.
When no other large body of programmers is available and the experiment
does not seem to test problem-solving skills or debugging techniques, then beginning
students will suffice.
(3) Use within-subject testing when testing programming behavior.
In within-subject testing, each subject is exposed to all levels of the experimental variables under
investigation. The advantage is that the analysis is based on the relative performance of each
technique within each subject and not on the relative performance of each subject within
each test group.
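The advantage described above can be sketched with a small calculation: compare the techniques inside each subject first, then summarize those differences. The completion times, subject labels, and technique names below are made up purely for illustration, and `within_subject_differences` is a hypothetical helper.

```python
from statistics import mean

# Made-up completion times in minutes, for illustration only.
times = {
    "fast_subject": {"technique_x": 10, "technique_y": 14},
    "mid_subject":  {"technique_x": 30, "technique_y": 36},
    "slow_subject": {"technique_x": 60, "technique_y": 65},
}

def within_subject_differences(times, a, b):
    """Return the per-subject difference (b minus a).

    Each subject serves as their own baseline, so how fast or slow a
    subject is overall does not enter the comparison.
    """
    return {subject: t[b] - t[a] for subject, t in times.items()}

diffs = within_subject_differences(times, "technique_x", "technique_y")
mean_diff = mean(diffs.values())
```

In this made-up data the raw times range from 10 to 65 minutes, yet every subject's difference falls between 4 and 6 minutes. The large subject-to-subject variation drops out of the comparison between techniques.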
If an experiment tests the efficiency of three different programming
languages, then each subject would write programs in all three languages. If there were 4
different programs to write, then each subject would actually write 3 * 4 = 12 programs.
If the amount of testing any subject must do seems excessive, then opt for
another approach (i.e., the subject shouldn’t be tested for stamina).
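The workload arithmetic for a within-subject design like the one in this guideline can be sketched as follows. The language and program labels are placeholders, and rotating the language order from subject to subject is an added assumption (a common precaution against order effects), not something the guideline itself specifies.

```python
from itertools import product

LANGUAGES = ["A", "B", "C"]           # placeholder language labels
PROGRAMS = ["P1", "P2", "P3", "P4"]   # placeholder program tasks

def tasks_for_subject(subject_index):
    """Return every (language, program) pair for one subject.

    The language order is rotated by subject index so that no single
    language is always attempted first (an assumption of this sketch).
    """
    shift = subject_index % len(LANGUAGES)
    order = LANGUAGES[shift:] + LANGUAGES[:shift]
    return list(product(order, PROGRAMS))

schedule = {s: tasks_for_subject(s) for s in range(3)}
tasks_per_subject = len(schedule[0])  # 3 languages * 4 programs
```

Every subject ends up with the same 12 tasks in a different order. Computing the count up front makes it easy to judge whether the total load per subject is reasonable before the experiment is run.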
(4) Exclude irrelevant behavior when measuring time.
We don’t want to measure effects not associated with what we are actually testing.
We do not want to know how long it takes for a subject to fully understand the
task when we are testing programming ability. What we want to know is how long it takes the
programmer to write a complete program that solves the problem posed. The difficulty is that
some subjects may start programming before they fully understand the task, while others wait
until they fully understand it before starting.
There is not one.