This text describes precondition tests and 16 rules of thumb to help you decide whether parallelization is likely to pay off. The example presented here is based on a volume renderer application developed at the Cornell Theory Center as part of the Global Basins Research Network collaboration. A serial version of the volume renderer was written by Daniel Kartsch and Catherine Devine. It was parallelized by Hugh Caffey, first for the IBM ES 3090-600 (a shared-memory multiprocessor) and later for networks of IBM RS/6000 workstations (using the PVM message-passing library).
Consider a time-step simulation of geophysical processes, where each time step generates a large array (approximately 500 Kbytes) of 3D data. To analyze the process being simulated, it's necessary to convert the 3D data array to a 2D image that can be displayed on the computer screen. The final result is a series of those images, one per time step, that can be studied one at a time or displayed as an animated sequence.
Case Study 1
Parallelization is being considered because users want to run the simulation for thousands of time steps. This isn't practical with the current version, since it would take too long to get results (almost 150 hours of computer time would be needed for each 1,000 steps). Since the image rendering takes place in a separate processing phase, an increase in time steps would also mean that temporary storage of the data arrays could occupy a gigabyte of disk space.
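The storage estimate follows from simple arithmetic, using the array size quoted earlier. A quick sketch (the 2,000-step figure is an assumption, chosen to match "thousands of time steps"):

```python
# Back-of-envelope check of the temporary-storage estimate:
# one ~500-Kbyte 3D data array is written to disk per time step.
ARRAY_KBYTES = 500  # size of one data array, from the text

def storage_gbytes(steps, kbytes_per_step=ARRAY_KBYTES):
    """Total disk space, in gigabytes, needed to hold `steps` stored arrays."""
    return steps * kbytes_per_step / 1e6  # 1e6 Kbytes per Gbyte

# About 2,000 steps is enough to fill a gigabyte of disk.
```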
Step 1: Preconditions
Although the simulation is not executed on a daily basis, it's a stable application and likely to be used hundreds to thousands of times between modifications. It requires hours of computer time even for a relatively short simulation (15 hours for 100 time steps). Because of the performance constraints, scientists have been unable to get the number of steps they really wanted. In terms of the precondition tests, then, this application scores very high (light gray for frequency, white for execution time, white for resolution needs).
Step 2: Problem architecture
The application encompasses two phases, each with somewhat different data access and computational interrelationships. During the simulation itself, each time step evolves from the predecessor step and cannot be treated as independent. The rendering phase, on the other hand, processes each data array in totally independent fashion to generate the images.
One way to view the problem's architecture, then, is to consider the phases independently. The rendering phase is embarrassingly parallel; it's fairly easy to imagine replacing the single copy of the rendering program with 50 concurrent copies, each working on one data array and producing one image. The simulation phase is much more constrained, fitting the loosely synchronous model. At each time step, the grid data representing the geophysical structure must be accessed multiple times, and computation varies according to the structural characteristics at each grid point; moreover, the data at each step depend on the results of the previous step.
However, it's just as easy to think of this problem as a pipeline situation. The simulation delivers a data array to the renderer, then proceeds to calculate the next time step while the first one is being converted to an image. Since the data always flow from simulation to renderer, there really is no need to accumulate all the data arrays from all the time steps before starting to generate images. (Note that viewing the application in a slightly different way can help eliminate the data storage problem associated with thousands of time steps; this underscores the importance of taking some time to think about your application, since it ultimately can have significant impact on performance.)
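The pipeline view described above can be sketched as a producer-consumer pair: the simulator hands each data array to the renderer and immediately starts the next step. This is only an illustration; the function names are placeholders standing in for the real Fortran routines, and the bounded queue shows why the full set of arrays never needs to be accumulated on disk.

```python
import queue
import threading

# Stand-ins for the real simulation and rendering routines (names are illustrative).
def init_state():
    return 0

def advance(state):
    return state + 1          # pretend time-step calculation

def ray_cast(data):
    return f"image-{data}"    # pretend rendering of one data array

def run_pipeline(n_steps):
    """Overlap the two phases: the simulator computes step k+1 while the
    renderer converts step k to an image."""
    q = queue.Queue(maxsize=2)   # bounded: no need to store all arrays
    images = {}

    def simulate():
        state = init_state()
        for step in range(n_steps):
            state = advance(state)
            q.put((step, state))  # hand the data array to the renderer
        q.put(None)               # sentinel: simulation finished

    def render():
        while (item := q.get()) is not None:
            step, data = item
            images[step] = ray_cast(data)  # each array is independent

    t_sim = threading.Thread(target=simulate)
    t_ren = threading.Thread(target=render)
    t_sim.start(); t_ren.start()
    t_sim.join(); t_ren.join()
    return images
```

Because the queue is bounded, at most a couple of data arrays exist at any moment, which is how the pipeline view eliminates the gigabyte of temporary storage.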
Once the problem architecture is established, we can apply the first four rules of thumb to understand something about how much effort parallelization is likely to require. According to rule 2, balancing the computational intensity between the two phases could be problematic and is likely to be the critical issue.
Step 3: Machine
Mapping the problem to the appropriate machine style is relatively straightforward using the next four rules of thumb. According to rule 6, the application will probably perform best on a shared-memory machine. Since the working storage requirements are significant for both the simulation and rendering phases, an SMP is probably not appropriate; it is unlikely that either phase can fit on a single node. The same rule of thumb indicates that a distributed-memory system might also be acceptable. (Note that we can rule out SIMD. If that were the only machine available, we would likely discontinue the analysis at this point.)
Step 4: Language
As rule of thumb 9 points out, language options are likely to be limited. Since both phases of the application are currently implemented in Fortran, and since we intend to use a shared-memory multiprocessor, parallelization will be accomplished using Fortran plus compiler directives to control access to shared-memory variables (the data arrays produced by the simulation and consumed by the renderer).
Step 5: Performance expectations
The next step is to time the baseline version of the application (rule 10). Timing calls are inserted at the beginning and end of each phase (simulation and rendering). Since input/output activities will require serial execution, we also gather timings on the I/O portions of each phase. The measurements reveal a total of 554 seconds per time step: 4 for input initialization, 307 for the time-step calculations, 11 to store the data array, 9 more to reread the array at the start of the rendering phase, 205 for rendering calculations, and 18 for writing out the display image.
To calculate the parallel content for rule 11, we consider the portions that could be parallelized, comparing their duration with that of the overall code. It is important to analyze how behavior might change in the parallel version. In this case, the writing and subsequent rereading of the data array will be eliminated once the application is converted to pipeline form. Consequently, we eliminate their timings from the total, yielding a somewhat reduced whole-code time of 534 seconds and a parallel content of (307 + 205)/534, or about 96 percent.
For rule 12, we consider the impact of producing a full sequence of 1,800 time steps. Only the simulation's first step requires that data be input to initialize arrays; remaining steps will use data already available in memory or calculated by the preceding step. This is the only major change, since the rendering phase must still reinitialize its arrays for each image processed. We adjust the parallel content equation by eliminating the 4 seconds for data input, since they will be negligible for long simulations: (307 + 205)/530, or nearly 97 percent.
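The parallel-content arithmetic can be written out as a short script. The timings are the per-step measurements given above; the formula is simply the parallelizable fraction of the time that remains after pipelining removes the store/reread steps.

```python
# Per-step baseline timings (seconds), from the measurements above.
T_INPUT, T_SIM, T_STORE, T_REREAD, T_RENDER, T_WRITE = 4, 307, 11, 9, 205, 18

def parallel_content(parallel_secs, total_secs):
    """Fraction of execution time that can run in parallel."""
    return parallel_secs / total_secs

# Pipelining eliminates the store and reread of the data array...
total = T_INPUT + T_SIM + T_RENDER + T_WRITE            # 534 s
pc = parallel_content(T_SIM + T_RENDER, total)          # ~0.96

# ...and for long runs the one-time input becomes negligible too.
pc_long = parallel_content(T_SIM + T_RENDER, total - T_INPUT)
```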
Rules 13-15 remind us of the fragility of these estimates, but do not raise any warning flags. Because our target is a shared-memory system, rule 16 can be ignored.
The rules of thumb indicate that our problem lends itself to parallelism, is likely to be relatively straightforward and to yield reasonable performance on a shared-memory system, and has a sufficiently high parallel content to make the effort worthwhile.
The application on which this example is based was, in fact, parallelized for a shared-memory multiprocessor. As indicated by the text's discussions of machine architectures, the major programming hurdle in parallelizing this application was the addition of locking mechanisms to protect the shared data arrays. In particular, since the second phase executes faster than the first, the renderer had to be prevented from trying to read an input array before it had been fully generated by the simulator. However, the effort required to parallelize the application was minimal since an efficient, well-debugged baseline serial version was already available.
The resulting performance was 307 seconds per time step (the time required for the slower simulation phase), plus 4 seconds for the initial simulation input and 223 seconds to render the final step after all simulation steps were complete. For executions involving 1,800 time steps, the total was approximately 156 hours, as compared to the 267 hours that a serial version would have required.
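The reported total can be cross-checked with simple arithmetic. This sketch ignores lock contention and other synchronization overhead, which may be why it lands a little under the measured figure of roughly 156 hours.

```python
def pipeline_hours(steps, t_sim=307, t_input=4, t_final_render=223):
    """Estimated wall-clock hours for the pipelined run: the slower
    simulation phase dominates each step, plus the one-time input and
    the rendering of the final step after the last simulation finishes."""
    return (t_input + steps * t_sim + t_final_render) / 3600

# 1,800 steps comes out near 154 hours before synchronization overhead.
estimate = pipeline_hours(1800)
```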
Case Study 2
The success of the first parallelization helped provide impetus for reexamining the simulation phase, which was proving to be the performance bottleneck. Improvements in the serial version resulted in a significantly reduced execution time, to 187 seconds per simulated time step. This had a moderate effect on overall performance (now 116 hours for 1,800 time steps), but it also shifted the performance bottleneck to the rendering phase.
Can this phase be improved by parallelization? Also, since the job load on the shared-memory system has become very heavy, it might be desirable to offload as much work as possible to a cluster of workstations connected by a local network.
Step 1: Preconditions
These tests yield the same results as before (although it is now possible to generate long simulations, they still require days or weeks of computing time).
Step 2: Problem architecture
This time, we consider the structure of just the rendering phase. The image is constructed using a technique known as ray casting with trilinear interpolation. Imaginary rays are fired from a hypothetical viewpoint through the data array. Along each ray, a search is performed to find values within the array that correspond to value thresholds that have been defined by the user and associated with particular colors. Values within threshold ranges are transformed to produce graphical effects (color, transparency, reflectance).
The important characteristic of this application is that the rays are computationally independent and could theoretically be calculated simultaneously. However, the number of calculations performed along each ray varies. If a ray finds no values within the range of interest, no calculations whatever are needed. If values are detected, the number of calculations to be performed depends on whether this is the first value within a particular color range, whether other colors have already been detected, and several other factors related to shading and highlighting algorithms.
Overall, these characteristics reveal a loosely synchronous style of parallelism. According to rules 1-4, this application will be difficult to parallelize and requires that the points of CPU interaction be infrequent. Since interaction will be required only at the beginning and end of each ray search, we hope performance gains are possible.
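Because per-ray cost varies so widely, a static division of rays among CPUs can leave some processors idle while others grind through expensive rays. One common remedy (not necessarily what the original programmer did) is self-scheduling: workers repeatedly claim the next chunk of rays from a shared pool, so CPU interaction happens only at chunk boundaries. A minimal sketch, with a placeholder `trace` function standing in for the real ray search:

```python
import queue
import threading

def cast_rays(rays, trace, n_workers=4, chunk=16):
    """Self-scheduling division of rays: each worker repeatedly claims the
    next chunk, so cheap (empty) rays and expensive (many-threshold) rays
    balance out across CPUs."""
    work = queue.Queue()
    for start in range(0, len(rays), chunk):
        work.put(rays[start:start + chunk])

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                batch = work.get_nowait()   # claim the next chunk of rays
            except queue.Empty:
                return                      # pool exhausted: worker retires
            for ray in batch:
                value = trace(ray)          # per-ray cost varies widely
                with lock:
                    results[ray] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Larger chunks mean fewer interactions (important given the cluster's high message cost, discussed next) at the price of coarser load balancing.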
Step 3: Machine
Using rules 5-8, we find that a shared-memory system is again preferable, but that distributed-memory systems, like the workstation cluster, might work as long as there are many computations between CPU interactions.
Step 4: Language
Like many workstation clusters, ours is limited in terms of the languages and libraries supported. Given the fact that the existing application is in Fortran, we choose to use PVM message passing to implement the parallelism.
Step 5: Performance expectations
Timing the baseline version of the rendering phase reveals that 223 seconds are being used: 6 for setup and initialization of arrays, 199 for ray-casting calculations, and 18 for generating the output file. The parallel content is thus 199/223, or about 89 percent.
This time, we apply rule 16 to estimate the message equivalent of our workstation cluster. According to our system support staff, the peak CPU speed of each workstation is approximately 110 Mflops, with latency and bandwidth of about 2,000 microseconds and 2 Mbytes/sec, respectively. This yields a message equivalent of approximately 275,000 flops. Unless a very large number of calculations can be performed between each CPU interaction, we are unlikely to achieve respectable performance.
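The quoted figure can be reproduced by charging each message its latency plus transfer time and asking how many floating-point operations the CPU could have done in that interval. The 1-Kbyte nominal message size below is our assumption (the text gives only the result), but with it the arithmetic matches exactly:

```python
def message_equivalent(flops_per_sec, latency_sec, bandwidth_bytes_per_sec, msg_bytes):
    """Floating-point operations a CPU could have performed in the time
    one message takes to deliver (startup latency plus transfer time)."""
    transfer_sec = msg_bytes / bandwidth_bytes_per_sec
    return flops_per_sec * (latency_sec + transfer_sec)

# 110 Mflops, 2,000-microsecond latency, 2 Mbytes/sec bandwidth,
# and an assumed 1-Kbyte message:
me = message_equivalent(110e6, 2000e-6, 2e6, 1000)
```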
This time, the rules of thumb provide much less positive indication for parallelization. In the real-world case, however, the programmer already had some experience in parallelizing other applications and wanted to see how much performance could be gained through message passing on a workstation cluster. The major programming hurdle was how to minimize CPU interactions. Given the extremely high message equivalent, the programmer had to be creative in handling the division of rays among CPUs. Considerable time and effort were spent debugging and tuning the parallel code. The resulting performance was 71 seconds per image, a significant improvement over the previous time of 223.
Case Study 3
Since the rendering phase had been maintained independent of the simulation itself, it could be used for rendering other types of images as well. The decision was made to see just how much performance could be exacted from the renderer through parallelism.
The precondition tests yield slightly weaker results than the previous analyses. Since the renderer is no longer tied to the simulation, average time-to-results is somewhat faster.
In reanalyzing the problem architecture for the rendering phase, we find that there is an inner producer-consumer relationship: The ray-casting (now carried out in parallel) modifies the data array, which is then passed to a plotting routine to convert the computed colors to RGB values suitable for display on a computer screen. As in the first case study, the one-way flow of data shows this to be a pipeline model. The same rules of thumb are applied, with the same results as before. A shared-memory system is again indicated as preferable, but our distributed-memory system might work, given sufficient computations between CPU interactions. We choose to continue using PVM message passing for implementation.
This time, the entire rendering phase consumes only 71 seconds: 6 for setup and initialization, 47 for ray-casting calculations, 13 for plotting, and 5 for writing the output file. The target for parallelization efforts is very significantly reduced: even counting both ray casting and plotting, the parallel content is only (47 + 13)/71, or about 85 percent, of a much smaller total.
The message equivalent is unchanged. Although there is measurable room for improvement, the potential gain is far below the threshold indicated by rules 10-16.
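An upper bound on what the inner pipeline can deliver follows directly from the stage timings: overlapping plotting with ray casting can hide at most the shorter stage. A quick check (the bound is our arithmetic, not a figure from the text):

```python
def pipeline_saving_bound(stage_seconds):
    """Best-case per-image saving from pipelining adjacent stages:
    the shorter stage can at most be completely hidden behind the longer."""
    return min(stage_seconds)

# Ray casting (47 s) overlapped with plotting (13 s): at most 13 of the
# 71 seconds per image can be hidden, consistent with the "few seconds"
# actually won in practice.
bound = pipeline_saving_bound([47, 13])
```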
Clearly, the rules of thumb indicate that parallelization is not warranted. Since the intent of the real-world case was to push the limits of performance, the programmer proceeded anyway. By pipelining the ray-casting and plotting calculations, it was actually possible to reduce execution time by a few seconds per image; however, the amount of effort required was substantial. Even for an experienced programmer, the investment was inordinate for such a small gain in performance.