One particularly frustrating aspect of parallel programming is that when your program crashes or hangs, it's difficult or impossible to determine how far execution got. System messages are notoriously cryptic about program failures. Even if you've inserted print statements to keep track of progress, that final (and most important!) line may be stuck in a buffer and never appear as output.
All parallel computers and workstations support some kind of corefile mechanism, but this is not a particularly helpful solution. Off-loading the core image of a program executing on hundreds of processors can delay everyone's work -- and may well fill up all available disk space. The corefile information is also too low-level to interpret directly: it takes a debugger or some other tool just to determine at what line the program failed. If the program wasn't compiled with debugging options enabled, it may be impossible to get any useful information. Even if the debugger can tell you the crucial information, you have to invoke, wait for, and interact with a complex tool, all in the interest of acquiring some very basic information about execution!
The Ptools Lightweight Corefile Browser project was formed in response to this need. The goal was to create a tool which quickly and easily provides a high-level view of where the program was when it terminated. Both graphical and command-line versions are available. The tool automatically assimilates the details from tens or hundreds of processes and presents them in a consolidated summary form representing the dynamic call graph of a parallel application.
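As a rough illustration of what "consolidated summary" means here, the sketch below merges per-process call stacks into a single call graph. It is not LCB's implementation: the function name `merge_call_stacks`, the stack representation (a list of routine names, outermost first), and the sample routine names are all assumptions made for this example.

```python
from collections import defaultdict


def merge_call_stacks(stacks):
    """Merge per-process call stacks into one call graph.

    `stacks` maps a process id to its call stack, outermost routine first
    (an assumed representation, not the LCB data model).
    Returns (nodes, edges):
      nodes maps routine -> set of process ids whose stack contains it
      edges maps (caller, callee) -> set of process ids that made the call
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for pid, stack in stacks.items():
        for routine in stack:
            nodes[routine].add(pid)
        # Adjacent stack frames form a caller -> callee edge.
        for caller, callee in zip(stack, stack[1:]):
            edges[(caller, callee)].add(pid)
    return dict(nodes), dict(edges)


# Hypothetical stacks for a four-process run.
stacks = {
    0: ["main", "solve", "exchange"],
    1: ["main", "solve", "exchange"],
    2: ["main", "solve", "reduce"],
    3: ["main", "write_output"],
}

nodes, edges = merge_call_stacks(stacks)
for (caller, callee), pids in sorted(edges.items()):
    print(f"{caller} -> {callee}: {len(pids)} process(es)")
```

Keeping a set of process ids per node and per edge, rather than a bare count, means the same structure can later answer "which processes are here?" as well as "how many?".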
To view the dynamic state of the entire parallel application, the graphical browser is invoked.
Initial LCB display -- Overview Graph
This display shows the current location of the program in the form of a call graph, where each node represents a routine in the call stack of one or more processes. The coloring of the nodes indicates how many processes were active in each routine; black nodes represent routines that were suspended when calls to other routines were made. The routine or routines where program failure occurred are colored red. A highlighted (or white) node is the one currently selected. A message line at the bottom of the display changes as the cursor is moved across the graph, indicating the routine name associated with a node as well as the number of processes executing the node.
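The node states described above can be sketched in a few lines. This is only an illustration under stated assumptions: `classify_nodes`, the state labels, and the input layout are hypothetical, and the mapping from process counts to actual display colors (other than black for suspended and red for failed, as stated above) is left out because the text does not specify it.

```python
from collections import Counter


def classify_nodes(stacks, failed_pids):
    """Return routine -> (state, active_count).

    A process is "active" in the routine at the top of its stack (the last
    list element). States, in order of precedence:
      "failed"    - a failing process was executing the routine (red)
      "active"    - at least one process was executing the routine
      "suspended" - the routine only appears below the top of stacks (black)
    """
    active = Counter(stack[-1] for stack in stacks.values() if stack)
    failed = {stacks[pid][-1] for pid in failed_pids if stacks.get(pid)}

    result = {}
    for stack in stacks.values():
        for routine in stack:
            if routine in failed:
                state = "failed"
            elif active[routine] > 0:
                state = "active"
            else:
                state = "suspended"
            result[routine] = (state, active[routine])
    return result


# Hypothetical data: process 2 is the one that crashed.
example_stacks = {
    0: ["main", "solve", "exchange"],
    1: ["main", "solve", "exchange"],
    2: ["main", "solve", "reduce"],
    3: ["main", "write_output"],
}
states = classify_nodes(example_stacks, failed_pids={2})
for routine, (state, count) in sorted(states.items()):
    print(f"{routine}: {state}, {count} active process(es)")
```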
To view the names of all routines, the view is changed to the "Call Graph" by selecting the appropriate button from the controls at the bottom of the screen. This brings up a more obvious call graph.
LCB Call Graph display
By clicking on a node, the user can bring up a window showing the location of all processes currently executing the node. It is also possible to search the graph for a specific routine by name.
The graphical browser can also be incorporated into other tools -- such as interactive debuggers or performance analysis tools -- where users might want to view a snapshot presentation of parallel program execution. For this mode, we recommend that vendors integrate LCB into their tools such that when the user clicks on a routine name, another window is brought up, displaying the source code associated with that routine.
Although it is called a corefile, the information need not be stored as a file. The browsing tools acquire the corefile data through an abstraction layer, so that the information can be constructed on-the-fly, in addition to being provided via a conventional file.
The file format provides symbolic, high-level information on program state (not the hexadecimal notation that users must struggle with when using typical core files). It is structured as an ASCII file that is readable by humans as well as by analysis programs.
If you are a user, we need your help in reviewing the usability of the browser interfaces (e.g., are the terminology and operations self-explanatory?). The browsers have been tested on all major UNIX workstation platforms (see the Web pages for more information on obtaining a copy). Users are also encouraged to talk with their favorite vendors, encouraging them to have their operating system generate lightweight corefiles when applications crash, or to add an LCB browsing component to their existing debuggers.
If you work for a workstation or parallel computer vendor, we would like to help you implement LCB on your company's platform(s). See the Web pages for contact information.
The final draft of the lightweight corefile format has been completed. An Adopter's Guide is available; it includes several modifications requested by the initial vendors as they began porting LCB to their product lines.
The LCB working group is open to all interested participants. The email reflector for working group discussions is
To subscribe or unsubscribe to the list, send one of the following
- unsubscribe ptools-lcb
Web pages at http://www.ptools.org/index.html