Parallel Tools Consortium Projects

Lightweight Corefile Browser


One particularly frustrating aspect of parallel programming is that when your program crashes or hangs, it's difficult or impossible to determine how far execution got. System messages are notoriously cryptic about program failures. Even if you've inserted print statements to keep track of progress, that final (and most important!) line may be stuck in a buffer and never appear as output.

All parallel computers and workstations support some kind of corefile mechanism, but this is not a particularly helpful solution. Off-loading the core image of a program executing on hundreds of processors can delay everyone's work -- and may well fill up all available disk space. Corefile information is also too low-level to interpret directly: you need a debugger or some other tool just to determine at what line the program failed. If the program wasn't compiled with debugging options enabled, it may be impossible to get any useful information. Even if the debugger can tell you the crucial information, you have to invoke, wait for, and interact with a complex tool. All in the interests of acquiring some very basic information about execution!

The Ptools Lightweight Corefile Browser project was formed in response to this need. The goal was to create a tool which quickly and easily provides a high-level view of where the program was when it terminated. Both graphical and command-line versions are available. The tool automatically assimilates the details from tens or hundreds of processes and presents them in a consolidated summary form representing the dynamic call graph of a parallel application.



What The Lightweight Corefile Browser Does

The Lightweight Corefile Browser project (LCB) provides a mechanism for capturing and representing the dynamic state of a parallel application that potentially involves hundreds of processes. There are three main components to the project: the lightweight corefile format, a command-line browser, and a graphical browser. All components are defined in a flexible way so that parallel computer and workstation vendors can implement them by taking advantage of existing facilities for corefile generation.

How the Browsers Work

LCB is a simple tool with one goal: to provide a high-level view of the dynamic calling structure of the program at the moment it terminated. The command-line browser reads the corefile (or accepts information on-the-fly from the operating system or some other tool), extracts the data concerning the culprit process and the reason it failed, and presents the results in a simple traceback format. This offers a quick-and-dirty way to find out "what happened" to your program.
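As a rough illustration of the command-line browser's job, the sketch below picks out the process that reported a failure and prints its call stack as a traceback. It is not LCB's code: the ProcessSnapshot record, its field names, and the output layout are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProcessSnapshot:
    """Hypothetical per-process record; the field names are assumptions."""
    pid: int
    stack: List[str]               # call stack, outermost routine first
    reason: Optional[str] = None   # failure reason; None if the process did not fail


def format_traceback(snapshots: List[ProcessSnapshot]) -> str:
    """Return a plain-text traceback for every process that reported a failure."""
    lines = []
    for snap in snapshots:
        if snap.reason is None:
            continue  # only culprit processes appear in the traceback
        lines.append(f"process {snap.pid}: {snap.reason}")
        # List the innermost frame first, as conventional tracebacks do.
        for depth, routine in enumerate(reversed(snap.stack)):
            lines.append(f"  #{depth} {routine}")
    return "\n".join(lines)


# Illustrative data: two processes, one of which failed.
snapshots = [
    ProcessSnapshot(0, ["main", "solve", "exchange"]),
    ProcessSnapshot(1, ["main", "solve", "update"], reason="segmentation fault"),
]
print(format_traceback(snapshots))
```

Here only process 1 is reported, with update as the innermost frame and main as the outermost.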

To view the dynamic state of the entire parallel application, the graphical browser is invoked.
[Figure: Initial LCB display -- Overview Graph]

This shows the current location of the program in the form of a call graph, where each node represents a routine in the call stack of one or more processes. The coloring of the nodes indicates how many processes were active in each routine; black nodes represent routines that were suspended when calls to other routines were made. The routine(s) where program failure occurred is colored red. A highlighted (or white) node is the one currently selected. A message line at the bottom of the display changes as the cursor is moved across the graph, indicating the routine name associated with a node as well as the number of processes executing the node.
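The consolidation described above can be sketched as follows: per-process call stacks are merged into one graph, and each routine is tagged with the processes currently executing it and the processes suspended in it. This is a minimal sketch, not LCB's implementation. It assumes nodes are identified by routine name alone, and the function name build_call_graph is invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_call_graph(stacks: Dict[int, List[str]]):
    """Merge per-process call stacks (outermost routine first) into one graph.

    Returns (edges, active, suspended):
      edges     -- set of (caller, callee) pairs seen in any stack
      active    -- routine -> set of process ids executing it (top of stack)
      suspended -- routine -> set of process ids waiting in it on a call
    """
    edges: Set[Tuple[str, str]] = set()
    active = defaultdict(set)
    suspended = defaultdict(set)
    for pid, stack in stacks.items():
        if not stack:
            continue  # nothing recorded for this process
        for caller, callee in zip(stack, stack[1:]):
            edges.add((caller, callee))
            suspended[caller].add(pid)
        active[stack[-1]].add(pid)
    return edges, dict(active), dict(suspended)


# Illustrative data: three processes sharing most of their call path.
stacks = {
    0: ["main", "solve", "exchange"],
    1: ["main", "solve", "exchange"],
    2: ["main", "solve", "update"],
}
edges, active, suspended = build_call_graph(stacks)
for routine in sorted(active):
    print(routine, "active in", len(active[routine]), "process(es)")
```

In this example exchange is active in two processes and update in one, while main and solve are suspended in all three, which corresponds to the colored and black nodes in the display.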

To view the names of all routines, the view is changed to the "Call Graph" by selecting the appropriate button from the controls at the bottom of the screen. This brings up a more conventional call graph, in which every node is labeled with its routine name.
[Figure: LCB Call Graph display]

By clicking on a node, the user can bring up a window showing the location of all processes currently executing the node. It is also possible to search the graph for a specific routine by name.

The graphical browser can also be incorporated into other tools - such as interactive debuggers or performance analysis tools - where users might want to view a snapshot presentation of parallel program execution. For this mode, we recommend that vendors integrate LCB into their tools such that when the user clicks on a routine name, another window is brought up, displaying the source code associated with that routine.

The Lightweight Corefile Format

The lightweight corefile is a platform-independent format designed to contain snapshot information about the current location of a parallel application. The data elements include the current value of the program counter, the contents of invocation stack frames, and a reason code for program failure. It is flexible enough to support a wide variety of platforms, including clusters of workstations as well as parallel computers.

Although it is called a corefile, the information need not be stored as a file. The browsing tools acquire the corefile data through an abstraction layer, so that the information can be constructed on-the-fly, in addition to being provided via a conventional file.
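One way to picture such an abstraction layer is an interface that the browsers call without knowing where the data comes from. The sketch below is an assumption-laden illustration, not LCB's actual API: CorefileSource, FileSource, LiveSource, and the one-line "pid: routine routine ..." text layout are all invented for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple


class CorefileSource(ABC):
    """Hypothetical interface the browsers would read from."""

    @abstractmethod
    def snapshots(self) -> Iterator[Tuple[int, List[str]]]:
        """Yield (process id, call stack) pairs, outermost routine first."""


class FileSource(CorefileSource):
    """Reads a toy text file with lines like '3: main solve update'.

    This layout is a placeholder, not the real lightweight corefile format.
    """

    def __init__(self, path: str):
        self.path = path

    def snapshots(self) -> Iterator[Tuple[int, List[str]]]:
        with open(self.path, encoding="ascii") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                pid, _, routines = line.partition(":")
                yield int(pid), routines.split()


class LiveSource(CorefileSource):
    """Serves data built on-the-fly, e.g. handed over by another tool."""

    def __init__(self, stacks: Dict[int, List[str]]):
        self.stacks = stacks

    def snapshots(self) -> Iterator[Tuple[int, List[str]]]:
        yield from self.stacks.items()


def deepest_routines(source: CorefileSource) -> Dict[int, str]:
    """Browser-side code: works with any source, file-backed or not."""
    return {pid: stack[-1] for pid, stack in source.snapshots() if stack}


print(deepest_routines(LiveSource({0: ["main", "solve"], 1: ["main", "io"]})))
```

Because deepest_routines depends only on the interface, the same browser code runs whether the data was written to disk or constructed in memory.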

The file format provides symbolic, high-level information on program state (not the hexadecimal notation that users must struggle with when using typical core files). It is structured as an ASCII file that is readable by humans as well as by analysis programs.
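To show why a symbolic ASCII layout is easy for both people and programs to read, here is a sketch that parses an invented record carrying the data elements named above: a reason code, a program counter, and stack frames. The keywords (process, reason, pc, frame, end) and the sample values are assumptions for illustration only; the real format is the one defined by the LCB project, not this one.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Invented sample text; NOT the actual lightweight corefile layout.
SAMPLE = """\
process 3
reason SIGSEGV
pc 0x4006f4
frame update solver.c 212
frame solve solver.c 88
frame main main.c 17
end
"""


@dataclass
class Frame:
    routine: str
    source_file: str
    line: int


@dataclass
class Record:
    pid: int
    reason: Optional[str] = None
    pc: Optional[int] = None
    frames: List[Frame] = field(default_factory=list)  # innermost first


def parse(text: str) -> List[Record]:
    """Parse the hypothetical keyword-per-line layout shown in SAMPLE."""
    records: List[Record] = []
    current: Optional[Record] = None
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        key, args = parts[0], parts[1:]
        if key == "process":
            current = Record(pid=int(args[0]))
        elif current is None:
            raise ValueError(f"{key!r} found outside a process record")
        elif key == "reason":
            current.reason = " ".join(args)
        elif key == "pc":
            current.pc = int(args[0], 16)
        elif key == "frame":
            current.frames.append(Frame(args[0], args[1], int(args[2])))
        elif key == "end":
            records.append(current)
            current = None
        else:
            raise ValueError(f"unknown keyword {key!r}")
    return records


for record in parse(SAMPLE):
    top = record.frames[0]
    print(f"process {record.pid} failed ({record.reason}) in "
          f"{top.routine} at {top.source_file}:{top.line}")
```

A line-oriented, keyword-first layout like this one can be inspected in any text editor and parsed with a few lines of code, which is the property the paragraph above describes.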

How You Can Participate

The command-line and graphical browsers are available in the form of royalty-free source code. The assistance of both users and computer vendors is needed to complete this project.

If you are a user, we need your help in reviewing the usability of the browser interfaces (e.g., are the terminology and operations self-explanatory?). The browsers have been tested on all major UNIX workstation platforms (see the Web pages for more information on obtaining a copy). Users are also encouraged to ask their favorite vendors to have their operating systems generate lightweight corefiles when applications crash, or to add an LCB browsing component to their existing debuggers.

If you work for a workstation or parallel computer vendor, we would like to help you implement LCB on your company's platform(s). See the Web pages for contact information.

Current Status

Both command-line and graphical versions of the tool have been implemented. The command-line version is written in ANSI C. The graphical browser is written in GNU C++ and makes use of Tcl and the Tk widgets (all freely available). Both versions have been tested on IBM AIX, HP-UX, Solaris, and SunOS platforms.

The final draft of the lightweight corefile format has been completed. An Adopter's Guide is available; it includes several modifications requested by the initial vendors as they began porting LCB to their product lines.



For More Information

Visit the LCB Web pages at http://www.nero.net/~pancake/ptools/lcb. These provide the most up-to-date information on the LCB project.

The LCB working group is open to all interested participants. The email reflector for working group discussions is ptools-lcb@ptools.org. To subscribe to or unsubscribe from the list, send a request to majordomo@ptools.org.



The Parallel Tools Consortium, ptools@ptools.org

Web pages at http://www.ptools.org/index.html


Last updated February 23, 1996.