Parallel Tools Consortium

Lightweight Corefile Browser Adopter's Guide

Version 2.0

Information for software developers wishing to implement proprietary versions of LCB


On-line flyer describing project from viewpoint of parallel application developers
Information on downloading current version
Paper describing how user-oriented design was applied in this project
General Web pages on the LCB project

Purpose and Organization of Document

This document describes those aspects of LCB design likely to be of interest to commercial software developers who wish to adopt the standard file format, the graphical representation of dynamic program state, and/or the browsers (command-line and graphical). Emphasis is on the features that were designed in response to specific user requests or feedback during the user-centered design process.


Name and Ownership Issues

According to the objectives of the Parallel Tools Consortium (Ptools), the design of LCB and its reference implementation are available for royalty-free adoption and use by any public or private organization. Copyright is retained by the Parallel Tools Consortium and Oregon State University. (Queries concerning its use should be directed to the address at the end of this document.)

The name "Lightweight Corefile Browser" and its abbreviation "LCB" are used as a convenience. Proprietary implementations are not required to use those names (see Ptools Naming Conventions), but aliases should be provided to associate the names lcb and xlcb with the proprietary tools, for the benefit of users familiar with the generic Ptools names.

Adopters are asked to attribute the origin of the design in their support documentation, referring to their work as derivative work based on "the Parallel Tools Consortium's Lightweight Corefile component", where component can be "file format", "graphical representation", "command-line browser", or "graphical browser". Adopters are also requested to inform Ptools of their plans to adopt the design, by sending email to the address at the end of this document. (This will add them to the distribution list for any updates.)

The intent of Ptools projects is that their proprietary implementations by other parties conform to certain consistencies in operation and look-and-feel, so that users familiar with the tool on one vendor platform will be able immediately to recognize and use it on a different platform. Therefore, the Ptools Steering Committee recommends that alterations be confined to "cosmetic" features that might help the tool to fit into a platform-specific programming environment, but do not substantially affect its use.


Design Objectives

The original need for the LCB project was identified at a series of user group meetings and workshops. The same forums established what should be the primary design considerations:
  1. Data on final program location must be generated automatically upon program failure. That is, it must not be necessary to re-execute the program under the control of some monitor in order to determine the location of failure.

  2. Final program location should be reported in terms of the last source locations (routine and line number) of all processes. However, it is understood that those locations may only be approximations, and that not all processes will have terminated at the same moment in time.

  3. The interface must be extremely simple. In particular, the tool should be quick to invoke.

  4. The interface must be intuitive. That is, no user manual should be necessary, even for first-time users.

  5. An ASCII (command-line) interface must be available. Graphical viewing facilities must not be necessary, even if the data revealed by the command-line tool is just a subset of the information available.

These priorities constrained the tool's design in several ways, as outlined below.

The remainder of this document describes the particular features that were identified as critical by users involved in the design process. Adopters should recall that the entire point of Ptools projects is to reflect user needs and preferences. Wherever possible (within the constraints of the execution environment), a proprietary version of LCB should maintain all features that evolved in response to user input.

It is anticipated that some adoptions will uncover additional uses for the LCB project components (file format, graphical representation, browser tools) that were not foreseen and may require substantial modifications to the original design (see "Extensions to the LCB Model"). The LCB working group invites direct interaction with adopters so that such changes can be made in a consistent way and propagated through future adoptions.


Requirements

The Lightweight Corefile Browser includes the following elements:

Additional information will be found in the sections below.


Mechanisms for Generating LCB Information

The primary concern expressed by users was that program location information (PLI) be generated "cheaply" and automatically. They characterized expense in terms of time to offload the information from processors, as well as disk space required for storage.

Users noted that the most critical need for PLI is associated with long-running, large-scale production or quasi-production runs. They emphasized that failure of these runs often occurs during off-hours (middle of the night or weekends/holidays) when no person is monitoring their behavior. No one is available to explicitly activate data collection or to direct the flow of PLI to alternative file systems if the current system is already heavily burdened. Three requirements result from the scenario described above. In later meetings with user representatives (including those involved in the national task force for HPC software requirements), a fourth requirement arose.

The generation of PLI:

  1. must either be standard upon any program failure, or controllable through some runtime environment setting (such as an environment variable or batch system switch).

  2. must be scalable to applications involving potentially thousands of processors.

  3. must result in data whose volume is small enough so that (a) offloading does not tie up the system for long periods of time, and (b) storage does not present unusual demands for disk space.

  4. should also be available through a standard API, allowing the user to explicitly request generation of the data from within an executing application.

In response to the fourth requirement, the working group defined an appropriate calling interface. Although this was not part of the original LCB project, we highly recommend that it be adopted in conjunction with the other components.
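
As an illustration of requirements 1 and 4, the following is a minimal sketch in Python, not the calling interface defined by the working group (which is not reproduced in this guide). The environment variable name LCB_ENABLE, the function names, and the record fields are all hypothetical. The sketch shows a runtime switch that leaves generation on by default, a record built automatically from a failure, and a record built on explicit request from the application.

```python
import os
import traceback

# Hypothetical switch name; this guide does not specify one.
ENV_SWITCH = "LCB_ENABLE"


def pli_enabled(environ=None):
    """Generation is on by default and can be disabled through the environment."""
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_SWITCH, "1").strip().lower()
    return value not in ("0", "no", "off", "false")


def pli_from_exception(process_id, exc):
    """Build one record from the innermost frame of a caught exception."""
    last = traceback.extract_tb(exc.__traceback__)[-1]
    return {
        "process": process_id,
        "routine": last.name,
        "line": last.lineno,
        # A reason message, not just an error code.
        "reason": f"{type(exc).__name__}: {exc}",
    }


def request_pli(process_id, reason):
    """Explicit request: record the source location of the calling routine."""
    caller = traceback.extract_stack(limit=2)[-2]
    return {
        "process": process_id,
        "routine": caller.name,
        "line": caller.lineno,
        "reason": reason,
    }
```

In an actual adoption the automatic path would be triggered by the runtime system on failure rather than by application code; the sketch only illustrates what each path records.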


Type of Data Generated

(a) Data Content: Users were quite explicit about what should and should not be included as part of the PLI. As indicated in the priorities at the beginning of this document, the data must reference source locations, not object-code locations. It must also include a reason message (not just an error code, but the message associated with that code).

Developers in the Working Group articulated several concerns, each of which received direct user response. The following list consolidates those discussions in the form of questions/answers.

(b) Data Format: Users were less concerned with formatting details, although they insisted that the PLI be human-readable (i.e., formatted ASCII). It was immaterial to them whether the PLI is stored that way in the first place, or stored in some non-readable (perhaps platform-specific) way and accessed using a command-line filter. The filter, however, must be provided by the adopter as part of the LCB facility; ideally, it would be applied automatically whenever the "lightweight corefile" is opened for read access.

The standardized format went through several iterations, and is specified elsewhere. Adopters are encouraged to implement PLI generation in whatever form is most efficient for their platforms. It is immaterial to the users whether that form is converted to the standardized format on-the-fly as data is generated, immediately after generation (in a postprocessing step), or upon access to the stored data. All three techniques were implemented as part of the Working Group's efforts.

It should be noted, however, that users applying the API to explicitly generate a "lightweight corefile" will expect to receive the standardized format.
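
The command-line filter mentioned above can be sketched as follows, assuming a platform stores PLI in a compact binary form. Both the binary record layout and the ASCII line produced here are hypothetical and chosen only for illustration; the standardized LCB format is specified elsewhere and is not reproduced in this sketch.

```python
import struct

# Hypothetical fixed-size record: process number, line number,
# routine name (32 bytes), reason message (64 bytes), little-endian.
RECORD = struct.Struct("<II32s64s")


def pack_record(process, line, routine, reason):
    """Store one record in the (hypothetical) platform-specific binary form."""
    return RECORD.pack(process, line, routine.encode("ascii"), reason.encode("ascii"))


def filter_records(blob):
    """Yield one human-readable ASCII line per stored record."""
    for process, line, routine, reason in RECORD.iter_unpack(blob):
        routine = routine.rstrip(b"\0").decode("ascii")
        reason = reason.rstrip(b"\0").decode("ascii")
        yield f"process {process}: {routine}, line {line}: {reason}"
```

Whether such a conversion runs as data is generated, in a postprocessing step, or on read access is left to the adopter, as described above.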


Graphical Representation of Program Location Information

The graphical representation is based on the concept of a call graph, commonly used as a program analysis and documentation aid by technical programmers. The representation evolved from a series of iterative tests with users, so its final form embodies a great deal of very specific feedback (see paper on user-oriented design for more information on its evolution).
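
To make the call-graph idea concrete, here is a minimal sketch of how such a graph might be assembled. It assumes, hypothetically, that each process's PLI supplies its chain of active routines, outermost first; the function name and data layout are illustrative and are not taken from the reference implementation.

```python
from collections import defaultdict


def build_pli_graph(stacks):
    """Merge per-process call chains into one call graph.

    stacks maps a logical process number to its list of routine names,
    outermost first. Returns the caller-to-callee edges and, for each
    routine, the set of processes whose final location is in it.
    """
    edges = set()
    final_location = defaultdict(set)
    for process, chain in stacks.items():
        # Each adjacent pair in the chain is one caller -> callee edge.
        for caller, callee in zip(chain, chain[1:]):
            edges.add((caller, callee))
        if chain:
            final_location[chain[-1]].add(process)
    return edges, dict(final_location)
```

Because identical chains collapse onto the same nodes and edges, the size of the graph depends on the structure of the program rather than on the number of processes.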

Virtually every aspect of the representation was modified in response to user preferences obtained through direct user testing. The features that ended up being of particular importance to the users are described below. Adopters are cautioned not to undermine the effectiveness of the representation by making arbitrary changes to these features.

As subsequent user trials have borne out, this appears to be an exceedingly intuitive way for technical programmers to conceptualize the dynamic state of their applications. Absolutely no training or documentation was needed for users to understand what the representation meant and how to interact with it. Consequently, the implications for use of the representation go well beyond the initial concept for LCB (see "Extensions to the LCB Model" at the end of this document).

As it turned out, most of the design issues that troubled developers in the Working Group - such as "how do we handle multiple executables?" and "what about recursion?" - had simple answers that were completely self-evident to the users. However, we did not exhaustively test every aspect. In particular, the reference implementation intentionally sidesteps color and font issues (under the assumption that proprietary implementations are most likely to want these to conform to product-line appearance standards). What you see are the default colors and fonts of the Tk widgets. With one exception (see "color buckets" bullet), users expressed no particular preferences for color.

The features that reflect explicit user preferences are listed below:

Finally, cursor shape is managed within the graphical representation to indicate that elements are selectable via the mouse. See the next section for more information on this feature.


Interface Functionality

Users want two versions of the browser: The issue of "how much is enough" precipitated some of the most vocal exchanges between the developers and users involved in the project. Users were firm in stating that for many errors, simply knowing the final source location of a faulting node/process/thread was sufficient to enable them to find and correct the error. Adopters should take special note that the extreme simplicity of both browsers was at the explicit request of the users.

Graphical Browser (xlcb): Over the course of the prototype/test/re-design iterations, a number of specific features of the browser were discussed explicitly with users:

Command-line Browser (lcb): A number of possibilities were also suggested for the command-line version, most centering on how to present comprehensive information on multiple processes without deluging the user with output. Users were clear that they wanted to see only a minimal amount of information with this version. Specifically, information from just one failing node/process/thread should be shown. Again, the issue of ambiguities arose. The users clarified that if more than one failed, any of them could be selected for presentation; in the absence of a failure, the best choice for presentation would be the "first" (in terms of logical node/process/thread numbers).
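
The selection rule for the command-line browser can be sketched as follows. The record fields ("process", "failed") are hypothetical. Since users accepted any failing process, this sketch picks the lowest-numbered one simply so that the output is deterministic; that choice is an assumption, not a requirement.

```python
def select_for_display(records):
    """Pick the single record that the command-line browser would show.

    records is a list of dicts with a logical "process" number and a
    "failed" flag. Any failing process is acceptable; with no failure,
    the "first" process by logical number is shown.
    """
    if not records:
        return None
    failed = [r for r in records if r["failed"]]
    candidates = failed or records
    return min(candidates, key=lambda r: r["process"])
```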


Tool Invocation

On a number of occasions throughout the design process, users reiterated the necessity of including a short-form, command-line invocable version that would provide minimal information about program failure. This should generate information to stdout; that way, it could be invoked via dial-in access, or included as part of a batch job "cleanup" script.

One suggestion made by the developers was that a script be used that would invoke the command-line version if there was just one failing process, or the graphical version if there were multiple ones. This idea was rejected by the users.

Another idea was the ability to support multiple lightweight corefiles, by adding a "change file" option to the graphical interface. Users also rejected this, saying that if they were to use the tool on more than one corefile (and this would be unlikely), it would be for comparison purposes so it would make more sense to have multiple instantiations of the tool, one per corefile.

The Working Group strongly recommends that each proprietary implementation include two ways to invoke the tool, command-line (with simple, quick information) and graphical (accessing more complete information).


Extensions to the LCB Model

Over the course of the project, it became obvious that the PLI graph provides a general representation for dynamic program state, whose use extends well beyond post-mortem examination of final program location. One member of the working group, John May (Lawrence Livermore National Laboratory), integrated it into the TotalView interactive debugger as a means of denoting program state at quiescent points in execution, such as at global breakpoints.

The Parallel Tools Consortium encourages use of the graphical representation for purposes other than LCB. At the same time, it should be recalled that cross-platform operational consistency is essential if the representation's intuitiveness is to be maintained.

This section of the adopter's guide, therefore, will be updated from time to time as new uses of LCB components precipitate modifications to the original design embodied in the reference implementation. It is suggested that adopters review it before adding new features to the components. If what you need is not present, please contact the address below to determine if other, in-progress design changes might meet your needs.

At the time of the latest update, the following extensions have been defined:

  1. A standard API allowing the user to explicitly request generation of lightweight corefile data from within an executing application.

  2. Ability of graphical browser to accept input data from a stream that is augmented over the course of the tool session ("data-push mode"). In this mode, the browser is intended to monitor the input stream and update the display when appropriate; supporting it involved modifying the file format to accommodate multiple sets of PLI data.

  3. Ability of graphical browser to prompt some external module for more input data ("data-pull mode"). This mode requires the addition of an "Update" button, whereby the user explicitly requests that the program state information be updated. We recommend that this button be aligned in the same row as the buttons in the reference implementation, but located on the far right of the screen.

  4. Mechanism for displaying source code corresponding to a routine in the graph. One of the original intentions of the Working Group was to include this feature, but we decided that a source code display was a general purpose tool element, not just a feature of LCB (and hence, according to Ptools guidelines, merited a separate project group). This feature can be handled in two ways:

    • Method 1, for use when graphical "augmentations" of the source code (e.g., iconic symbols to the left of individual lines) are available: The new source code window will functionally replace the current "Detail" window (listing the locations of nodes/processes/threads currently active in the routine). The new window will appear at the same point that the Detail window currently does; that is, when a graph node is selected with the pointer. It will show the source code for the corresponding routine - with augmentations to indicate all active lines - and be auto-scrolled to the same line whose number would have appeared as the first entry in the detail window's sorted list. The augmentations will have to indicate not just which locations are active, but which process(es) is at each location. This is the method preferred by the users involved in the Working Group.

    • Method 2, for use when graphical augmentations are not available: The new source code window will be invoked from the "Detail" window, when the user selects (clicks on) any line in the listing. The new window will show the code from the routine named by the detail window, with its position auto-scrolled to the line number corresponding to the entry the user selected.
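
The positioning logic shared by the two source-display methods can be sketched as follows. The record fields and function names are hypothetical, and the sketch assumes the Detail window's list is sorted by line number and then by process number; this guide does not state the sort order.

```python
from collections import defaultdict


def detail_entries(records, routine):
    """Entries the Detail window would list for one routine, in sorted order."""
    return sorted(
        (r["line"], r["process"]) for r in records if r["routine"] == routine
    )


def line_markers(records, routine):
    """Method 1 augmentations: each active line mapped to the processes at it."""
    markers = defaultdict(list)
    for line, process in detail_entries(records, routine):
        markers[line].append(process)
    return dict(markers)


def scroll_target(records, routine, selected_index=0):
    """Line number the source window should auto-scroll to.

    Method 1 uses the first entry of the sorted list (index 0);
    Method 2 passes the index of the entry the user clicked.
    """
    entries = detail_entries(records, routine)
    if not entries:
        return None
    return entries[selected_index][0]
```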


Last updated March 10, 1996.

Comments and queries to Cherri M. Pancake, pancake@cs.orst.edu