Information for software developers wishing to implement proprietary versions of LCB
Name and Ownership Issues
According to the objectives of the Parallel Tools Consortium (Ptools),
the design of LCB and its reference implementation are available for
royalty-free adoption and use by any public or private organization. Copyright
is retained by the Parallel Tools Consortium and Oregon State University.
(Queries concerning its use should be directed to the address at the end
of this document.)
The name "Lightweight Corefile Browser" and its abbreviation "LCB" are used as a convenience. Proprietary implementations are not required to use those names (see Ptools Naming Conventions), but aliases should be provided to associate the names lcb and xlcb with the proprietary tools, for the benefit of users familiar with the generic Ptools names.
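One way to provide such aliases is with symbolic links placed in a directory on the users' search path. The following is a minimal sketch in Python, not part of the Ptools reference implementation. The proprietary tool names (vendor_corebrowse, vendor_xcorebrowse) and the helper install_aliases are hypothetical and stand in for whatever the adopter's product uses.

```python
from pathlib import Path
from typing import Mapping


def install_aliases(bin_dir: Path, aliases: Mapping[str, str]) -> None:
    """Create one symbolic link per generic Ptools name in bin_dir."""
    for generic, proprietary in aliases.items():
        link = bin_dir / generic
        # Replace a stale link, but never overwrite a regular file:
        # symlink_to raises FileExistsError in that case.
        if link.is_symlink():
            link.unlink()
        link.symlink_to(proprietary)


# Hypothetical proprietary tool names, used only for illustration.
ALIASES = {"lcb": "vendor_corebrowse", "xlcb": "vendor_xcorebrowse"}
```

Calling install_aliases(Path("/usr/local/bin"), ALIASES) would then let users type lcb or xlcb and reach the proprietary tools, assuming those tools are installed in the same directory.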
Adopters are asked to attribute the origin of the design in their support documentation, referring to their work as derivative work based on "the Parallel Tools Consortium's Lightweight Corefile component", where component can be "file format", "graphical representation", "command-line browser", or "graphical browser". Adopters are also requested to inform Ptools of their plans to adopt the design by sending email to the address at the end of this document. (This will add them to the distribution list for any updates.)
The intent of Ptools projects is that proprietary implementations by other parties remain consistent in operation and look-and-feel, so that users familiar with the tool on one vendor platform can immediately recognize and use it on a different platform. Therefore, the Ptools Steering Committee recommends that alterations be confined to "cosmetic" features that might help the tool fit into a platform-specific programming environment but do not substantially affect its use.
The remainder of this document describes the particular features that
were identified as critical by users involved in the design process.
Adopters should recall that the entire point of Ptools projects is to
reflect user needs and preferences. Wherever possible (within the
constraints of the execution environment), a proprietary version of LCB
should maintain all features that evolved in response to user input.
It is anticipated that some adoptions will uncover additional uses for
the LCB project components (file format, graphical representation,
browser tools) that were not foreseen and may require substantial
modifications to the original design (see "Extensions to the LCB Model").
The LCB working group invites
direct interaction with adopters so that such changes can be made in a
consistent way and propagated through future adoptions.
In addition, the working group strongly recommends the following features:
Users noted that the most critical need for PLI is associated with
long-running, large-scale production or quasi-production runs. They
emphasized that failure of these runs often occurs during off-hours
(middle of the night or weekends/holidays) when no person is monitoring
their behavior. No one is available to explicitly activate data collection
or to direct the flow of PLI to alternative file systems if the current
system is already heavily burdened. Three requirements result from
the scenario described above. In later meetings with user representatives
(including those involved in the national task force for HPC software
requirements), a fourth requirement arose.
The generation of PLI:
Developers in the Working Group articulated several concerns, each of
which received direct user response. The following list consolidates
those discussions in the form of questions/answers.
The standardized format went through several iterations,
and is specified elsewhere. Adopters are encouraged to implement PLI
generation in whatever form is most efficient for their platforms. It
is immaterial to the users whether that form is converted to the
standardized format on-the-fly as data is generated, immediately after
generation (in a postprocessing step), or upon access to the stored data.
All three techniques were implemented as part of the Working Group's
efforts.
It should be noted, however, that users applying the API to explicitly
generate a "lightweight corefile" will expect to receive the standardized
format.
Virtually every aspect of the
representation was modified in response to user preferences obtained
through direct user testing. The features that ended up being of
particular importance to the users are described below. Adopters
are cautioned not to undermine the effectiveness of
the representation by making arbitrary changes to these features.
As subsequent user trials have borne out, this appears to be an
exceedingly intuitive way for technical programmers to conceptualize the
dynamic state of their applications. Absolutely
no training or documentation was needed for users to understand what
the representation meant and how to interact with it. Consequently,
the implications for use of the representation go well beyond the
initial concept for LCB (see "Extensions to the LCB Model" at the
end of this document).
As it turned out, most of the design issues
that troubled developers in the Working Group - such as "how do we
handle multiple executables?" and "what about recursion?" - had
simple answers that were completely self-evident to the users.
However, we did not exhaustively test every aspect. In particular,
the reference implementation intentionally sidesteps
color and font issues (under the assumption that proprietary
implementations are most likely to want these to conform to
product-line appearance standards). What you see are the default
colors and fonts of the Tk widgets. With one exception (see "color
buckets" bullet), users expressed no particular preferences for
color.
The features that reflect explicit user preferences are listed below:
In tests with user programs, we found that the display area
almost always accommodated a full overview
representation of the application, with no scrolling. In
contrast, the labeled view was generally large enough to
require scrolling.
Graphical Browser (xlcb):
Over the course of the prototype/test/re-design iterations, a number
of specific features of the browser were discussed explicitly with
users:
It should be noted that inclusion of a mechanism for "updating" the
display to reflect changes in the dynamic state of the program (not
needed for post-mortem examination, but important for other
uses of the graph, as mentioned in the last section) was cited
by users as another appropriate button function.
In the reference implementation, our secondary sorting order
is by logical process ID (i.e., the order in which they appear
in the corefile) rather than physical process ID. Users claimed
they did not have a strong preference on secondary order - or
at least one that was platform-independent.
Command-line Browser (lcb):
A number of possibilities were also suggested for the command-line
version, most centering on how to present comprehensive information on
multiple processes without deluging the user with output. Users were
clear that they wanted to see only a minimal amount of information with
this version. Specifically, information from just one failing
node/process/thread should be shown. Again, the issue of ambiguity
arose. The users clarified that if more than one had failed, any one
of them could be selected for presentation; in the absence of a failure,
the best choice for presentation would be the "first" (in terms of
logical node/process/thread numbers).
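The selection rule the users described can be sketched as follows. This is an illustrative sketch in Python, not code from the reference implementation. The ThreadState record and its field names are assumptions made for the example. Because users said any failing entry is acceptable, the sketch picks the logically first failing one only to keep the output deterministic.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ThreadState:
    """Hypothetical summary of one node/process/thread in a corefile."""
    node: int       # logical node number
    process: int    # logical process number
    thread: int     # logical thread number
    routine: str    # simple routine name, without file prefix
    failed: bool    # True if a failure was recorded for this entry


def logical_order(state: ThreadState) -> Tuple[int, int, int]:
    """Sort key: logical node, then process, then thread number."""
    return (state.node, state.process, state.thread)


def select_for_display(states: Sequence[ThreadState]) -> Optional[ThreadState]:
    """Pick the single entry that a command-line browser would report."""
    if not states:
        return None
    failing = [s for s in states if s.failed]
    if failing:
        # Any failing entry would do; take the logically first one.
        return min(failing, key=logical_order)
    # No failure recorded: fall back to the logically "first" entry.
    return min(states, key=logical_order)
```

With three entries of which only the one on logical node 2 has failed, select_for_display returns that entry; with no failures it returns the entry with the lowest logical node/process/thread numbers.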
One suggestion made by the developers was that a script be used that
would invoke the command-line version if there was just one failing
process, or the graphical version if there were multiple ones. This idea
was rejected by the users.
Another idea was the ability to support multiple lightweight corefiles,
by adding a "change file" option to the graphical interface. Users also
rejected this, saying that if they were to use the tool on more than one
corefile (which they considered unlikely), it would be for comparison
purposes, so it would make more sense to have multiple instantiations
of the tool, one per corefile.
The Working Group strongly recommends that each proprietary implementation
include two ways to invoke the tool, command-line (with simple, quick
information) and graphical (accessing more complete information).
The Parallel Tools Consortium encourages use of the graphical representation
for purposes other than LCB. At the same time, it should be recalled that
cross-platform operational consistency is essential if the
representation's intuitiveness is to be maintained.
This section of the adopter's guide, therefore, will be updated from
time to time as new uses of LCB components precipitate modifications to
the original design embodied in the reference implementation. It is
suggested that adopters review it before adding new features to the
components. If what you need is not present, please contact the
address below to determine if other, in-progress design changes might meet
your needs.
At the time of the latest update, the following extensions have been
defined:
Comments and queries to Cherri M. Pancake,
pancake@cs.orst.edu
Design Objectives
The original need for the LCB project was identified at a series of user
group meetings and workshops. The same forums established what should
be the primary design considerations:
These priorities constrained the tool's design in several ways, as
outlined below.
Requirements
The Lightweight Corefile Browser includes the following elements:
Additional information will be found in the sections below.
In addition, the working group recommends that the output conform generally
to the format shown in the command-line browser user
manual.
It is anticipated that proprietary versions will vary the color scheme and
some aspects of appearance (e.g., node shape), but the
representations must resemble the reference version sufficiently
so that users have no trouble adapting from one version to
another.
Mechanisms for Generating LCB Information
The primary concern expressed by users was that program location
information (PLI) be generated "cheaply" and automatically. They
characterized expense in terms of time to offload the information from
processors, as well as disk space required for storage.
In response to the fourth requirement, the working group defined an
appropriate calling interface. Although this was not part of the original LCB
project, we highly recommend that it be adopted in conjunction with the
other components.
Type of Data Generated
(a) Data Content: Users were quite explicit about what should
and should not be included as part of the PLI. As indicated in the
priorities at the beginning of this document, the data must reference
source, not object locations. It must also include a reason message (not
just an error code, but the message associated with that code).
(b) Data Format: Users were less concerned with formatting
details, although they insisted that the PLI be human-readable (i.e.,
formatted ASCII). It was immaterial to them whether the PLI is stored
that way in the first place, or stored in some non-readable (perhaps
platform-specific) way and accessed using a command-line filter. The
filter, however, must be provided by the adopter as part of the LCB
facility; ideally, it would be applied automatically whenever the
"lightweight corefile" is opened for read access.
Users: Do the best mapping you can, using
the memory data available; even an approximate source location -
perhaps a few dozen lines off - is infinitely more useful than
a memory address.
Users: A simple routine name, without file prefix, is better (easier
to read).
Users: A simple routine name is sufficient; developers should not
attempt to deal with duplicate names.
Users: It doesn't really matter, as long as you're
consistent; we can figure it out. If one is significantly easier
than the other, do it.
Users: No; if values are wanted, a debugger
can and should be used.
Users: Just show the error message associated with the first failure you
detect.
Users: Report either one as the (single) failure, or
report both; it doesn't matter.
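The content and format requirements of this section can be illustrated with a small filter that turns a non-readable stored form into formatted ASCII. This is only a sketch under stated assumptions: the binary record layout (RECORD), the code-to-message table (REASONS), and the layout of each output line are all hypothetical. In particular, the output shown here is not the standardized Ptools format, which is specified elsewhere.

```python
import struct
from typing import Iterator, Tuple

# Hypothetical platform-specific record: logical process ID, source line,
# error code, then NUL-padded routine name and source file name.
RECORD = struct.Struct("<iii32s64s")

# Hypothetical code-to-message table; an adopter would substitute the
# platform's own reason messages.
REASONS = {0: "no error", 11: "segmentation violation"}


def decode(raw: bytes) -> Iterator[Tuple[int, int, int, str, str]]:
    """Yield (process, line, code, routine, source file) per stored record."""
    for offset in range(0, len(raw) - RECORD.size + 1, RECORD.size):
        pid, line, code, routine, path = RECORD.unpack_from(raw, offset)
        yield (
            pid,
            line,
            code,
            routine.rstrip(b"\0").decode("ascii"),
            path.rstrip(b"\0").decode("ascii"),
        )


def to_ascii(raw: bytes) -> str:
    """Render stored records as human-readable ASCII, one line per record."""
    lines = []
    for pid, line, code, routine, path in decode(raw):
        # Report the reason message, not just the numeric error code.
        reason = REASONS.get(code, f"unknown error {code}")
        # Reference the source location rather than an object address.
        lines.append(f"process {pid}: {routine} at {path}:{line}: {reason}")
    return "\n".join(lines)
```

A filter of this kind could be run from the command line or applied automatically whenever the stored data is opened for reading; either way, the user only ever sees the readable form.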
Graphical Representation of Program Location Information
The graphical representation is based on the concept
of a call graph, commonly used as a program analysis and
documentation aid by technical programmers. The representation
evolved from a series of iterative tests with users, so its
final form embodies a great deal of very specific feedback (see the
paper on user-oriented design for more information on its evolution).
Finally, cursor shape is managed within the graphical representation
to indicate that elements are selectable via the mouse. See the
next section for more information on this feature.
Interface Functionality
Users want two versions of the browser:
The issue of "how much is enough" precipitated some
of the most vocal exchanges between the developers and users
involved in the project. Users were firm in stating that for
many errors, simply knowing the final source location of a
faulting node/process/thread was sufficient to enable them to find and
correct the error. Adopters should take special note that the
extreme simplicity of both browsers was at the explicit request
of the users.
Furthermore, it should be the zoomed-out (overview) graph, rather
than the labeled (call) graph, so that
the maximum amount of global program information can be seen at the
outset.
Tool Invocation
On a number of occasions throughout the design process, users reiterated
the necessity of including a short-form, command-line invocable version
that would provide minimal information about program failure. This
should generate information to stdout; that way, it could be invoked via
dial-in access, or included as part of a batch job "cleanup" script.
Extensions to the LCB Model
Over the course of the project, it became obvious that the PLI graph
provides a general representation for dynamic program state, whose
use extends well beyond post-mortem examination of final program location.
One member of the working group, John May (Lawrence Livermore National
Laboratory), integrated it into the TotalView interactive debugger as
a means of denoting program state at quiescent points in execution,
such as at global breakpoints.
Last updated March 10, 1996.