PTR Notes from 23 May '95 Ptools Meeting


Summary of the Breakout Session for the Portable Timing Routines Library, 23 May '95, NASA Ames.

About 30 people attended the 1.5-hour session; John Zollweg (Cornell Theory Center) and Baljinder Singh Ghotra (Oregon State U.) took notes. These are coalesced and presented in summary form in this report. Hugh Caffey (Convex) gave an update on the working group's progress to date and set out some tentative Application Programming Interface (API) specifications for the sake of argument.

The Portable Timing Routines (PTR) project was begun as a PTools project in December '94. The main aims are spelled out in the original proposal and justification.

Some of the main characteristics of the routines to be specified and developed in this project are:

  - timers can be used to time any arbitrary interval in a running program
  - timers are usable from Fortran or C
  - timers are minimally intrusive
  - timers are available on virtually all current HPC systems
  - there will be a common API
  - timers return times for the calling process/thread only

Among several things this project will *not* attempt to do are:

  - replace UNIX time_of_day
  - provide portable *code*
  - define or standardize how user CPU and system CPU are measured
  - standardize the accuracy/granularity of timers
  - provide timers that permit comparison between different types of system

There were 3 main aims of this session:

  1. Agree on functionality
  2. Agree on the API
  3. Enlist more help

There are to be 6 main routines: 3 are 'tick-collecting' routines that read the appropriate values of whatever timing addresses or registers are available, with the least possible intrusion (these routines are paired around the code interval to be timed); the other 3 are 'interval-calculating' routines that actually calculate the time in the interval and/or return status values indicating why a timing calculation failed. Three "sorts" of time are to be in the API (if not actually implemented as such; see later discussion): (1) wallclock; (2) user CPU; and (3) system CPU.
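To make the pairing concrete, here is a sketch (in C) of how a wallclock measurement might look. All routine names here (PTRwall, PTRwallsecs) are placeholders invented for illustration; the actual names had not been settled.

    /* Placeholder declarations for the hypothetical PTR routines used below. */
    void PTRwall(long long *t_wall);                       /* collect wallclock ticks  */
    void PTRwallsecs(long long *t_wall1, long long *t_wall2,
                     double *wall_secs, int *wall_stat);   /* convert ticks to seconds */

    void time_some_work(void)
    {
        long long t_wall1, t_wall2;
        double    wall_secs;
        int       wall_stat;

        PTRwall(&t_wall1);                /* tick-collecting call: start of interval */
        /* ... the code being timed ... */
        PTRwall(&t_wall2);                /* tick-collecting call: end of interval   */

        /* interval-calculating call: seconds in (t_wall2 - t_wall1), plus a status code */
        PTRwallsecs(&t_wall1, &t_wall2, &wall_secs, &wall_stat);
    }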

There are several factors of interest in timing that, on any given system, will be constant or very nearly so. These include the rollover frequency of counter(s), the intrusiveness of the timer, the "tick" frequency and the accuracy or reliability of a given measured time. These were proposed to be put into header files, e.g. "ctimer.h" and "ftimer.h", when the library is built on a given system. They would simply be available as run-time constants. Several people noted - correctly - that this would be a problem on systems that are binary compatible but have quite different characteristics (e.g. clock rate), as happens within processor families such as the RS/6000, H-P workstations, etc. The alternative proposed (and tentatively adopted) is to provide a timer initialization routine that would "somehow" query the system at runtime to discover the appropriate values, e.g. tick frequency.
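For illustration, such a header might have carried entries like the following; the macro names and the numbers are invented for this sketch, not proposed values.

    /* ctimer.h (illustrative only; names and values are placeholders) */
    #define PTR_WALL_TICK_FREQ  1.0e6         /* nominal wallclock tick frequency, Hz  */
    #define PTR_WALL_ROLLOVER   4294.967296   /* seconds between counter rollovers     */
    #define PTR_WALL_OVERHEAD   2.0e-6        /* approximate cost of one timer call, s */
    #define PTR_WALL_ACCURACY   1.0e-6        /* estimated accuracy of a measured time */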

A related issue involves small (or possibly not-so-small) departures from vendor-supplied nominal values of the system-specific "constants" such as tick frequency, accuracy, rollover frequency, etc. Apparently there really are such departures, and there are users who might want to use empirically measured values of these quantities rather than the nominal ones. John May (LLNL) says that it is possible to write calibration programs or routines that can - optionally - provide these empirical values. The tentative way to deal with these system-specific values is to include in the PTR library an initialization routine that the user calls at the beginning of program execution. This routine would return the nominal, system-specific values of tick frequency, rollover frequency, etc. Optionally (e.g. by setting an input flag), the initialization routine would instead generate calibrated system-specific values. It might look like this:
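Rendered as a C prototype, and with a placeholder routine name (PTRinit is invented here; no name had been agreed), that might be:

    void PTRinit(int   *flag,        /* logical: calibrate (nonzero) or use nominal values */
                 float *run_secs);   /* upper limit, in seconds, on any calibration run    */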

where: flag is logical and determines whether the routine will return nominal values or empirical, i.e. calibrated, values; and run_secs is floating-point (32-bit?) and specifies an upper time limit on how long the calibration - if any - is to run; if run_secs is <= 0.0, nominal values are returned, i.e. no calibration is done.

The timer templates originally proposed were:
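In rough C terms, and with a placeholder name (the originally proposed names contained a '$' character, which is what the next paragraph refers to), the shape of such a template was:

    void PTRsecs(long long *t_var1,    /* tick value at the start of the interval */
                 long long *t_var2,    /* tick value at the end of the interval   */
                 double    *secs,      /* (t_var2 - t_var1), expressed in seconds */
                 int       *status);   /* return code, e.g. a rollover count      */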

where t_var* are 64-bit inputs to the routine; secs is the 64-bit floating-point output and is the no. of seconds in the interval (t_var2 - t_var1); status is an integer return code providing information about the success of the call; in particular, it could be the no. of rollovers that had occurred in the interval or simply whether or not the calculated interval can be relied upon.

Bill Tuel (IBM Kingston) pointed out that '$' mightn't be widely available and understood outside North America. There was general agreement to drop it.

A couple of people asked why we bothered to have 3 separate routines instead of just 1 for collecting/converting timing. The reason was that wall, user CPU and system CPU are apt to have quite different resolution, accuracy, overhead, etc. Another reason is that, in many cases, the user may really only want 1 of the 3 sorts of "time" and would wish to avoid the possibly greater intrusiveness incurred by getting 3 values instead of only the 1 s/he wanted.

There was much discussion of whether these should be implemented as subroutines or functions. The prevailing opinion seemed to be that they should be subroutines, for two reasons: (1) if they're functions, there's a need to store at least one value (the status code), which is nominally intrusive; and (2) if they're functions, confusion could result from the interval-calculating routines' returning the calculated time in an argument but the status code as a function result to be stored. For consistency, both the 'tick-collecting' routines and the 'interval-calculating' routines should be implemented as subroutines with all arguments passed by reference.
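The contrast, again in C terms with placeholder names, is roughly:

    /* Function form: the calculated time comes back in an argument, but the  */
    /* status code is the function result and must be stored somewhere.       */
    int  PTRwallsecs_f(long long *t_wall1, long long *t_wall2, double *wall_secs);

    /* Subroutine form (a void function in C): both values come back in arguments. */
    void PTRwallsecs(long long *t_wall1, long long *t_wall2,
                     double *wall_secs, int *wall_stat);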

In the subroutines vs. functions debate, someone noted that if these were functions and could be in-lined, the intrusiveness would be much less. But someone else pointed out that such in-lined code would be subject to code movement and other compiler shenanigans; in this case, while there might be *less* intrusiveness, it would also be a good deal more *variable*, depending on exactly where timers were put in the code, etc. The grumbling consensus was that users would be happier with a relatively fixed, if higher, overhead than with a relatively variable, if lower, overhead.

Cherri Pancake (Oregon State U.) pointed out that the current batch of templates was arrived at largely with Fortran in mind but that C-usable routines should be relatively straightforward.

Rod Oldehoeft (Tera) asked what - if anything - the working group intended to recommend regarding compiler optimization, especially code movement, and its effect on what is actually timed. Unfortunately, the answer seems to be 'What we've done with this problem all along, viz. ignored it or tried to second-guess and trick the compilers'. Rod noted the volatile qualifier in C and that it might be useful in this context. Anyone who can come up with a way for users to time what they think they're timing in a heavily optimized code, please speak up.
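One way volatile can help, sketched here with the placeholder PTRwall calls from the earlier example: declaring the data that the timed work touches as volatile discourages the compiler from deleting the work or moving it out from between the timer calls.

    void time_loop(void)
    {
        volatile double x = 0.0;     /* volatile: the compiler must keep the updates to x */
        long long t_wall1, t_wall2;
        int i;

        PTRwall(&t_wall1);           /* start of the timed interval */
        for (i = 0; i < 1000000; i++)
            x += 1.0;                /* the work being timed        */
        PTRwall(&t_wall2);           /* end of the timed interval   */
    }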

Cherri Pancake explained that one role of members of the PTR working group is as a buffer between users and developers. In this role, the PTools Consortium can take full responsibility for specification, distribution and some maintenance of the PTR library. What's expected from implementors in the various vendors' development shops is that they provide the best low-level hooks or routines to get at the information required in the specification. Moreover, they're also expected to provide their best information on tick frequency, overhead costs, rollover frequency and accuracy. They may do this and provide full, ongoing support for their implementations of the library, or they may provide code and information as a one-off event with no continued maintenance (at the possible cost of a later loss of PTR functionality on their systems). At any rate, implementors won't be exposed to end users other than at their own instigation.

Hugh Caffey noted that there is nothing parallel *per se* about these timing routines or their functionality. He added that the current project seeks only to specify and implement a few basic routines that can be used *as is* and possibly later be used by others (including subsequent PTools projects) as the basis for more ambitious performance measurement. For example, the user CPU times recorded for each of several parallel processes or threads could be combined to yield total CPU time or CPU time for some sub-tree of parent/child processes. Given such considerations, these routines should follow the general UNIX model of being small, largely autonomous building blocks usable later for other purposes not necessarily anticipated by the original project personnel, i.e. they should have some modularity.

Mary Zosel (LLNL) suggested that since implementors of these routines would be providing source code, they should include their own comments regarding any implementation details, resolution, accuracy, caveats, etc. She added that these comments should be read-only by marketing/sales types. These are good ideas and should be passed along to implementors.

Chuck Leith (LLNL) reminded us that, while a standard API and well implemented code underneath are fine things in themselves, they're not much use if the vendors don't make accurate, high-resolution clocks available to these routines. He's right, of course. Anyone with any say in the matter of hardware and/or low-level OS support for good clocks should speak up.

A couple of people asked about [wallclock] timestamping and, related to this, the issues of timer 'drift', skew and "tachyons" (a term due to Al Geist or Adam Beguelin, as far as I know). The answer is that several trace analysis tools already deal (or don't) with these issues and that the PTR library will stick to the simpler functionality already on the table.

Current - but still tentative - templates for the timing routines:
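First, the three tick-collecting routines, shown here as C prototypes with placeholder names (PTRwall, PTRucpu, PTRscpu are invented for illustration):

    void PTRwall(long long *t_wall);    /* collect wallclock ticks  */
    void PTRucpu(long long *t_user);    /* collect user-CPU ticks   */
    void PTRscpu(long long *t_sys);     /* collect system-CPU ticks */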

where t_wall, t_user, t_sys are 64-bit values (which could be integer or floating-point) returned from the routines
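Second, the three interval-calculating routines, again as C prototypes with placeholder names:

    void PTRwallsecs(long long *t_wall1, long long *t_wall2,
                     double *wall_secs, int *wall_stat);
    void PTRucpusecs(long long *t_user1, long long *t_user2,
                     double *user_secs, int *user_stat);
    void PTRscpusecs(long long *t_sys1,  long long *t_sys2,
                     double *sys_secs,  int *sys_stat);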

where t_wall*, t_user*, t_sys* are 64-bit values (integer or floating-point) supplied as input to the routines; *_secs are 64-bit floating-point values returned by the routines; and *_stat are integers returned by the routines. The status codes return the number of rollovers that occurred in the interval (t2 - t1). (N.B.: The availability of this rollover information varies among systems: some provide no information whatever; others indicate only whether at least one rollover has occurred; and still others report the actual number of rollovers.) It was generally agreed that on systems where the number of rollovers occurring in an interval is actually available, it should be used to calculate the interval (*_secs) correctly. On systems where the rollover count is not available, a negative interval (*_secs) should be returned and *_stat should be set to ROLLOVER_DETECTED (an integer). The user could then simply ignore the returned time.
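A sketch of that rollover handling for the wallclock case, on a system where the rollover count is *not* available; the constants and names here are placeholders for system-specific values, not proposed ones:

    #define TICKS_PER_SEC      1.0e6    /* nominal tick frequency (placeholder)     */
    #define ROLLOVER_DETECTED  (-1)     /* status: rollover occurred, count unknown */

    void PTRwallsecs(long long *t_wall1, long long *t_wall2,
                     double *wall_secs, int *wall_stat)
    {
        long long diff = *t_wall2 - *t_wall1;

        if (diff >= 0) {                          /* no wrap observed: trust the value      */
            *wall_secs = (double)diff / TICKS_PER_SEC;
            *wall_stat = 0;
        } else {                                  /* the counter wrapped at least once      */
            *wall_secs = -1.0;                    /* negative interval: do not use the time */
            *wall_stat = ROLLOVER_DETECTED;
        }
    }

    /* On a system that does report the rollover count n, the else branch would */
    /* instead add n times the counter range to diff and return *wall_stat = n. */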

Systems we decided to drop from the list to be tested:

New systems and volunteers for testing PTR:

Systems no one has signed up to test PTR on yet:

Outstanding Issues:

Addendum

Alan Karp (H.-P. Labs) was unable to attend this breakout session. He has a method of interval timing that avoids the rollover issue and also provides both high resolution (where needed) and "reach", i.e. timing of long intervals (where needed). Unfortunately, he only does this for wallclock time, preferring to avoid the relatively larger intrusiveness of gathering CPU information. Richard Frost (SDSC) talked about this with Alan and says he'll attempt to come up with a scheme that incorporates some of these methods; he also said he'd like to be the technical lead for the working group.

After the breakout session, Rusty Lusk and Bill Gropp (both ANL) stressed the variability among systems in their ability to distinguish user CPU from system CPU. On some systems, this can be done reasonably reliably; on others, it's impossible; on still others, it's a mess. They also noted an orthogonal issue here, namely that whether or not they can properly distinguish user from system CPU, some vendors - for reasons best known to themselves - would prefer not to divulge this information to users.

Users of all HPC systems - but especially parallel ones - are *obsessed* with performance and its measurement. They very badly want reliable timing information and spend a good deal of time being frustrated at the lack of it on each new system they port to. As difficult as it undoubtedly is to provide such information, e.g. user vs. system CPU time, users want it badly enough that they wind up cobbling together some fairly awful and unportable timers that don't work very well and that do a much worse job of getting the desired information than anything professional developers could come up with. Developers: users don't want absolute, rigorous perfection in these routines. They just want routines that: (1) are better than what they can do themselves; (2) have a common interface among systems; and (3) have reasonably well estimated performance characteristics, e.g. intrusiveness and accuracy.


Portable Timing Routines home page
Parallel Tools at OSU home page
Parallel Tools Consortium home page

For further information, contact kennino@cs.orst.edu.