It
seems to me that three over-arching notions appear in the use case doc you sent
me:
Let’s
focus first on measuring performance, since that's the principal focus of this
project. Let’s assume that we measure
speed with some sort of “timer” and spatial usage with some sort of disk and
memory “rulers” and that we measure all aspects of performance in every test.
Then
I claim that all performance tests follow one of two basic patterns. The basic pattern is:
define the source of a data transfer
define the destination of a data transfer
define the transfer process
start performance measurement
H5Dwrite() or H5Dread()
stop performance measurement
cleanup
Suppose that we have a device, an iterator, that
synchronizes stepping through the source and destination dataspaces. Then the partial i/o pattern is:
for (iterator_start; !iterator_is_done; iterator_next)
{
    basic pattern
}
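To make this concrete, here's a minimal C sketch of the basic (single-iteration) pattern, timed with MPI_Wtime(). The file name, dataset shape, and datatype are just illustrative choices, and the sketch assumes the HDF5 1.8+ API and that MPI has already been initialized.

#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

/* One instance of the basic pattern: time a single, full H5Dwrite(). */
double timed_write_example(void)
{
    hsize_t dims[1] = {1024 * 1024};
    int    *buf     = malloc(dims[0] * sizeof(int));    /* source: memory buffer (contents don't matter here) */

    /* destination: a file, a dataspace, and a dataset */
    hid_t file  = H5Fcreate("perf.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    double t0 = MPI_Wtime();                             /* start performance measurement */
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    double t1 = MPI_Wtime();                             /* stop performance measurement */

    /* cleanup */
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    free(buf);

    return t1 - t0;                                      /* elapsed seconds */
}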
Note
that the basic pattern is a special case of the iterated pattern – a single
iteration. The definitions of the
source, destination, and process of data transfer cover the definitions of the
datatype transferred, the dataspaces in memory and on disk, the selections on
these dataspaces, any sequence of filters applied to the data in transit, any
special properties of the source, destination, and transfer, the number and topology
of processors participating in the transfer, etc. Here’s a list of all the “actors” defining
the source, destination, and transfer:
Actor | Representation | Implementation
memory buffer | contiguous array of bytes | wrapper around malloc()
memory dataspace | hid_t | HDF5 library
memory selection | private to HDF5 library | HDF5 library
memory datatype | hid_t | HDF5 library
file | hid_t | HDF5 library
dataset | hid_t | HDF5 library
disk dataspace | hid_t | HDF5 library
disk selection | private to HDF5 library | HDF5 library
disk datatype | hid_t | HDF5 library
file creation properties | hid_t | HDF5 library
dataset creation properties | hid_t | HDF5 library
process set and topology | MPI_Comm | MPI library
filter | private to HDF5 library | HDF5 library (or user code?)
transfer properties | hid_t | HDF5 library
timer | double | MPI_Wtime() or getrusage() system call
memory ruler | double | ?
disk ruler | double | stat() system call
Any
HDF5 performance test can be constructed by selecting one particular implementation
of each of the actors and one particular implementation of the iterator and somehow
sticking these choices together to form an executable. The number of different tests one can
construct with this assembly process is staggering. As a crude illustration, let’s say we have n different actors, each of which comes
in m varieties.  Then there are m^n different ways we can combine the actors.  Just to play with numbers, and not taking the
numbers too seriously, note that 18 different actors are listed above. If each has 2 different implementations,
i.e., 2 different kinds of dataspaces, 2 different kinds of selections on
dataspaces, 2 different kinds of datatypes, etc., then there are 2^18, or about a quarter of a million, different kinds of tests we can construct.
The
second claim I make is that our test suite should be capable of instantiating
any possible test. We clearly won’t
instantiate all possible tests – there are simply too many – but we should not
build a test suite in which some kinds of tests can’t be performed. Experience tells us that the test we can’t
perform will be precisely the test we need to perform.
There
may be m^n possible tests,
but fortunately there are only m*n actors
that we might have to implement. To
continue playing with example numbers, the quarter million possible tests we
calculated above can all be constructed out of 2*18 = 36 different actors. 36 is a very manageable number. This result suggests that every test in our
test suite be built by combining pre-existing components into distinct
arrangements. The task of building a
test suite is really a task of building machinery to make new combinations of
existing components.
Now
let’s see if we can exploit any structure that might exist in the set of all
possible tests, with the hope that such exploitation would make it easier to
construct specific tests. But first,
let’s illustrate the structure with an example.
Suppose that out of the set of all possible tests we focus on tests
constructed out of components a, b, and c. a might be a particular kind of dataspace, b might be a particular datatype, and c might be a particular kind of filter. (But a,
b, and c can be any particular implementation of any of the actors we
listed above.) Then the set of all tests
that use component a is some subset
of all possible tests, and likewise with components b and c. In the diagram below, sets and subsets are
indicated with boxes. The labels
indicate the contents of the set. An
arrow from one box to another indicates that the first set is a subset of the
second. So the tests using components a, b,
and c form a subset of the tests
using components a and b, and that set is in turn a subset of those tests using component a.  What is notable about this diagram is that it
is precisely an inheritance diagram. If
we were to implement the test suite in an object-oriented manner using
inheritance, and if we drew its inheritance diagram, we’d get exactly the same
diagram. In general, the set of all
possible HDF5 tests has the structure of an inheritance hierarchy.
From
this result, we conclude that the natural way to implement the test suite is
with an object-oriented language using inheritance.
But
let’s examine various ways that the test suite could be constructed. I’ll start with two extreme cases that will
help narrow the focus. The two extreme
cases are:
At one extreme, I picture a big piece of code which takes
as input a “test specification language” – something that could describe, for
example, that one particular test uses chunked datasets of a given size, and
2-d dataspaces of which only the first dimension is fixed, and does partial
i/o, uses 8 processors, data compression, and a user defined datatype
containing a mixture of ints, doubles, and chars.
Clearly, a “test specification language” to
describe anything that HDF5 can do would be exceedingly complex, and the
machinery to parse it and handle all the myriad cases would be more
complex. Apart from the difficulty of
constructing such a beast, we’d need a test suite to test the tester.
At the other extreme, we'd write a precisely focused
piece of code for each test variation – something that does only what is
required for a specific test case, and no more.
This approach has the virtues of simplicity and understandability. It’s also compatible with the conclusion
reached above that we should construct tests out of pre-existing components.
So,
now we’ve argued ourselves into the position that each test should be a small
piece of software that is focused on a single situation, that we should build
tests out of pre-existing components (implementations of the actors tabulated
above), and that using inheritance is a natural and powerful way to construct
new test cases from old ones.
How
might we construct such test suites in C++, C, and F90?
Implementing the test suite in C++ is easy and
natural because the language is designed specifically to express the
hierarchies of similarities and differences inherent in the proposed test
suite. One could start with an abstract
use case class (a “root” class) that defines but does not implement what every
use case should do. The root class would
be equipped with an interface enabling it to fit into an automated test
harness. Each test in the test suite
would be a concrete descendant of the root class, i.e., a descendant that
implements the entire abstract interface introduced in the root class. A layered inheritance hierarchy allows a gradual
introduction of various specializations: complicated user defined datatypes,
different kinds of dataspaces, extendible rather than fixed datasets, different
patterns of partial i/o, etc. Gradual
layering provides abundant opportunities for specialization and makes it easy
to express any specific specialization with very little code. We used this technique extensively and
effectively at LPS.
I can think of several ways of constructing the
test suite in C:
C is a procedural language, not an object-oriented
language, so expressing the object-oriented notions of inheritance and
polymorphism is not a natural thing to do in C.
But it can be done. http://www.embedded.com/97/fe29712.htm and http://ldeniau.home.cern.ch/ldeniau/html/oopc/oopc.html
give similar detailed accounts of the techniques
involved.
Briefly, one uses a struct to encapsulate the data
representation of a class. A descendant
class’s struct
always installs the immediate ancestor’s struct as its first data member. Then the construct
((ancestor*)&descendant)-> properly accesses the ancestor’s struct data
members. Methods that are meant to be
redefined in descendant classes are referred to by a function pointer stored as
class data. Descendant classes can
initialize this pointer to a different function from the ancestor class,
thereby specializing ancestor class behavior.
Essentially, one can do in C what is done in C++ if
one is willing to do by hand tasks that the C++ compiler does automatically,
and if one is willing to forgo C++ compiler benefits such as type checking.
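Here's a minimal sketch of the technique; the test_case and timed_write_test names are hypothetical and exist only to show the mechanics.

#include <stdio.h>

/* "Root class": every test has a run() method, stored as a function pointer. */
typedef struct test_case {
    const char *name;
    void (*run)(struct test_case *self);           /* method meant to be redefined */
} test_case;

/* "Descendant class": embeds the ancestor struct as its first member, so a    */
/* pointer to a timed_write_test can be safely treated as a test_case pointer. */
typedef struct timed_write_test {
    test_case base;                                /* ancestor must come first */
    size_t    n_elements;                          /* specialization data */
} timed_write_test;

static void timed_write_run(test_case *self)
{
    timed_write_test *t = (timed_write_test *)self;   /* recover the descendant */
    printf("running %s with %zu elements\n", self->name, t->n_elements);
    /* ... set up actors, call H5Dwrite(), measure, clean up ... */
}

int main(void)
{
    timed_write_test t = { { "timed_write", timed_write_run }, 1024 };
    test_case *tc = (test_case *)&t;               /* treat the descendant as an ancestor */
    tc->run(tc);                                   /* polymorphic dispatch via the function pointer */
    return 0;
}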
One could just program each test case individually,
copying similar bits and pieces from other tests and tweaking these bits and
pieces so that they work together.
This is certainly a simple technique, and has the
benefit that all the code for an example is in one place, where it can be
comprehended as a whole. Since we
advocate using a minimalist coding style – one test focusing exclusively on one
use case – each test should be easy to understand as well. However, experience warns that writing lots
of customized code is error prone and hard to maintain.
A variation on “copy-and-tweak” is to #include
existing source code into a new test case rather than copying it with an
editor. One can imagine writing a number
of C functions that create numerous variations on the actors tabulated above,
and C functions that specify various kinds of iterators and behavior
patterns. Then a specific test could be
constructed by #including one’s choice of actors and behaviors. This approach requires that the C functions
conform to a pattern so that no editing is required after inclusion. But that’s just good software engineering
practice.
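A sketch of what one such assembled test might look like; the fragment file names, and the functions they are assumed to define, are hypothetical.

/* test_contig_1d_write.c: a test assembled entirely from #included fragments. */
#include <hdf5.h>

#include "actor_buffer_contiguous.c"    /* assumed to define: void  *make_buffer(void)             */
#include "actor_dataspace_1d_fixed.c"   /* assumed to define: hid_t  make_dataspace(void)           */
#include "behavior_full_write.c"        /* assumed to define: int    run_io(void *buf, hid_t space) */

int main(void)
{
    void *buf   = make_buffer();
    hid_t space = make_dataspace();
    return run_io(buf, space);          /* reports results and returns a status code */
}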
The source library approach makes it easier to
maintain code than copy-and-tweak, because there’s only one copy of each bit of
code. Fixes to this one copy propagate
automatically to all clients. One
drawback to this approach is that there may be quite a number of these bits of
source code; one needs some scheme for browsing them to find one that fits the
current need. Another drawback is more
subtle. Suppose you wanted to use a
certain dataspace definition as both the source and destination
dataspaces. Would one #include it twice,
once for the source and once for the destination? Wouldn’t that result in a redefinition
compilation error? One could solve this
problem by having separate source libraries for source and destination
dataspaces (and datatypes, and selections, and …) with different interfaces and
#including the common definition of this dataspace into both the source and
destination versions. This “solution” sounds
clumsy; one worries about what other unidentified problems are lurking in this
approach.
One can imagine building a pre-compiled object
library of useful actors and behaviors, then writing short programs that link
code from this library. At first glance
this approach seems to be an obvious improvement on the “source library”
idea. However, there’s a substantial
difficulty in building such a library: name clashes. We’d like to be able to program behavioral
patterns once, and have these patterns call test-specific actions.  In object-oriented terminology, we want polymorphism.  An object-oriented language like C++ supports multiple implementations of a named feature by uniquely mangling each name when it produces object code, so that the linker can distinguish them.  So, to support polymorphism in a C object
library, we’d have to come up with a name mangling scheme. Note that the “source library” name
redefinition problem is just the name mangling problem arising in a different
way.
For the purposes of this project, it appears that
F90 and C allow more or less the same implementation approaches:
I found two good papers that discuss object
oriented programming in F90: “How to Support Inheritance and Run-Time Polymorphism
in Fortran 90”, and “How to Express C++ Concepts in Fortran 90”, both by Decyk,
Norton, and Szymanski, and available at http://www.cs.rpi.edu/~szymansk/oof90.html.
I won’t try to describe the techniques because I’m
not fluent in F90. See the papers cited
for the details. However, the F90
techniques give the impression of being something like the C techniques for
emulating C++. The good news for F90 is
that inheritance and polymorphism is possible with about the same effort as in
C. The bad news is the same as it is in
C: one has to do by hand tasks that are done automatically by an
object-oriented compiler.
Copy-and-tweak is the same technique as discussed above for C, with the same pros and cons.
F90 appears to have an include directive, so the source library approach is the same technique as discussed above for C, with the same pros and cons.
Without knowing F90 better than I do, I suspect that
the name mangling/demangling issue identified in the C object library discussion also arises
in F90.
Language Summary
A test suite has an inherent hierarchical structure. C++ can express this hierarchy in a simple
and natural way using its native language constructs. C and F90 can also express this hierarchy,
but awkwardly, with added programmer discipline and effort, using the
techniques described in the references cited above. If one wants to use C or F90 in a more
natural way, then the copy-and-tweak method is probably the most
straightforward approach to take.
So
I recommend using C++.
Arguments
against C++
I’ve
heard three arguments against C++, which I'll try to address: the learning curve, portability and maintenance concerns, and the desire for test code to serve purposes beyond performance measurement.
On the learning curve: from personal experience, I think this is a
non-issue. When I started with LPS, I
had no C++ experience at all. Others at
LPS had at most a cursory knowledge of C++; no one had written anything much
bigger than “hello, world”. Within a
week or two we were all reasonably fluent in basic C++. We all quickly grew to prefer it to C, since
it is so much better at allowing one to package up and reuse understandable
pieces of code. I think that a team of
developers like the HDF5 staff, who know C, F90, XML, autoconf, html, Java,
MPI, all sorts of Unix mini-languages, Macs, Windows, etc. would have little
difficulty in picking up the basics of something that, after all, is a lot like
C.
On portability and maintenance: I've programmed on a Linux platform using g++ and a
commercial compiler that has since become the Intel compiler. I don’t have direct experience with porting
to other platforms. As I recall, the LPS
code was ported to various LLNL and Sandia ASCI machines with little
effort. The biggest problem was telling
the compiler where to find C++’s Standard Template Library (STL) header files,
which tended to be located in various different places on different platforms
and named in slightly different ways, and where to find our own class header
files, which were scattered among a number of directories. Autoconf solved those problems. We coded in ANSI standard C++, which seemed
to be quite portable. I am not aware of
us encountering any of those irritating problems where legal code is rejected
or miscompiled on certain platforms. At
the time we were developing, g++ was undergoing a major revision from its own
dialect of C++ to ANSI standard C++.
Even this caused us little more difficulty than being fastidious about
using std:: to specify the proper namespace for standard C++ library operators. In my experience, C++ requires about the same
maintenance effort as other languages.
The Mozilla development team's recommendations
about C++ portability are found at http://www.mozilla.org/hacking/portable-cpp.html. The only recommendation that sounds
burdensome to me is the fifth in their list: not using C++ standard library
features because including the associated header files is nonportable among
compilers that support different C++ standards.
That’s the problem I mentioned above.
We solved this problem at LPS with autoconf and wrappers.
On the third argument: test suite code could serve other purposes besides just measuring HDF5 library performance, for example, by providing real examples of how to use the library that users can read and adapt.
Such purposes are best served by a test
suite written in a language known well by both the user and the HDF5
development team. C++ isn’t that
language, but C and F90 are.
If we insist that the test suite does double duty, both
measuring performance and serving the purposes listed above, then the test
suite can only be constructed in one way.
We can’t use C++, and we can’t use C or F90 in C++ emulation style,
since those styles are somewhat unnatural for these procedural languages, and
users won’t easily understand examples written this way. The only approach we’re left with is
“copy-and-tweak”. And we have to have
two test suites, one written in C and a duplicate one written in F90. The C users aren’t going to want to look at
F90 code, and the F90 folks won’t want to look at C code.
So neither C++ nor C/F90 do everything we want to
do in a satisfying way. The way out of
this dilemma might be to maintain multiple test suites, one in each
language. Three test suites sounds like
a really bad idea, but it might work better than it sounds:
The C++ test suite would be the largest and most
complete because C++ lends itself better than C/F90 to writing many test cases
easily. The C++ test suite would be
populated systematically with test cases because it would be easy to do so. The C++ test suite would be used principally by
HDF5 library developers interested in maintaining or improving performance. And it could be used when HDF5 is installed
to report the performance of that installation.
The C and F90 test suites would be populated on an
ad hoc basis. Whenever the HDF5
development team engaged in a discussion with a client about some use case,
that use case would be located or coded in the C++ suite. That C++ test would be translated to C or F90
– whichever language appealed to the client – and installed in the appropriate
C or F90 suite. The translation would be
done in a procedural style natural to C or F90, and not in a C++ emulation
style. The translation to C need not be
very complicated – just copy methods out of the C++ inheritance hierarchy, and
edit a bit to glue the methods together.
The C and F90 suites would grow slowly and unevenly, but would consist
of real use cases stripped down to their HDF5 essence. They’d have the same sort of role as code
found in contrib/ directories in many source code distributions: real examples
of how to use the library, but not actively maintained by developers.
Now
let’s switch our focus to testing the library’s correctness:
The
concept of HDF5 correctness divides itself naturally into two parts: the
internal correctness of the library and its external correctness. It divides this way because there's a public
interface available through hdf5.h and private interfaces not accessible
through hdf5.h. Client code
exercises the external, public part and library code exercises the internal,
private part. For all practical
purposes, the public interface allows an infinite number of use cases, so
client code can be unpredictable, and may occasionally push the library's
limits. The private interface, on the
other hand, is probably used in a smaller, more precise, and more directed
way. Client code cannot exercise the
private, internal interface for the simple reason that it's inaccessible. Because of these differences, internal and
external interfaces should be tested in different ways.
At LPS, we used a form of Bertrand Meyer's “Design
by Contract” ideas to ensure the correctness of a very large and complicated
application that we built for ASCI on top of HDF5.  (More information on DbC is available in “Object-Oriented Software Construction” by Bertrand Meyer, “Design by Contract, by Example” by Mitchell and McKim, and at Meyer's Eiffel language web site, http://www.eiffel.com/developers/presentations/.)
Briefly, here's how it worked:
The software was divided into “objects”, pieces
that encapsulated a data structure and operators on that data structure.  The operators were of two kinds: queries,
which do not change the state of an object but return information about it, and
commands, which change the state of an object, but return no information about
it. Each operator was equipped with a
“contract”, i.e., a set of preconditions, which the caller guarantees to be
true on entering the operator, and a set of postconditions, which the operator
guarantees to be true on exit. In
addition, each class of objects had an associated invariant, a boolean valued
query which the class guaranteed to be true at every moment when an instance of
the class is accessible to a client. We
implemented the pre- and postconditions with C/C++ assert
statements. Since the contracts were
executable, they tested the code every time it was run, and in every path of
execution ever tried. A contract in
executable code is an extremely precise statement of what a routine needs, and
what it does, and what the underlying data structure represents. So, the writing of contracts sharpened the
public and private class interfaces, helped design the software, and tested the
code thoroughly.
Here's what it looks like in a little pseudo-code:
int some_query(object* obj, other args...)
{
    int result;

    /* Preconditions */
    assert(obj != NULL);
    assert(property(obj) == 7);

    /* Body of function: do something that doesn't change the state of "obj" */
    result = whatever;

    /* Postconditions */
    assert(invariant(obj));        /* state is still valid  */
    assert(property(obj) == 7);    /* state hasn't changed  */

    /* Exit */
    return result;
}
This implementation of Design by Contract proved to
be immensely powerful. Once a class was
implemented and went through some initial testing, it just worked correctly
from then on and clients of that class could trust it implicitly. We never experienced the kinds of
intermittent or undefined behaviors that tend to crop up in C. HDF5 is written in procedural style, but it
does have well defined objects, so DbC could be used. Retrofitting HDF5 with Design by Contract may
not be a small undertaking, but in the long run I would expect it to be far
superior to writing a test suite.
Software contracts should be effective on the
internal interfaces because those interfaces are thoroughly exercised by normal
use of the library. But the external
interface tends not to be used as much by developers, so fitting it with
contracts may not prove its correctness.
Furthermore, developers tend not to abuse it or push its limits. Clients tend to write much more challenging
code, and they are more likely to provide good test cases. So an external interface test suite should be
written and it should definitely contain client written code. However, client code may be proprietary, or
it may contain a lot of complications irrelevant to an HDF5 test. So client written code should be reduced
somehow to its essential HDF5 behavior, and the irrelevant or needlessly
complicated code should be stripped out.
Because there are lots of different ways in which HDF5 can be used, an
external interface test suite should be a large one. A large test suite should be handled by an
automated test harness, so there should be a uniformity to each test – a
conformity to the test harness. And the
test suite should be written in a natural, library supported language: C, C++,
or F90, so that it can express any execution path through the library.
At least two questions arise from having lots of
tests: how do you know which cases
you've actually tested (or, conversely, you haven't tested), and how do you
write and maintain all the test code?
One classic response in computer science to
managing this sort of complexity is to factor out and reuse common
patterns. Let’s elaborate a little on
the basic HDF5 use pattern introduced above:
create_io_environment()
{
    define the parallel environment
    define file and memory dataspaces and selections
    define file and memory datatypes
    create a memory buffer
    define the dataset characteristics
    open/create a file and dataset
    optionally define a filter
}

do_io_test()
{
    do whatever io action needs testing
}

main()
{
    create_io_environment()
    do_io_test()
    report_results()
}
One can easily imagine running a family of tests in
which this pattern is repeated but the parameters of datatype, dataspace, type
and number of datasets, etc. vary from test to test. While it's easy to make a copy of a source code
file and tweak it slightly for a new variation, the copy-and-tweak methodology
soon becomes a maintenance problem.
Experience suggests that it's much better for constant "stuff"
to be coded once and reused. So, we're
led to consider ideas such as expressed in "Design Patterns: Elements of
Reusable Object-Oriented Software” (Gamma et al.) and “Patterns for
Parallel Programming” (Mattson, Sanders, and Massingill). I can see at least three of their "Behavioral
Patterns" that our test suite could profit from:
"Template
Method"
Defines the skeleton of an algorithm, deferring
some steps to subclasses. That's what
do_io_test() does above.
"Strategy"
Defines a family of algorithms and makes them
interchangeable, letting the algorithm vary from client to client. An example might be using fixed size data
structures and fixed size datasets vs. using extendible data structures and
chunked datasets.
"Iterator"
Provides a way to access the elements of an aggregate object sequentially without exposing the representation of the object.  An example of this might be stepping through memory and disk dataspaces by coordinated changes of selections, extending a chunked dataset and writing to it at each step.  An "iterator" would coordinate the dataspace selection and dataset extension activities.
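A rough C sketch of such an iterator, assuming the HDF5 1.8+ API, a 1-d dataset created chunked with an unlimited maximum dimension, and a memory dataspace (memspace) describing one slab's worth of elements in buf; all names are illustrative.

#include <hdf5.h>

/* Step through a 1-d transfer in fixed-size slabs, extending the chunked */
/* dataset and moving the file-space selection at each step.              */
void iterated_write(hid_t dset, hid_t memspace, const int *buf,
                    hsize_t slab, hsize_t nslabs)
{
    for (hsize_t i = 0; i < nslabs; i++) {
        hsize_t newsize = (i + 1) * slab;
        H5Dset_extent(dset, &newsize);              /* grow the dataset by one slab */

        hid_t   filespace = H5Dget_space(dset);     /* refresh the dataspace after extending */
        hsize_t start     = i * slab;
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &slab, NULL);

        /* the basic pattern goes here: a (timed) H5Dwrite() of one slab */
        H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, H5P_DEFAULT, buf);

        H5Sclose(filespace);
    }
}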
Other useful patterns undoubtedly exist, but these
examples alone argue for our "tool" to be coded so that it is easy to
express and reuse such behavioral patterns, to replace one pattern with
another, and to vary the types and properties of the objects that act within
these patterns.
Measuring speed is an easy extension to testing
correctness if we have a suitably designed library of usage patterns. For example, the pattern outlined above can
be fitted with a timer.  Only do_io_test() changes – its old body moves into a new do_io_action(), which the timer now brackets:
create_io_environment()
{
    same implementation as above
}

do_io_test()
{
    start a timer
    do_io_action()
    stop the timer
    return elapsed time
}

do_io_action()
{
    do whatever io action needs a performance test
}

main()
{
    create_io_environment()
    do_io_test()
    report_results()
}
The fact that this performance test example uses
nearly the same code as the external interface test example suggests that if we
design our test suite correctly, we should be able to use any performance test
as an external interface test and vice versa.
The timer can be implemented as a call to
getrusage() in serial and MPI_Wtime() in parallel.  A good example of this already exists in
XXXX.
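For illustration, a minimal sketch of such a timer; the wrapper name and the serial/parallel switch are my own choices, and getrusage() reports CPU time while MPI_Wtime() reports wall-clock time.

#include <sys/resource.h>
#include <mpi.h>

/* Return a timestamp in seconds.  In parallel runs, use MPI's wall clock     */
/* (assumes MPI_Init() has been called); in serial runs, fall back to the CPU */
/* time reported by getrusage().                                              */
double timer_now(int parallel)
{
    if (parallel)
        return MPI_Wtime();

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (double)ru.ru_utime.tv_sec + 1e-6 * (double)ru.ru_utime.tv_usec;
}

/* Typical use: t0 = timer_now(p); ... H5Dwrite() ...; elapsed = timer_now(p) - t0; */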
Measuring and reporting the spatial resources
(memory and disk) consumed by a use case ought to be yet another simple
variation on the use case examples shown above.
We time an operation by noting a starting time and a stopping time, and
reporting the difference as the elapsed time used. Measuring the consumption of spatial
resources is not essentially different – just note the starting and stopping
size of memory and/or disk resources and report the difference.
create_io_environment()
{
    same implementation as above
}

do_io_test()
{
    measure space consumed before i/o
    do_io_action()
    measure space consumed after i/o
    return net increase in space consumed
}

do_io_action()
{
    do whatever io action needs a performance test
}

main()
{
    create_io_environment()
    do_io_test()
    report_results()
}
The similarity of this code to the speed testing
code suggests another opportunity for code reuse. Designed correctly, we ought to be able to
slip any measuring instrument into any use case without having to recode the
use case. For speed tests, the
instrument is a “stopwatch”. For space
tests, it's a “ruler” (rulers measure how big something is, right?). For correctness tests, we might have a
“counter” - something that counts the number and severity of unsuccessful
library calls. Or maybe we always
measure all aspects of performance: correctness, speed, and space. The overhead of measuring these
characteristics should be a small fraction of the cost of running the use
case. Why not get every bit of
diagnostic information from every test run?
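One hedged C sketch of such a pluggable instrument: a small struct of function pointers that any stopwatch, ruler, or counter could implement. The names are hypothetical.

/* Any measuring device (stopwatch, ruler, counter) supplies these three */
/* operations, so a use case can be instrumented without recoding it.    */
typedef struct instrument {
    void   (*start)(struct instrument *self);
    void   (*stop)(struct instrument *self);
    double (*reading)(struct instrument *self);   /* seconds, bytes, error count, ... */
} instrument;

/* A use case takes an instrument and brackets its i/o with it. */
double run_use_case(instrument *ins)
{
    ins->start(ins);
    /* ... do whatever i/o action needs measuring ... */
    ins->stop(ins);
    return ins->reading(ins);
}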
How do we measure sizes?  A stat() call gives the file size on disk.  But I haven’t found any portable equivalent
call to determine how much memory a process has consumed. Perhaps HDF5 could keep track of some of this
by wrapping malloc/calloc/free/etc. calls in something that keeps track of the
amount of memory requested. This is at
best a lower bound on the memory used, however.
I wonder how useful that would be?
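Here are minimal sketches of the two rulers discussed here; the function names are mine, and the malloc wrapper only sees allocations routed through it, which is why it gives only a lower bound.

#include <sys/stat.h>
#include <stdlib.h>

/* Disk ruler: bytes currently occupied by a file, via stat(). */
long long disk_ruler(const char *filename)
{
    struct stat sb;
    return (stat(filename, &sb) == 0) ? (long long)sb.st_size : -1;
}

/* Memory ruler (lower bound): count the bytes requested through this wrapper. */
static size_t bytes_requested = 0;

void *tracked_malloc(size_t n)
{
    bytes_requested += n;
    return malloc(n);
}

size_t memory_ruler(void)
{
    return bytes_requested;
}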