Document's Audience:
- Current HDF5 library designers and knowledgeable external developers.
Background Reading:
Introduction:
- What is raw data I/O in HDF5?
- The raw data I/O algorithms in HDF5 determine how raw data is
  handled as it is transferred between memory and a file.
- Why should I care about raw data I/O in HDF5?
- The algorithms and data structures used to transfer data between
memory and disk are crucial in determining the HDF5 library's
performance when
dealing with raw data. Choosing an inappropriate algorithm or poorly
designed data structure for common access patterns will guarantee
that the library performs poorly for certain applications, even if
other parts of the library perform very well.
- How can we measure raw data I/O performance in HDF5?
- Care needs to be taken to create several benchmarks that are
  representative of common application access patterns. These
benchmarks should be used to measure performance of the library
on various machines. This performance information can then be
used as the basis for investigation into the library's behavior
and (hopefully) improvement.
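        As a concrete illustration, the fragment below times a full, contiguous
        write of a small two-dimensional dataset. The file name, dataset name
        and dataset size are arbitrary choices for the example, the 1.x-era
        five-argument H5Dcreate() signature is assumed, and a real benchmark
        suite would also need to cover the chunked, compressed,
        partial-selection and parallel access patterns described below.

            #include <stdio.h>
            #include <sys/time.h>
            #include "hdf5.h"

            /* Wall-clock timer; CPU time is not meaningful for I/O benchmarks. */
            static double now(void)
            {
                struct timeval tv;
                gettimeofday(&tv, NULL);
                return tv.tv_sec + tv.tv_usec / 1.0e6;
            }

            int main(void)
            {
                hsize_t dims[2] = {1024, 1024};
                static int buf[1024][1024];          /* data to write (zero-filled) */
                double start, elapsed;

                hid_t file  = H5Fcreate("rawio_bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
                hid_t space = H5Screate_simple(2, dims, NULL);
                hid_t dset  = H5Dcreate(file, "bench", H5T_NATIVE_INT, space, H5P_DEFAULT);

                /* Time a single full-dataset, no-conversion, contiguous write. */
                start = now();
                H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
                elapsed = now() - start;
                printf("contiguous full-dataset write: %.3f s\n", elapsed);

                H5Dclose(dset);
                H5Sclose(space);
                H5Fclose(file);
                return 0;
            }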
Feature's Primary Users:
- Current HDF5 users
- Most (if not all) HDF5 users create datasets and store raw data
in those datasets and thus care about how the library performs when
reading raw data from and writing raw data to the file. There are several
specific user communities who use different aspects of raw data I/O
that we should pay attention to:
- Parallel I/O applications which store large
contiguous-storage datasets. Many applications in the ASCI
community fall into this category.
- Serial I/O applications which store chunked-storage
datasets, both compressed and uncompressed and with
extendible or fixed dimensions. Applications in the NASA
earth science community fall into this category.
- Serial I/O applications which store contiguous-storage
datasets. This category represents the "general
application" most commonly used with HDF5 files.
- New users
- Additionally, there may be other users who have chosen not to use
  HDF5 due to poor raw data I/O performance, and we may enlarge the
  HDF5 user base by improving this aspect of the library.
Design Goals:
- Provide a document which describes the raw data I/O
algorithms used by the library as well as a clearly written document
describing the new raw data I/O architecture.
- Improve the library's raw data I/O performance. This should be
measured with actual benchmark information from before and after the
changes.
- Impact the current public HDF5 APIs as little as possible. Specifically,
user applications should not need to make any source code changes
to operate correctly with the changes in the library. Changes to
application source code may be required to take advantage of
specific performance improvement features added, but that is not
foreseen at this time.
Requirements:
- No changes to the HDF5 file format may occur as a side-effect of
implementing this feature.
Definitions of Terms Used Below:
There are several important aspects of the raw data I/O that must be
kept in mind when deciding how to best perform the I/O:
- Convert/No-convert
- Whether the raw data needs to be converted with a datatype
conversion.
- Contiguous/Chunked
- Whether the raw data is stored using contiguous or chunked
storage.
- Serial/Parallel
- Whether the I/O operation is occurring in serial or parallel.
- Allocated/Unallocated
- Whether space for the chunk or contiguous dataset is allocated
in the file or not.
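    To make these four properties concrete, the sketch below shows how three
    of them can be observed for an open dataset through existing public API
    calls (serial vs. parallel is a property of the file access property list
    and is not shown); dset and mem_type are assumed to be valid handles.

        #include "hdf5.h"

        void classify_io(hid_t dset, hid_t mem_type)
        {
            hid_t dcpl      = H5Dget_create_plist(dset);
            hid_t file_type = H5Dget_type(dset);

            /* Convert/No-convert: does the memory datatype match the file datatype? */
            int needs_convert = (H5Tequal(mem_type, file_type) <= 0);

            /* Contiguous/Chunked: which storage layout was the dataset created with? */
            int is_chunked = (H5Pget_layout(dcpl) == H5D_CHUNKED);

            /* Allocated/Unallocated: has any raw data storage been allocated yet? */
            int is_allocated = (H5Dget_storage_size(dset) > 0);

            /* These three values (plus serial vs. parallel) select the I/O path. */
            (void)needs_convert; (void)is_chunked; (void)is_allocated;

            H5Tclose(file_type);
            H5Pclose(dcpl);
        }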
Current Library Behavior:
The following pseudo-code describes the current behavior of the library when
performing raw data I/O:
- Unallocated
- Parallel:
Can't happen: data is always guaranteed to be
allocated in the file when performing parallel
I/O.
- Serial:
- Reading:
Fill the application's selection with the fill-value
for the dataset if a fill-value is defined, otherwise
just return leaving "junk" in user's memory.
- Writing:
Allocate space for dataset, filling with fill-value
if one is defined for this dataset. Then proceed to
"Allocated" case below.
- Allocated
- No Convert:
- Optimized I/O possible: ("all" and regular hyperslabs
only)
- Parallel:
Generate MPI type for selection and perform
optimized MPI read directly into memory.
- Serial:
Perform optimized read directly into memory.
- Optimized I/O not possible:
Fall through to "Convert" case below.
- Convert:
Loop through the selection, gathering elements from the source
into the conversion buffer, converting them between the file and
memory datatypes, and scattering the converted elements to the
destination (see the sketch below).
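A rough C-like sketch of this "Convert" read path follows. The helper
functions gather_from_file(), convert_buffer() and scatter_to_memory() are
hypothetical placeholders standing in for the library's internal gather,
datatype-conversion and scatter steps; they are not public HDF5 API calls.

    #include <stddef.h>

    size_t gather_from_file(void *file_sel, void *conv_buf, size_t max_elem);
    void   convert_buffer(void *conv_buf, size_t n);
    void   scatter_to_memory(const void *conv_buf, size_t n, void *mem_sel, void *app_buf);

    void convert_read(void *file_sel, void *mem_sel, void *app_buf,
                      void *conv_buf, size_t conv_buf_elems, size_t total_elems)
    {
        while (total_elems > 0) {
            /* Gather: walk the file selection, copying the next batch of raw
             * elements from the file into the conversion buffer. */
            size_t n = gather_from_file(file_sel, conv_buf, conv_buf_elems);

            /* Convert: change the elements from the file datatype to the
             * memory datatype, in place in the conversion buffer. */
            convert_buffer(conv_buf, n);

            /* Scatter: walk the memory selection, copying the converted
             * elements into the application's buffer. */
            scatter_to_memory(conv_buf, n, mem_sel, app_buf);

            total_elems -= n;
        }
    }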
There are several problems with the current approach. First, both the
"convert" and "no convert" cases don't account for the dataset being chunked,
leading to very poor performance in all I/O requests which access less than the
entire dataset at a time. Second, the "convert" case does not perform parallel
I/O with MPI-I/O; it essentially breaks all I/O requests into serial
requests, which also leads to poor performance in many cases.
Proposed Changes to Library Behavior:
Revisions to the raw data I/O algorithms need to address two areas of poor
performance in the current design: accounting for chunked-storage datasets and
allowing true parallel I/O to occur for all I/O operations. The following
outline describes the initial revision to the raw data I/O architecture:
- Chunked Storage: Iterate through each chunk that the selection overlaps
  (the algorithm for determining this differs for serial vs. parallel
  I/O, so abstract "get first chunk" and "get next chunk" function
  pointers will have to be defined, with different functions supplied
  for parallel & serial I/O; a sketch of this iteration follows this outline).
Foreach <chunk>: Generate sub-selection in file and memory
for portion of selection which overlaps <chunk>, then:
- Unallocated: (Convert/No Convert not an issue)
- Reading: (Parallel/Serial I/O not an issue)
- Fill the application's selection with the
fill-value for the dataset if a fill-value
is defined, otherwise just return leaving
"junk" in user's memory.
- Writing: (Parallel/Serial I/O not an issue)
    - Allocate <chunk>, filling
      <chunk> with the fill-value for the
      dataset if a fill-value is defined, then
      fall through to the "Allocated" case below.
- Allocated:
- Convert: Loop through, filling the conversion buffer
  with elements from the sub-selection:
  Generate a sub-sub-selection in the file and in memory
  for the portion of the sub-selection in <chunk> which
  fits in the conversion buffer, then:
- Serial: Gather -> Convert -> Scatter
- Parallel: Gather -> Convert -> Scatter also,
but generate MPI type from sub-sub-selection
so file I/O operation can be a
single optimized MPI operation.
- No Convert: Can move raw data directly between
application memory buffer and the file in one
operation.
- Serial: Optimized I/O directly between
application buffer and the file.
- Parallel: Optimized I/O directly between
  application buffer and the file also,
  but generate MPI type from the sub-selection
  so the I/O operation can be a
  single optimized MPI operation.
- Contiguous Storage:
- Unallocated: (Convert/No Convert not an issue)
- Reading: (Parallel/Serial I/O not an issue)
- Fill the application's selection with the
fill-value for the dataset if a fill-value
is defined, otherwise just return leaving
"junk" in user's memory.
- Writing: (Parallel/Serial I/O not an issue)
    - Allocate the dataset, filling it
      with the fill-value for the
      dataset if a fill-value is defined, then
      fall through to the "Allocated" case below.
- Allocated:
- Convert: Loop through filling conversion buffer
with elements from selection.
Generate sub-selection in file and memory
for portion of selection which
fits in conversion buffer.
- Serial: Gather -> Convert -> Scatter
- Parallel: Gather -> Convert -> Scatter also,
but generate MPI type from sub-selection
so file I/O operation can be a
single optimized MPI operation.
- No Convert: Can move raw data directly between
application memory buffer and the file in one
operation.
- Serial: Optimized I/O directly between
application buffer and the file.
- Parallel: Optimized I/O directly between
  application buffer and the file also,
  but generate MPI type from the selection
  so the I/O operation can be a
  single optimized MPI operation.
Clearly, the contiguous-storage I/O case is a sub-set of the chunked-storage case
and should be implemented as a sub-routine that can be called for both the
contiguous-storage and chunked-storage cases.
The chunked-storage case should then treat each chunk as if it were a
contiguous-storage dataset and call the common sub-routine to operate on it.
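The sketch below illustrates this structure: the chunk iteration is driven
through "get first chunk"/"get next chunk" function pointers (so serial and
parallel I/O can supply different iteration algorithms), and each chunk's
sub-selection is handed to the common routine shared with the
contiguous-storage path. Every type and function name here is a hypothetical
illustration, not an existing library symbol.

    #include <stddef.h>

    typedef struct chunk_info_t chunk_info_t;   /* per-chunk address, coordinates, ... */

    typedef struct {
        /* Different implementations are plugged in for serial vs. parallel I/O. */
        chunk_info_t *(*get_first_chunk)(void *file_sel);
        chunk_info_t *(*get_next_chunk)(void *file_sel, chunk_info_t *cur);
    } chunk_iter_ops_t;

    /* Common per-"contiguous region" I/O routine, used directly for
     * contiguous-storage datasets and once per chunk for chunked storage. */
    void io_on_contiguous_region(void *region, void *file_sel, void *mem_sel, void *buf);

    void chunked_io(const chunk_iter_ops_t *ops, void *file_sel, void *mem_sel, void *buf)
    {
        chunk_info_t *chunk;

        for (chunk = ops->get_first_chunk(file_sel); chunk != NULL;
             chunk = ops->get_next_chunk(file_sel, chunk)) {
            /* Generate the sub-selections in file and memory for the portion
             * of the selection overlapping this chunk (omitted), then treat
             * the chunk as a small contiguous dataset. */
            io_on_contiguous_region(chunk, file_sel, mem_sel, buf);
        }
    }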
Note: The "allocate" operation for parallel I/O must currently be done
in collective mode, until "flexible parallel HDF5" is implemented.
Implementation Plans:
Several things need to be done for the design outlined above to be
implemented:
- The code which generates the MPI type for a selection for use with
optimized parallel I/O (i.e. direct transfers between memory and
the file with no datatype conversion) must be
enhanced to allow any selection to be used as the basis of the MPI
type (see the first sketch following this list). The current code only
handles "regular" hyperslab selections
(i.e. those generated from a single call to H5Sselect_hyperslab(),
not selections formed from multiple operations on hyperslabs) and
"all" selections (i.e. the entire dataset).
- The code which performs optimized serial I/O must be enhanced to
allow any selection to be used.
The current code only handles "regular" hyperslab selections
(i.e. those generated from a single call to H5Sselect_hyperslab(),
not selections formed from multiple operations on hyperslabs) and
"all" selections (i.e. the entire dataset).
- Parallel I/O gather/scatter driver routines (accessed through
  function pointers in the datatype conversion loop) must be
  implemented. This will allow the gather/scatter operations to
  be performed with optimized MPI operations (see the second sketch
  following this list).
- The raw data I/O architecture must be inverted to deal with
chunked vs. contiguous storage first, instead of handling the
dataset very "abstractly" and allowing the lower levels to deal with
the different storage methods.
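First sketch: the general shape of an MPI datatype built from an arbitrary
selection, assuming the selection has already been flattened into an array of
(byte offset, byte length) pairs. How that flattening is performed for
irregular hyperslab and point selections is precisely the part that must be
enhanced, and is not shown here.

    #include <mpi.h>

    MPI_Datatype selection_to_mpi_type(int count, MPI_Aint offsets[], int lengths[])
    {
        MPI_Datatype newtype;

        /* One block of lengths[i] bytes at displacement offsets[i] for each
         * piece of the flattened selection. */
        MPI_Type_create_hindexed(count, lengths, offsets, MPI_BYTE, &newtype);
        MPI_Type_commit(&newtype);
        return newtype;
    }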
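Second sketch: the gather/scatter driver idea. The datatype conversion loop
calls through a small table of function pointers, so the same loop can use
memory-copy routines in serial and MPI derived-datatype routines in parallel.
All names below are hypothetical illustrations, not existing library symbols.

    #include <stddef.h>

    typedef struct {
        /* Gather elements of the source selection into the conversion buffer. */
        size_t (*gather)(void *src_sel, void *conv_buf, size_t max_elem);
        /* Scatter converted elements from the conversion buffer to the
         * destination selection (file for writes, memory for reads). */
        size_t (*scatter)(const void *conv_buf, size_t n, void *dest_sel, void *dest);
    } io_xfer_ops_t;

    /* One instance would be selected at run time, e.g.: */
    extern const io_xfer_ops_t serial_xfer_ops;   /* plain memory copies        */
    extern const io_xfer_ops_t mpio_xfer_ops;     /* MPI derived-datatype I/O   */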
The first three changes can be implemented within the current raw data I/O
architecture. These changes would be implemented first, followed by the
inversion of the raw data I/O architecture once the pieces are all in place.
Advanced Features:
?
Alternate Approaches:
?
Forward/Backward Compatibility Repercussions:
There will be no forward or backward file-format compatibility changes
resulting from these changes. It is possible that some aspects of the new
raw data I/O architecture may require additional API functions to enable or
control new functionality, but those new functions should not have any negative
impact on existing applications.
New API Calls:
None planned.