Document's Audience:
- Current HDF5 library designers and knowledgeable external developers.
Background Reading:
Introduction:
- What is raw data I/O in HDF5?
- The raw data I/O algorithms in HDF5 determine how raw data is
  handled as it is transferred between memory and a file.
- Why should I care about raw data I/O in HDF5?
- The algorithms and data structures used to transfer data between
memory and disk are crucial in determining the HDF5 library's
performance when
dealing with raw data. Choosing an inappropriate algorithm or poorly
designed data structure for common access patterns will guarantee
that the library performs poorly for certain applications, even if
other parts of the library perform very well.
- How can we measure raw data I/O performance in HDF5?
- Care needs to be taken to create several benchmarks that are
  representative of common application access patterns. These
benchmarks should be used to measure performance of the library
on various machines. This performance information can then be
used as the basis for investigation into the library's behavior
and (hopefully) improvement.
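        As a concrete illustration, the fragment below times a full, contiguous
        write of a small two-dimensional dataset. The file name, dataset name
        and dataset size are arbitrary choices for the example, the 1.x-era
        five-argument H5Dcreate() signature is assumed, and a real benchmark
        suite would also need to cover the chunked, compressed,
        partial-selection and parallel access patterns described below.

            #include <stdio.h>
            #include <sys/time.h>
            #include "hdf5.h"

            /* Wall-clock timer; CPU time is not meaningful for I/O benchmarks. */
            static double now(void)
            {
                struct timeval tv;
                gettimeofday(&tv, NULL);
                return tv.tv_sec + tv.tv_usec / 1.0e6;
            }

            int main(void)
            {
                hsize_t dims[2] = {1024, 1024};
                static int buf[1024][1024];          /* data to write (zero-filled) */
                double start, elapsed;

                hid_t file  = H5Fcreate("rawio_bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
                hid_t space = H5Screate_simple(2, dims, NULL);
                hid_t dset  = H5Dcreate(file, "bench", H5T_NATIVE_INT, space, H5P_DEFAULT);

                /* Time a single full-dataset, no-conversion, contiguous write. */
                start = now();
                H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
                elapsed = now() - start;
                printf("contiguous full-dataset write: %.3f s\n", elapsed);

                H5Dclose(dset);
                H5Sclose(space);
                H5Fclose(file);
                return 0;
            }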
Feature's Primary Users:
- Current HDF5 users
- Most (if not all) HDF5 users create datasets and store raw data
in those datasets and thus care about how the library performs when
reading raw data from and writing raw data to the file. There are several
specific user communities who use different aspects of raw data I/O
that we should pay attention to:
- Parallel I/O applications which store large
contiguous-storage datasets. Many applications in the ASCI
community fall into this category.
- Serial I/O applications which store chunked-storage
datasets, both compressed and uncompressed and with
extendible or fixed dimensions. Applications in the NASA
earth science community fall into this category.
- Serial I/O applications which store contiguous-storage
datasets. This category represents the "general
application" most commonly used with HDF5 files.
- New users
- Additionally, there may be other users who have chosen not to use
  HDF5 due to poor raw data I/O performance, and we may enlarge the
  HDF5 user base by improving this aspect of the library.
Design Goals:
- Provide a document which describes the raw data I/O
algorithms used by the library as well as a clearly written document
describing the new raw data I/O architecture.
- Improve the library's raw data I/O performance. This should be
measured with actual benchmark information from before and after the
changes.
- Impact the current public HDF5 APIs as little as possible. Specifically,
user applications should not need to make any source code changes
to operate correctly with the changes in the library. Changes to
application source code may be required to take advantage of
specific performance improvement features added, but that is not
foreseen at this time.
Requirements:
- No changes to the HDF5 file format may occur as a side-effect of
implementing this feature.
Definitions of Terms Used Below:
There are several important aspects of the raw data I/O that must be
kept in mind when deciding how to best perform the I/O:
- Convert/No-convert
- Whether the raw data needs to be converted with a datatype
conversion.
- Contiguous/Chunked
- Whether the raw data is stored using contiguous or chunked
storage.
- Serial/Parallel
- Whether the I/O operation is occurring in serial or parallel.
- Allocated/Unallocated
- Whether space for the chunk or contiguous dataset is allocated
in the file or not.
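    To make these four properties concrete, the sketch below shows how three
    of them can be observed for an open dataset through existing public API
    calls (serial vs. parallel is a property of the file access property list
    and is not shown); dset and mem_type are assumed to be valid handles.

        #include "hdf5.h"

        void classify_io(hid_t dset, hid_t mem_type)
        {
            hid_t dcpl      = H5Dget_create_plist(dset);
            hid_t file_type = H5Dget_type(dset);

            /* Convert/No-convert: does the memory datatype match the file datatype? */
            int needs_convert = (H5Tequal(mem_type, file_type) <= 0);

            /* Contiguous/Chunked: which storage layout was the dataset created with? */
            int is_chunked = (H5Pget_layout(dcpl) == H5D_CHUNKED);

            /* Allocated/Unallocated: has any raw data storage been allocated yet? */
            int is_allocated = (H5Dget_storage_size(dset) > 0);

            /* These three values (plus serial vs. parallel) select the I/O path. */
            (void)needs_convert; (void)is_chunked; (void)is_allocated;

            H5Tclose(file_type);
            H5Pclose(dcpl);
        }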
Current Library Behavior:
The following pseudo-code describes the current behavior of the library when
performing raw data I/O:
- Unallocated
- Parallel:
Can't happen: data is always guaranteed to be
allocated in the file when performing parallel
I/O.
- Serial:
- Reading:
Fill the application's selection with the fill-value
for the dataset if a fill-value is defined, otherwise
just return leaving "junk" in user's memory.
- Writing:
Allocate space for dataset, filling with fill-value
if one is defined for this dataset. Then proceed to
"Allocated" case below.
- Allocated
- No Convert:
- Optimized I/O possible: ("all" and regular hyperslabs
only)
- Parallel:
Generate MPI type for selection and perform
optimized MPI read directly into memory.
- Serial:
Perform optimized read directly into memory.
- Optimized I/O not possible:
Fall through to "Convert" case below.
- Convert:
Loop through the selection, gathering elements from the source
into the conversion buffer, converting them between the file and
memory datatypes, and scattering the converted elements to the
destination (see the sketch below).
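A rough C-like sketch of this "Convert" read path follows. The helper
functions gather_from_file(), convert_buffer() and scatter_to_memory() are
hypothetical placeholders standing in for the library's internal gather,
datatype-conversion and scatter steps; they are not public HDF5 API calls.

    #include <stddef.h>

    size_t gather_from_file(void *file_sel, void *conv_buf, size_t max_elem);
    void   convert_buffer(void *conv_buf, size_t n);
    void   scatter_to_memory(const void *conv_buf, size_t n, void *mem_sel, void *app_buf);

    void convert_read(void *file_sel, void *mem_sel, void *app_buf,
                      void *conv_buf, size_t conv_buf_elems, size_t total_elems)
    {
        while (total_elems > 0) {
            /* Gather: walk the file selection, copying the next batch of raw
             * elements from the file into the conversion buffer. */
            size_t n = gather_from_file(file_sel, conv_buf, conv_buf_elems);

            /* Convert: change the elements from the file datatype to the
             * memory datatype, in place in the conversion buffer. */
            convert_buffer(conv_buf, n);

            /* Scatter: walk the memory selection, copying the converted
             * elements into the application's buffer. */
            scatter_to_memory(conv_buf, n, mem_sel, app_buf);

            total_elems -= n;
        }
    }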
There are several problems with the current approach. First, both the
"convert" and "no convert" cases don't account for the dataset being chunked,
leading to very poor performance in all I/O requests which access less than the
entire dataset at a time. Second, the "convert" case does not perform parallel
I/O with MPI-I/O; it essentially breaks all I/O requests into serial
requests, which also leads to poor performance in many cases.
Proposed Changes to Library Behavior:
Revisions to the raw data I/O algorithms need to address two areas of poor
performance in the current design: accounting for chunked-storage datasets and
allowing true parallel I/O to occur for all I/O operations. The following
outline describes the initial revision to the raw data I/O architecture:
- Chunked Storage: Iterate through each chunk that the selection overlaps
  (the algorithm for determining this differs for serial vs. parallel
  I/O, so abstract "get first chunk" and "get next chunk" function
  pointers will have to be defined, with different functions supplied
  for parallel & serial I/O; a sketch of this iteration follows this outline).
Foreach <chunk>: Generate sub-selection in file and memory
for portion of selection which overlaps <chunk>, then:
- Unallocated: (Convert/No Convert not an issue)
- Reading: (Parallel/Serial I/O not an issue)
- Fill the application's selection with the
fill-value for the dataset if a fill-value
is defined, otherwise just return leaving
"junk" in user's memory.
- Writing: (Parallel/Serial I/O not an issue)
    - Allocate <chunk>, filling
      <chunk> with the fill-value for the
      dataset if a fill-value is defined, then
      fall through to the "Allocated" case below.
- Allocated:
- Convert: Loop through, filling the conversion buffer
  with elements from the sub-selection:
  Generate a sub-sub-selection in the file and in memory
  for the portion of the sub-selection in <chunk> which
  fits in the conversion buffer, then:
- Serial: Gather -> Convert -> Scatter
- Parallel: Gather -> Convert -> Scatter also,
but generate MPI type from sub-sub-selection
so file I/O operation can be a
single optimized MPI operation.
- No Convert: Can move raw data directly between
application memory buffer and the file in one
operation.
- Serial: Optimized I/O directly between
application buffer and the file.
- Parallel: Optimized I/O directly between
  application buffer and the file also,
  but generate MPI type from the sub-selection
  so the I/O operation can be a
  single optimized MPI operation.
- Contiguous Storage:
- Unallocated: (Convert/No Convert not an issue)
- Reading: (Parallel/Serial I/O not an issue)
- Fill the application's selection with the
fill-value for the dataset if a fill-value
is defined, otherwise just return leaving
"junk" in user's memory.
- Writing: (Parallel/Serial I/O not an issue)
    - Allocate the dataset, filling it
      with the fill-value for the
      dataset if a fill-value is defined, then
      fall through to the "Allocated" case below.
- Allocated:
- Convert: Loop through filling conversion buffer
with elements from selection.
Generate sub-selection in file and memory
for portion of selection which
fits in conversion buffer.
- Serial: Gather -> Convert -> Scatter
- Parallel: Gather -> Convert -> Scatter also,
but generate MPI type from sub-selection
so file I/O operation can be a
single optimized MPI operation.
- No Convert: Can move raw data directly between
application memory buffer and the file in one
operation.
- Serial: Optimized I/O directly between
application buffer and the file.
- Parallel: Optimized I/O directly between
  application buffer and the file also,
  but generate MPI type from the selection
  so the I/O operation can be a
  single optimized MPI operation.
Clearly, the contiguous-storage I/O case is a sub-set of the chunked-storage case
and should be implemented as a sub-routine that can be called for both the
contiguous-storage and chunked-storage cases.
The chunked-storage case should then treat each chunk as if it were a
contiguous-storage dataset and call the common sub-routine to operate on it.
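The sketch below illustrates this structure: the chunk iteration is driven
through "get first chunk"/"get next chunk" function pointers (so serial and
parallel I/O can supply different iteration algorithms), and each chunk's
sub-selection is handed to the common routine shared with the
contiguous-storage path. Every type and function name here is a hypothetical
illustration, not an existing library symbol.

    #include <stddef.h>

    typedef struct chunk_info_t chunk_info_t;   /* per-chunk address, coordinates, ... */

    typedef struct {
        /* Different implementations are plugged in for serial vs. parallel I/O. */
        chunk_info_t *(*get_first_chunk)(void *file_sel);
        chunk_info_t *(*get_next_chunk)(void *file_sel, chunk_info_t *cur);
    } chunk_iter_ops_t;

    /* Common per-"contiguous region" I/O routine, used directly for
     * contiguous-storage datasets and once per chunk for chunked storage. */
    void io_on_contiguous_region(void *region, void *file_sel, void *mem_sel, void *buf);

    void chunked_io(const chunk_iter_ops_t *ops, void *file_sel, void *mem_sel, void *buf)
    {
        chunk_info_t *chunk;

        for (chunk = ops->get_first_chunk(file_sel); chunk != NULL;
             chunk = ops->get_next_chunk(file_sel, chunk)) {
            /* Generate the sub-selections in file and memory for the portion
             * of the selection overlapping this chunk (omitted), then treat
             * the chunk as a small contiguous dataset. */
            io_on_contiguous_region(chunk, file_sel, mem_sel, buf);
        }
    }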
Note: The "allocate" operation for parallel I/O must currently be done
in collective mode, until "flexible parallel HDF5" is implemented.
Implementation Plans:
Several things need to be done for the design outlined above to be
implemented:
- The code which generates the MPI type for a selection for use with
optimized parallel I/O (i.e. direct transfers between memory and
the file with no datatype conversion) must be
enhanced to allow any selection to be used as the basis of the MPI
type (see the first sketch following this list). The current code only
handles "regular" hyperslab selections
(i.e. those generated from a single call to H5Sselect_hyperslab(),
not selections formed from multiple operations on hyperslabs) and
"all" selections (i.e. the entire dataset).
- The code which performs optimized serial I/O must be enhanced to
allow any selection to be used.
The current code only handles "regular" hyperslab selections
(i.e. those generated from a single call to H5Sselect_hyperslab(),
not selections formed from multiple operations on hyperslabs) and
"all" selections (i.e. the entire dataset).
- Parallel I/O gather/scatter driver routines (accessed through
  function pointers in the datatype conversion loop) must be
  implemented. This will allow the gather/scatter operations to
  be performed with optimized MPI operations (see the second sketch
  following this list).
- The raw data I/O architecture must be inverted to deal with
chunked vs. contiguous storage first, instead of handling the
dataset very "abstractly" and allowing the lower levels to deal with
the different storage methods.
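First sketch: the general shape of an MPI datatype built from an arbitrary
selection, assuming the selection has already been flattened into an array of
(byte offset, byte length) pairs. How that flattening is performed for
irregular hyperslab and point selections is precisely the part that must be
enhanced, and is not shown here.

    #include <mpi.h>

    MPI_Datatype selection_to_mpi_type(int count, MPI_Aint offsets[], int lengths[])
    {
        MPI_Datatype newtype;

        /* One block of lengths[i] bytes at displacement offsets[i] for each
         * piece of the flattened selection. */
        MPI_Type_create_hindexed(count, lengths, offsets, MPI_BYTE, &newtype);
        MPI_Type_commit(&newtype);
        return newtype;
    }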
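Second sketch: the gather/scatter driver idea. The datatype conversion loop
calls through a small table of function pointers, so the same loop can use
memory-copy routines in serial and MPI derived-datatype routines in parallel.
All names below are hypothetical illustrations, not existing library symbols.

    #include <stddef.h>

    typedef struct {
        /* Gather elements of the source selection into the conversion buffer. */
        size_t (*gather)(void *src_sel, void *conv_buf, size_t max_elem);
        /* Scatter converted elements from the conversion buffer to the
         * destination selection (file for writes, memory for reads). */
        size_t (*scatter)(const void *conv_buf, size_t n, void *dest_sel, void *dest);
    } io_xfer_ops_t;

    /* One instance would be selected at run time, e.g.: */
    extern const io_xfer_ops_t serial_xfer_ops;   /* plain memory copies        */
    extern const io_xfer_ops_t mpio_xfer_ops;     /* MPI derived-datatype I/O   */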
The first three changes can be implemented within the current raw data I/O
architecture. These changes would be implemented first, followed by the
inversion of the raw data I/O architecture once the pieces are all in place.
Advanced Features:
?
Alternate Approaches:
?
Forward/Backward Compatibility Repercussions:
There will be no forward or backward file-format compatibility changes
resulting from these changes. It is possible that some aspects of the new
raw data I/O architecture may require additional API functions to enable or
control new functionality, but those new functions should not have any negative
impact on existing applications.
New API Calls:
None planned.