Raw Data I/O in HDF5

Quincey Koziol
koziol@ncsa.uiuc.edu
April 24, 2002

  1. Document's Audience:

  2. Background Reading:

  3. Introduction:

    What is raw data I/O in HDF5?
    The raw data I/O algorithms in HDF5 determine how raw data is handled as it is transferred between application memory and a file.

    Why should I care about raw data I/O in HDF5?
    The algorithms and data structures used to transfer data between memory and disk are crucial in determining the HDF5 library's performance when dealing with raw data. Choosing an inappropriate algorithm or a poorly designed data structure for common access patterns will guarantee that the library performs poorly for certain applications, even if other parts of the library perform very well.

    How can we measure raw data I/O performance in HDF5?
    Care needs to be taken to create several benchmarks that are representative of common application access patterns. These benchmarks should be used to measure the performance of the library on various machines. This performance information can then be used as the basis for investigation into the library's behavior and (hopefully) its improvement.
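
    As one minimal example of such a benchmark, the sketch below times a single full-dataset write. It uses the HDF5 C API of the era in which this document was written (1.4.x); the file name, dataset name, sizes, and use of clock() are arbitrary choices for illustration.

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include "hdf5.h"

      #define NX 1024
      #define NY 1024

      int main(void)
      {
          hsize_t dims[2] = {NX, NY};
          double *buf = malloc(NX * NY * sizeof(double));
          hid_t   file, space, dset;
          clock_t t0, t1;
          int     i;

          for (i = 0; i < NX * NY; i++)
              buf[i] = (double)i;

          file  = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
          space = H5Screate_simple(2, dims, NULL);
          dset  = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, space, H5P_DEFAULT);

          /* Time one contiguous, full-dataset write, flushing so the data
           * actually reaches the file before the clock is read again. */
          t0 = clock();
          H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
          H5Fflush(file, H5F_SCOPE_LOCAL);
          t1 = clock();
          printf("Full-dataset write: %.3f seconds\n",
                 (double)(t1 - t0) / CLOCKS_PER_SEC);

          H5Dclose(dset);
          H5Sclose(space);
          H5Fclose(file);
          free(buf);
          return 0;
      }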

  4. Feature's Primary Users:

    Current HDF5 users
    Most (if not all) HDF5 users create datasets and store raw data in those datasets, and thus care about how the library performs when reading raw data from and writing raw data to the file. There are several specific user communities who use different aspects of raw data I/O and to whom we should pay attention:
    • Parallel I/O applications which store large contiguous-storage datasets. Many applications in the ASCI community fall into this category.
    • Serial I/O applications which store chunked-storage datasets, both compressed and uncompressed and with extendible or fixed dimensions (see the example following this list). Applications in the NASA earth science community fall into this category.
    • Serial I/O applications which store contiguous-storage datasets. This category represents the "general application" most commonly used with HDF5 files.
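
    To make the storage layouts mentioned above concrete, the following fragment creates one contiguous-storage dataset and one chunked, extendible, compressed dataset (again using the HDF5 1.4-era H5Dcreate() signature; the open file id "file", the dataset names, and the sizes are illustrative assumptions):

      hsize_t dims[2]    = {1024, 1024};
      hsize_t maxdims[2] = {H5S_UNLIMITED, 1024};  /* first dimension extendible */
      hsize_t chunk[2]   = {64, 64};
      hid_t   space, ext_space, dcpl, dset_contig, dset_chunked;

      /* Contiguous storage: the default layout, stored as a single block. */
      space = H5Screate_simple(2, dims, NULL);
      dset_contig = H5Dcreate(file, "contiguous", H5T_NATIVE_FLOAT, space,
                              H5P_DEFAULT);

      /* Chunked storage: required for extendible dimensions and compression. */
      ext_space = H5Screate_simple(2, dims, maxdims);
      dcpl = H5Pcreate(H5P_DATASET_CREATE);
      H5Pset_chunk(dcpl, 2, chunk);
      H5Pset_deflate(dcpl, 6);                     /* gzip compression, level 6 */
      dset_chunked = H5Dcreate(file, "chunked", H5T_NATIVE_FLOAT, ext_space, dcpl);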

    New users
    Additionally, there may be other users who have chosen not to use HDF5 due to poor raw data I/O performance, and we may enlarge the HDF5 user base by improving this aspect of the library.

  5. Design Goals:

  6. Requirements:

  7. Definitions of Terms Used Below:

    There are several important aspects of the raw data I/O that must be kept in mind when deciding how to best perform the I/O:

  8. Current Library Behavior:

    The following pseudo-code describes the current behavior of the library when performing raw data I/O:
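
    In rough outline (this is a sketch, not the library's actual internal code; the write path simply reverses the gather/convert/scatter steps shown here for reading):

      H5Dread(dataset, memory type, memory selection, file selection, buffer):
          if (no datatype conversion is needed)        /* "no convert" case */
              transfer the file selection directly between the dataset's
              storage and the application buffer (serial or MPI-I/O),
              treating the dataset's storage as a single contiguous region
          else                                         /* "convert" case */
              for each conversion buffer's worth of elements in the selection:
                  gather the elements from the file into the conversion buffer
                  convert the elements in place
                  scatter the converted elements into the application buffer
              (each of these requests is performed as an independent serial
              access, even when MPI-I/O is available)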

    There are several problems with the current approach. First, neither the "convert" nor the "no convert" case accounts for the dataset being chunked, leading to very poor performance for any I/O request that accesses less than the entire dataset at a time. Second, the "convert" case does not perform parallel I/O with MPI-I/O; it essentially breaks all I/O requests into serial requests, which also leads to poor performance in many cases.

  9. Proposed Changes to Library Behavior:

    Revisions to the raw data I/O algorithms need to address two areas of poor performance in the current design: accounting for chunked-storage datasets and allowing true parallel I/O to occur for all I/O operations. The paragraphs below outline the initial revision to the raw data I/O architecture.

    Clearly, the contiguous-storage I/O case is a sub-set of the chunked-storage case and should be implemented as a sub-routine that can be called for both the contiguous-storage and chunked-storage cases. The chunked-storage case should then treat each chunk as if it were a contiguous-storage dataset and call the common sub-routine to operate on it, as sketched below.
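
    A minimal sketch of that structure (the routine name io_contiguous() is hypothetical):

      raw data I/O request:
          if (dataset uses contiguous storage)
              io_contiguous(dataset's storage, file selection, memory selection)
          else                                         /* chunked storage */
              for each chunk overlapping the file selection:
                  compute the portion of the selection that falls in this chunk
                  io_contiguous(chunk's storage, chunk-relative file selection,
                                corresponding memory selection)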

    Note: The "allocate" operation for parallel I/O must currently be done in collective mode, until "flexible parallel HDF5" is implemented.

  10. Implementation Plans:

    Several things need to be done for the design outlined above to be implemented:

    1. The code which generates the MPI type for a selection for use with optimized parallel I/O (i.e. direct transfers between memory and the file with no datatype conversion) must be enhanced to allow any selection to be used as the basis of the MPI type. The current code only handles "regular" hyperslab selections (i.e. those generated from a single call to H5Sselect_hyperslab(), not selections formed from multiple operations on hyperslabs) and "all" selections (i.e. the entire dataset). See the sketches following this list for an example of a selection formed from multiple hyperslab operations and of an MPI type for a regular hyperslab.
    2. The code which performs optimized serial I/O must be enhanced to allow any selection to be used. The current code has the same limitation: only "regular" hyperslab selections and "all" selections are handled.
    3. Parallel I/O gather/scatter driver routines (accessed through function pointers in the datatype conversion loop) must be implemented. This will allow the gather/scatter operations to be implemented with optimized MPI operations.
    4. The raw data I/O architecture must be inverted to deal with chunked vs. contiguous storage first, instead of handling the dataset very "abstractly" and allowing the lower levels to deal with the different storage methods.
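
    For reference, a "regular" hyperslab selection versus one formed from multiple operations looks like the following (HDF5 1.4-era H5Sselect_hyperslab() signature; the dimensions and offsets are arbitrary):

      hsize_t  dims[2]   = {100, 100};
      hssize_t start1[2] = {0, 0},   start2[2] = {50, 50};
      hsize_t  count1[2] = {10, 10}, count2[2] = {10, 10};
      hid_t    space = H5Screate_simple(2, dims, NULL);

      /* A single call produces a "regular" hyperslab selection, which the
       * current optimized I/O paths can handle. */
      H5Sselect_hyperslab(space, H5S_SELECT_SET, start1, NULL, count1, NULL);

      /* OR-ing in a second hyperslab produces a selection the current
       * optimized paths cannot handle. */
      H5Sselect_hyperslab(space, H5S_SELECT_OR, start2, NULL, count2, NULL);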

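    As a sketch of what item 1 above requires, the fragment below builds an MPI derived type describing a regular 2-D hyperslab within a dataset's storage and uses it as an MPI-I/O file view for a collective write. This is one way such a type can be constructed, not necessarily how the library builds its types internally; the MPI_File handle "fh" and the byte offset "disp" of the dataset's raw data in the file are assumed to be known.

      int          sizes[2]    = {1000, 1000};   /* whole dataset              */
      int          subsizes[2] = {100, 200};     /* this process's hyperslab   */
      int          starts[2]   = {10, 50};       /* hyperslab offset           */
      double       buf[100][200];                /* data this process writes   */
      MPI_Datatype filetype;

      MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C,
                               MPI_DOUBLE, &filetype);
      MPI_Type_commit(&filetype);

      /* Set the file view so the collective write lands in the hyperslab's
       * locations within the dataset, then write this process's block. */
      MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
      MPI_File_write_all(fh, buf, 100 * 200, MPI_DOUBLE, MPI_STATUS_IGNORE);
      MPI_Type_free(&filetype);
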
    The first three changes can be implemented within the current raw data I/O architecture. These changes would be implemented first, followed by the inversion of the raw data I/O architecture once the other pieces are all in place.
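
    For item 3, the driver entry points could take roughly the following shape; the type names and signatures below are purely hypothetical and are meant only to illustrate the kind of interface intended:

      /* Hypothetical gather/scatter entry points, selected per file driver and
       * invoked through function pointers from the datatype conversion loop.
       * An MPI-I/O driver would implement them with collective MPI operations
       * rather than element-by-element serial accesses. */
      typedef size_t (*raw_gather_t)(void *io_info, hid_t file_space_id,
                                     size_t nelem, void *conv_buf /*out*/);
      typedef herr_t (*raw_scatter_t)(void *io_info, hid_t file_space_id,
                                      size_t nelem, const void *conv_buf /*in*/);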

  11. Advanced Features:

    ?

  12. Alternate Approaches:

    ?

  13. Forward/Backward Compatibility Repercussions:

    There will be no forward or backward file-format compatibility repercussions resulting from these changes. It is possible that some aspects of the new raw data I/O architecture may require additional API functions to enable or control new functionality, but those new functions should not have any negative impact on existing applications.

  14. New API Calls:

    None planned.