Compact Data Storage Design Issues in HDF5

Raymond Lu & Quincey Koziol
{slu,koziol}@ncsa.uiuc.edu
September 9, 2002

  1. Document's Audience:

  2. Background Reading:

  3. Design Issues:

    A compact dataset is one small enough that the HDF5 library can store its raw data contiguously, directly in the dataset's object header. The proposed design expands the layout header message to hold the compact dataset's data. If a user defines a dataset as compact, a data buffer and the size of that buffer are appended to the layout message (see the File Format Changes section of this document). The layout message stays in memory for as long as the compact dataset is open. When the dataset is closed or the file is flushed to disk, the layout message is written to disk.
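
    The write-back behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the type compact_layout_t, the functions compact_write and compact_flush, and the dirty flag are all hypothetical names, and the on-disk header is modeled as a plain byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of a compact dataset's layout message.
 * Field and function names are illustrative, not HDF5 internals. */
typedef struct {
    size_t         size;   /* size of the raw data buffer, in bytes */
    unsigned char *buf;    /* raw data kept in memory while the dataset is open */
    int            dirty;  /* nonzero if buf differs from the on-disk copy */
} compact_layout_t;

/* Write into the in-memory buffer only; no file I/O happens here.
 * Returns 0 on success, -1 if the request falls outside the buffer. */
static int compact_write(compact_layout_t *l, size_t off, const void *src, size_t n)
{
    assert(l != NULL && l->buf != NULL);
    if (off > l->size || n > l->size - off)
        return -1;
    memcpy(l->buf + off, src, n);
    l->dirty = 1;
    return 0;
}

/* Called on dataset close or file flush: copy the buffer into the
 * on-disk header image (modeled here as a byte array) if it changed.
 * Returns the number of bytes written. */
static size_t compact_flush(compact_layout_t *l, unsigned char *disk_image)
{
    assert(l != NULL && disk_image != NULL);
    if (!l->dirty)
        return 0;
    memcpy(disk_image, l->buf, l->size);
    l->dirty = 0;
    return l->size;
}
```

    In this sketch a write touches memory only, and the data reaches the header image once, at flush or close time.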

    A few highlights of the compact dataset design:

    1. Since the maximum size of a header message is 64 KB, the compact dataset's data must be smaller than 65,400 bytes (64 KB minus the space taken by some other layout message fields).

    2. The size of the layout header message cannot change after the data space is allocated. A user who wants to define a compact dataset should therefore always set the space allocation time to H5D_SPACE_ALLOC_EARLY through the H5Pset_space_time function.

    3. For Parallel HDF5 (PHDF5), each process is currently limited to the "all" selection when writing data to a compact dataset. Hyperslab and point selections are not supported.

    4. The compact dataset complies with the new design of fill-values, described in the fill-value document.
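
    The size limit in item 1 can be checked before choosing compact storage. The sketch below is an assumption-laden illustration: fits_compact and COMPACT_MAX_BYTES are hypothetical names, and the 65,400-byte bound is simply the figure quoted above, not a value read from the library.

```c
#include <assert.h>
#include <stddef.h>

/* Bound quoted in this document: raw data must be smaller than this. */
#define COMPACT_MAX_BYTES 65400u

/* Returns 1 if a dataset with the given dimensions and element size is
 * small enough for compact storage, 0 otherwise. The running product is
 * checked before each multiplication, so it cannot overflow.
 * Empty datasets (a zero dimension) are left out of this sketch. */
static int fits_compact(const size_t *dims, int rank, size_t elem_size)
{
    size_t total = elem_size;
    int i;

    assert(dims != NULL && rank > 0);
    if (elem_size == 0 || elem_size >= COMPACT_MAX_BYTES)
        return 0;
    for (i = 0; i < rank; i++) {
        if (dims[i] == 0)
            return 0;
        /* total * dims[i] must stay at or below COMPACT_MAX_BYTES - 1. */
        if (dims[i] > (COMPACT_MAX_BYTES - 1) / total)
            return 0;
        total *= dims[i];
    }
    return 1;
}
```

    For example, a 90 x 90 array of 8-byte elements (64,800 bytes) passes the check, while a 100 x 100 array of the same type (80,000 bytes) does not.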

  4. Advantages and Disadvantages to Compact Data Storage:

    Compact datasets are designed to have several benefits:

    There are some drawbacks to compact datasets, however:

  5. Future Enhancements:

    Support other selection types, such as hyperslab and point selections, for parallel I/O.

    Increase the maximum size of datasets that can use compact storage. The current limit comes from the size of an object header message, which is stored in a 16-bit field; widening that field would allow larger object header messages in general.

    Possibly change the library so that datasets smaller than some threshold size are stored as compact datasets by default.

  6. File Format Changes:

    If a dataset is compact, the actual data is stored in the layout header message. Two new fields (the raw data buffer and its size) are added to the layout header message:

    Name: Data Storage - Layout
    Type: 0x0008
    Length: varies
    Status: Required for datasets, may not be repeated
    Purpose: Data layout describes how the elements of a multi-dimensional array are arranged in the linear address space of the file. Three types of data layout are supported:

    1. Compact - The array is small enough to be stored directly in this object header. This layout requires the data to be non-extendible, non-compressible, non-sparse, and not stored externally. Storing data in this format eliminates the disk seek/read/write request normally necessary to read or write raw data.
    2. Contiguous - The array can be stored in one contiguous area of the file. This layout requires that the size of the array be constant and does not permit chunking, compression, checksums, encryption, etc. The message stores the total size of the array; the offset of an element from the beginning of the storage area is computed as in C.
    3. Chunked - The array domain can be regularly decomposed into chunks and each chunk is allocated separately. This layout supports arbitrary element traversals, compression, encryption, and checksums, and the chunks can be distributed across external raw data files (these features are described in other messages). The message stores the size of a chunk instead of the size of the entire array; the size of the entire array can be calculated by traversing the B-tree that stores the chunk addresses.
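
    For the contiguous layout, "computed as in C" means row-major order, where the last index varies fastest. A minimal sketch of that calculation follows; element_offset is a hypothetical helper written for illustration, not a library function.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of element idx[] from the start of the storage area for a
 * row-major (C order) array with the given dimensions. */
static size_t element_offset(const size_t *dims, const size_t *idx,
                             int rank, size_t elem_size)
{
    size_t linear = 0;
    int i;

    assert(dims != NULL && idx != NULL && rank > 0);
    for (i = 0; i < rank; i++) {
        assert(idx[i] < dims[i]);           /* index must lie inside the array */
        linear = linear * dims[i] + idx[i]; /* Horner-style accumulation */
    }
    return linear * elem_size;
}
```

    For a 3 x 4 array of 4-byte elements, element (1, 2) is the seventh element in storage order, so its offset is (1 * 4 + 2) * 4 = 24 bytes.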

    Format:

    Description:

  7. API Function Changes:

    Two API functions had values added to a parameter to support compact dataset storage:





QAK:9/9/02