Compact Data Storage Design Issues in HDF5

Raymond Lu & Quincey Koziol
{slu,koziol}@ncsa.uiuc.edu
September 9, 2002

Document's Audience:
- Current H5 library designers and knowledgeable external developers.
Background Reading:
- Previous versions of this document:
  - August 20, 2002.
Design Issues:

A compact dataset is one small enough that the HDF5 library can store its raw data contiguously, directly in the dataset's object header. The proposed design expands the layout header message to hold the compact dataset's data. If a user defines a dataset as compact, a data buffer and the size of that buffer are appended to the layout message (see the File Format Changes section of this document). The layout message stays in memory for as long as the compact dataset is open, and it is written to disk when the dataset is closed or the file is flushed.

A few highlights of the compact dataset design:
1. Since the maximum size of a header message is 64 KB, the compact dataset's data must be smaller than 65,400 bytes (64 KB minus the space taken by some other layout message fields).
2. The size of the layout header message cannot change after the data space is allocated. A user defining a compact dataset should therefore always set the space allocation time to H5D_SPACE_ALLOC_EARLY through the H5Pset_space_time function.
3. For PHDF5, each process is currently limited to the "all" selection when writing a compact dataset. Hyperslab and point selections are not supported.
4. The compact dataset complies with the new design of fill-values, described in the fill-value document.
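
The size limit in item 1 can be checked before a dataset is declared compact. The following is a minimal sketch in plain C, not library code: the helper names raw_data_size and fits_compact are hypothetical, and the only figure taken from this document is the 65,400-byte limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Limit quoted in the design highlights: compact raw data must be
 * smaller than 65,400 bytes. */
#define COMPACT_MAX_BYTES 65400u

/* Hypothetical helper: total raw-data size in bytes of a dataset with
 * `rank` dimensions and `elem_size`-byte elements.
 * Returns 0 if any extent (or elem_size) is zero, or on overflow. */
static uint64_t raw_data_size(const uint64_t *dims, size_t rank,
                              uint64_t elem_size)
{
    assert(dims != NULL || rank == 0);
    uint64_t total = elem_size;
    for (size_t i = 0; i < rank; i++) {
        if (dims[i] == 0)
            return 0;
        if (total > UINT64_MAX / dims[i])
            return 0; /* product would overflow 64 bits */
        total *= dims[i];
    }
    return total;
}

/* Hypothetical helper: nonzero if the raw data is non-empty and
 * strictly smaller than the compact storage limit. */
static int fits_compact(const uint64_t *dims, size_t rank,
                        uint64_t elem_size)
{
    uint64_t n = raw_data_size(dims, rank, elem_size);
    return n > 0 && n < COMPACT_MAX_BYTES;
}
```

Under these assumptions, a 100 x 100 array of 4-byte elements (40,000 bytes) qualifies, while a 128 x 128 array of the same element type (65,536 bytes) does not.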
Advantages and Disadvantages to Compact Data Storage:

Compact datasets are designed to have several benefits:
- Compact datasets are faster to access than contiguous or chunked datasets, especially when repeatedly accessed. This is because the raw data is read into memory when the dataset is opened and is cached until the dataset is closed.
- Compact datasets use file space more efficiently, since the raw data is stored in the object header instead of in a separately allocated block.
There are some drawbacks to compact datasets, however:
- Because the raw data is cached in memory while the dataset is open, compact datasets use more memory while open.
- Because the raw data is stored in the dataset's object header (which is read into memory when the dataset is opened), compact datasets can be slower to open than datasets stored with other methods.
Future Enhancements:

- Support other selection types (hyperslab and point) for parallel I/O.
- Increase the size of datasets allowed to be stored with compact storage. Currently, this is limited by the size of an object header message, which is stored in a 16-bit field and should be increased to allow larger object header messages in general.
- Possibly change the library to default to storing datasets smaller than some threshold size as compact datasets.

File Format Changes:

If a dataset is compact, the actual data will be stored in the header message for layout information. Two new fields (the raw data buffer and its size) are added to the layout header message:

Name: Data Storage - Layout
Type: 0x0008
Length: varies
Status: Required for datasets, may not be repeated
Purpose: Data layout describes how the elements of a multi-dimensional array are arranged in the linear address space of the file. Three types of data layout are supported:

Compact - The array is small enough to be stored directly in this object header. This layout requires the data to be non-extendible, non-compressible, non-sparse, and not stored externally. Storing data in this format eliminates the disk seek/read/write request normally necessary to read or write raw data.
Contiguous - The array can be stored in one contiguous area of the file. The layout requires that the size of the array be constant and does not permit chunking, compression, checksums, encryption, etc. The message stores the total size of the array, and the offset of an element from the beginning of the storage area is computed as in C (row-major order).
Chunked - The array domain can be regularly decomposed into chunks and each chunk is allocated separately. This layout supports arbitrary element traversals, compression, encryption, and checksums, and the chunks can be distributed across external raw data files (these features are described in other messages). The message stores the size of a chunk instead of the size of the entire array; the size of the entire array can be calculated by traversing the B-tree that stores the chunk addresses.
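
The phrase "computed as in C" in the contiguous layout description refers to row-major ordering, where the last dimension varies fastest. As a brief illustration only, the offset calculation might be sketched as follows; the function name row_major_offset is hypothetical and is not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of element idx[0..rank-1] from the start of the storage
 * area, for a row-major (C order) array with extents dims[0..rank-1]
 * and elem_size-byte elements. Hypothetical helper for illustration. */
static size_t row_major_offset(const size_t *dims, const size_t *idx,
                               size_t rank, size_t elem_size)
{
    size_t linear = 0;
    for (size_t i = 0; i < rank; i++) {
        assert(idx[i] < dims[i]);            /* index must be in range */
        linear = linear * dims[i] + idx[i];  /* Horner-style accumulation */
    }
    return linear * elem_size;
}
```

For a 3 x 4 array of 4-byte elements, element (1, 2) is at linear position 1*4 + 2 = 6, which is byte offset 24.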

Format:

byte	byte	byte	byte
Version	Dimensionality	Layout Class	Reserved
Reserved
Address (for non-compact dataset)
Dimension 0
Dimension 1
...
Data Size (for compact dataset)
Data (for compact dataset)

Description:

Field Name		Description
Version		A version number for the layout message. This document describes version two (2).
Dimensionality		An array has a fixed dimensionality. This field specifies the number of dimension fields that appear later in the message.
Layout Class		The layout class specifies how the other fields of the layout message are to be interpreted. A value of zero (0) indicates compact storage, a value of one (1) indicates contiguous storage, and a value of two (2) indicates chunked storage. Other values might be defined in the future.
Address		For contiguous storage, this is the offset of the first byte of raw data information for the dataset. This offset may contain the value "HADDR_UNDEF" (-1) to indicate the storage space has not been allocated. For chunked storage, this is the offset of the B-tree that is used to look up the offsets of the chunks. For compact storage, this field does not exist.
Dimension 0..n		For contiguous storage or compact storage, the dimensions define the entire byte size of the array, while for chunked storage they define the size of a single chunk.
Compact Data Size		The number of bytes used to store the compact dataset. (This field is only present for compact storage)
Compact Data		The raw data for the dataset. (This field is only present for compact storage)
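
To make the field order above concrete, here is a rough sketch of how a version 2 layout message for a compact dataset might be serialized and read back. It is an illustration under stated assumptions, not the library's encoder: the function names are hypothetical, and the 4-byte little-endian width used for each dimension and for the data size field is an assumption that this document does not specify. The Address field is omitted because it does not exist for compact storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout class values as listed in the field description. */
enum { LAYOUT_COMPACT = 0, LAYOUT_CONTIGUOUS = 1, LAYOUT_CHUNKED = 2 };
#define LAYOUT_VERSION 2
#define LAYOUT_FIXED_BYTES 8 /* version, rank, class, 1 + 4 reserved */

/* ASSUMPTION: multi-byte fields are 4 bytes wide, little-endian. */
static void put_u32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

static uint32_t get_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical encoder: writes the message into buf and returns the
 * number of bytes written, or 0 if buf is too small. */
static size_t encode_compact_layout(uint8_t *buf, size_t cap,
                                    const uint32_t *dims, uint8_t rank,
                                    const void *data, uint32_t data_size)
{
    size_t need = LAYOUT_FIXED_BYTES + 4 * (size_t)rank + 4 + data_size;
    assert(buf != NULL);
    if (cap < need)
        return 0;
    uint8_t *p = buf;
    *p++ = LAYOUT_VERSION;
    *p++ = rank;                 /* Dimensionality */
    *p++ = LAYOUT_COMPACT;       /* Layout Class */
    memset(p, 0, 5);             /* Reserved bytes */
    p += 5;
    for (uint8_t i = 0; i < rank; i++, p += 4)
        put_u32le(p, dims[i]);   /* Dimension 0..n */
    put_u32le(p, data_size);     /* Compact Data Size */
    p += 4;
    if (data_size > 0)
        memcpy(p, data, data_size); /* Compact Data */
    return need;
}

/* Hypothetical reader: returns a pointer to the raw data inside msg and
 * stores its size in *size_out, or returns NULL if msg is not a
 * well-formed compact layout message under the assumptions above. */
static const uint8_t *compact_data(const uint8_t *msg, size_t len,
                                   uint32_t *size_out)
{
    if (len < LAYOUT_FIXED_BYTES || msg[2] != LAYOUT_COMPACT)
        return NULL;
    size_t off = LAYOUT_FIXED_BYTES + 4 * (size_t)msg[1];
    if (len < off + 4)
        return NULL;
    uint32_t n = get_u32le(msg + off);
    if (len - (off + 4) < n)
        return NULL;
    *size_out = n;
    return msg + off + 4;
}
```

Because the raw data travels inside the message itself, reading it back needs no further file address lookup, which matches the motivation given for the compact layout class.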

API Function Changes:

Two API functions had values added to a parameter to support compact dataset storage:

QAK:9/9/02
