Document's Audience:
- Current H5 library designers and knowledgeable external developers.
Background Reading:
- Previous versions of this document:
Design Issues:
A compact dataset is small enough that the HDF5 library can store its raw
data contiguously, directly in a message in the dataset's object header. The
proposed design expands the layout header message to hold the compact
dataset's data. If a user defines a dataset to be compact, a data buffer and
the size of that buffer are appended to the layout message (see the File
Format Changes section of this document). The layout message stays in memory
for as long as the compact dataset is open. When the dataset is closed or the
file is flushed to disk, the layout message is written to disk.
A few highlights of the compact dataset design:
- Since the maximum size of a header message is 64 KB, the compact dataset's
data must be smaller than 65,400 bytes (64 KB minus the size of the other
layout message fields).
- The size of the layout header message cannot change after the data space
is allocated, so a user should always set the space allocation time to
H5D_SPACE_ALLOC_EARLY through the H5Pset_space_time function when defining
a compact dataset.
- For PHDF5, the compact dataset is currently limited to the "all" selection by
each process when writing data. Hyperslab and point selections are not
supported.
- The compact dataset complies with the new design of fill-values, described in
the fill-value document.
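The size limit in the first highlight can be checked before a dataset is declared compact. The following is a minimal sketch, not library code: the names COMPACT_DATA_LIMIT and compact_data_fits are hypothetical, and the limit is simply the 65,400-byte figure quoted above. Rejecting zero-sized datasets is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the limit quoted in the text (data must be smaller than
 * 65,400 bytes). Hypothetical name, not an HDF5 symbol. */
#define COMPACT_DATA_LIMIT 65400u

/* Returns 1 if a dataset with the given dimensions and element size would
 * be smaller than COMPACT_DATA_LIMIT bytes, 0 otherwise.
 * A scalar dataset is described by ndims == 0. */
int compact_data_fits(const uint64_t *dims, unsigned ndims, size_t elmt_size)
{
    uint64_t total;
    unsigned i;

    assert(dims != NULL || ndims == 0);

    if (elmt_size == 0 || elmt_size >= COMPACT_DATA_LIMIT)
        return 0;
    total = (uint64_t)elmt_size;

    for (i = 0; i < ndims; i++) {
        if (dims[i] == 0)
            return 0;   /* this sketch rejects zero-sized datasets */
        /* Both operands are below the limit here, so the product cannot
         * overflow 64 bits; bail out as soon as it reaches the limit. */
        if (dims[i] >= COMPACT_DATA_LIMIT ||
            total * dims[i] >= COMPACT_DATA_LIMIT)
            return 0;
        total *= dims[i];
    }
    return 1;
}
```

For example, a 100 x 100 array of 4-byte elements (40,000 bytes) passes the check, while a 128 x 128 array of the same type (65,536 bytes) does not.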
Advantages and Disadvantages to Compact Data Storage:
Compact datasets are designed to have several benefits:
- Compact datasets are faster to access than contiguous or chunked
datasets, especially when repeatedly accessed. This is because the raw
data is read into memory when the dataset is opened and is cached until
the dataset is closed.
- Compact datasets allow space in the file to be used more efficiently.
There are some drawbacks to compact datasets, however:
- Because the raw data is cached in memory while the dataset is open,
compact datasets use more memory while open.
- Because the raw data is stored in the dataset's object header (which is
read into memory when the dataset is opened), compact datasets can be
slower to open than datasets stored with other methods.
Future Enhancements:
- Support other selection types for parallel I/O.
- Increase the size of datasets that can be stored with compact storage.
Currently, this is limited by the size of an object header message, which is
stored in a 16-bit field; that field should be widened to allow larger object
header messages in general.
- Possibly change the library to default to storing datasets smaller than some
threshold size as compact datasets.
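The threshold idea in the last enhancement could look roughly like the sketch below. Everything here is hypothetical: the type picked_layout_t, the function pick_default_layout, and the threshold parameter do not exist in the library, and the hard limit is the 65,400-byte figure given earlier in this document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not HDF5 symbols. */
typedef enum { PICK_COMPACT, PICK_CONTIGUOUS } picked_layout_t;

/* Assumption: compact data must be smaller than this many bytes. */
#define COMPACT_HARD_LIMIT 65400u

/* Choose a default layout from the raw data size in bytes. Datasets at or
 * below the threshold become compact; everything else stays contiguous. */
picked_layout_t pick_default_layout(size_t nbytes, size_t threshold)
{
    /* The threshold can never exceed what a header message can hold. */
    assert(threshold < COMPACT_HARD_LIMIT);

    if (nbytes > 0 && nbytes <= threshold)
        return PICK_COMPACT;
    return PICK_CONTIGUOUS;
}
```

With a threshold of 4096 bytes, a 512-byte dataset would default to compact storage and a 4097-byte dataset would remain contiguous.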
File Format Changes:
If a dataset is compact, the actual data is stored in the layout header
message. Two new fields (the raw data buffer and its size) are added to the
layout header message:
Name: Data Storage - Layout
Type: 0x0008
Length: varies
Status: Required for datasets, may not be repeated
Purpose: Data layout describes how the elements of a
multi-dimensional array are arranged in the linear address space of the file.
Three types of data layout are supported:
- Compact - The array is small enough to be stored directly
in this object header. This layout requires the data to be
non-extendible, non-compressible, non-sparse, and not stored externally.
Storing data in this format eliminates the disk seek/read/write request
normally necessary to read or write raw data.
- Contiguous - The array can be stored in one contiguous
area of the file. The layout requires that the size of the array be
constant and does not permit chunking, compression, checksums, encryption,
etc. The message stores the total size of the array and the offset of an
element from the beginning of the storage area is computed as in C.
- Chunked - The array domain can be regularly decomposed
into chunks and each chunk is allocated separately. This layout supports
arbitrary element traversals, compression, encryption, and checksums, and
the chunks can be distributed across external raw data files (these
features are described in other messages). The message stores the size of
a chunk instead of the size of the entire array; the size of the entire
array can be calculated by traversing the B-tree that stores the chunk
addresses.
Format:
byte | byte | byte | byte
Version | Dimensionality | Layout Class | Reserved
Reserved
Address (for non-compact dataset)
Dimension 0
Dimension 1
...
Data Size (for compact dataset)
Data (for compact dataset)
Description:
Field Name | Description
Version | A version number for the layout message. This document describes version two (2).
Dimensionality | An array has a fixed dimensionality. This field specifies the number of dimensions later in the message.
Layout Class | The layout class specifies how the other fields of the layout message are to be interpreted. A value of zero (0) indicates compact storage, a value of one (1) indicates contiguous storage, and a value of two (2) indicates chunked storage. Other values might be defined in the future.
Address | For contiguous storage, this is the offset of the first byte of raw data information for the dataset. This offset may contain the value "HADDR_UNDEF" (-1) to indicate the storage space has not been allocated. For chunked storage, this is the offset of the B-tree that is used to look up the offsets of the chunks. For compact storage, this field does not exist.
Dimension 0..n | For contiguous storage or compact storage, the dimensions define the entire byte size of the array, while for chunked storage they define the size of a single chunk.
Compact Data Size | The number of bytes used to store the compact dataset. (This field is only present for compact storage.)
Compact Data | The raw data for the dataset. (This field is only present for compact storage.)
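To make the compact form of the message concrete, the sketch below encodes and decodes one in a byte buffer. It is an illustration under stated assumptions, not the library's encoder: all function and macro names are hypothetical, each dimension and the data size are assumed to be 4-byte little-endian integers, and the second Reserved field is assumed to be 4 bytes wide. Only the field order and the layout class values are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LAYOUT_VERSION 2u   /* version described in this document */
#define LAYOUT_COMPACT 0u   /* layout class value for compact storage */

/* Assumption: integers are stored as 4-byte little-endian values. */
static void put_u32_le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

static uint32_t get_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* 8 header bytes, ndims dimensions, the data size field, then the data.
 * A compact message carries no Address field. */
size_t compact_msg_size(unsigned ndims, uint32_t data_size)
{
    return 8 + 4 * (size_t)ndims + 4 + (size_t)data_size;
}

/* Writes a compact layout message into buf.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t encode_compact_layout(uint8_t *buf, size_t buf_len,
                             const uint32_t *dims, uint8_t ndims,
                             const void *data, uint32_t data_size)
{
    size_t need = compact_msg_size(ndims, data_size);
    uint8_t *p = buf;
    unsigned i;

    assert(buf != NULL);
    if (buf_len < need)
        return 0;

    *p++ = LAYOUT_VERSION;
    *p++ = ndims;
    *p++ = LAYOUT_COMPACT;
    *p++ = 0;                       /* reserved byte */
    put_u32_le(p, 0);               /* reserved word */
    p += 4;
    for (i = 0; i < ndims; i++) {
        put_u32_le(p, dims[i]);
        p += 4;
    }
    put_u32_le(p, data_size);
    p += 4;
    if (data_size > 0)
        memcpy(p, data, data_size);
    return need;
}

/* Locates the raw data inside an encoded compact message.
 * Returns a pointer into buf and stores the data size, or NULL if the
 * buffer is truncated or is not a version 2 compact message. */
const uint8_t *decode_compact_data(const uint8_t *buf, size_t buf_len,
                                   uint32_t *data_size)
{
    size_t off;

    if (buf_len < 8 || buf[0] != LAYOUT_VERSION || buf[2] != LAYOUT_COMPACT)
        return NULL;
    off = 8 + 4 * (size_t)buf[1];   /* skip header and dimensions */
    if (buf_len < off + 4)
        return NULL;
    *data_size = get_u32_le(buf + off);
    if (buf_len - (off + 4) < *data_size)
        return NULL;
    return buf + off + 4;
}
```

Under these assumptions, a 2 x 3 dataset with 6 bytes of raw data encodes to 26 bytes: 8 header bytes, two 4-byte dimensions, a 4-byte data size, and the data itself.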
API Function Changes:
Two API functions accept or return a new value, H5D_COMPACT, to support
compact dataset storage:
- Name:
- H5Pset_layout
- Purpose:
- Sets the type of storage used to store the raw data for a dataset.
- Signature:
- herr_t H5Pset_layout(hid_t dcpl_id, H5D_layout_t layout)
- Parameters:
- hid_t dcpl_id
- IN: Dataset creation property list to modify
- H5D_layout_t layout
- IN: Type of storage layout for raw data.
- Return Value:
- Returns non-negative on success, negative on failure.
- Description:
H5Pset_layout sets the type of storage used to store the
raw data for a dataset. This function is only valid for dataset
creation property lists.
Valid parameters for layout are:
- H5D_COMPACT
- Store raw data in the object header in the file.
This should only be used for small amounts of raw data
(currently limited to ~64KB).
- H5D_CONTIGUOUS
- Store raw data separately from the object header in one
large chunk in the file.
- H5D_CHUNKED
- Store raw data separately from the object header in chunks
of data in separate locations in the file.
- Name:
- H5Pget_layout
- Purpose:
- Gets the layout method used to store the raw data for a dataset.
- Signature:
- H5D_layout_t H5Pget_layout(hid_t dcpl_id)
- Parameters:
- hid_t dcpl_id
- IN: Dataset creation property list to query
- Return Value:
- Returns the layout type (a non-negative value) on success, negative on
failure.
Non-error return values are:
- H5D_COMPACT
- Raw data is stored in the object header in the file.
- H5D_CONTIGUOUS
- Raw data is stored separately from the object header in one
large chunk in the file.
- H5D_CHUNKED
- Raw data is stored separately from the object header in
chunks of data in separate locations in the file.
- Description:
H5Pget_layout returns the type of storage used to store the
raw data for a dataset. This function is only valid for dataset
creation property lists.