Design Issues of HDF5 Compact Dataset

                                                           

I. Design Issues

 

A compact dataset is small enough that the HDF5 library can store its raw data contiguously, directly in the dataset's header message.  The proposed design expands the dataset layout header message to hold the compact dataset's data.  If a user defines a dataset to be compact, an additional data buffer and the size of that buffer are appended to the layout message (see the File Format Changes section of this document).  The layout message stays in memory for as long as the compact dataset is open.  When the dataset is closed or the file is flushed to disk, the layout message is written to disk.

 

A few highlights of the compact dataset design:

 

1. Since the maximum size of a header message is 64 KB, the compact dataset data must be smaller than 65,400 bytes (64 KB minus the size of some other layout fields).

2. The size of the layout header message must not change once the data space is allocated.  A user who defines a compact dataset should therefore always set the space allocation time to H5D_SPACE_ALLOC_EARLY through the H5Pset_space_time function.

3. For PHDF5 (parallel HDF5), each process must select the entire data space when writing a compact dataset.  Hyperslab selections are not supported.

4. The compact dataset complies with the new design of fill value, described in the fill value document.
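The size limit in item 1 can be checked before a dataset is declared compact. The following is a minimal sketch, not part of the HDF5 library: the helper names dataset_nbytes and fits_compact and the constant COMPACT_MAX_BYTES are hypothetical, and only the 65,400-byte bound is taken from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Upper bound on compact raw data quoted in this document. */
#define COMPACT_MAX_BYTES 65400u

/* Hypothetical helper: total raw-data size in bytes for the given shape.
 * Returns 0 for an empty or invalid shape, or if the product overflows. */
uint64_t dataset_nbytes(const uint64_t *dims, size_t rank, size_t elem_size)
{
    assert(dims != NULL);
    if (rank == 0 || elem_size == 0)
        return 0;

    uint64_t total = elem_size;
    for (size_t i = 0; i < rank; i++) {
        if (dims[i] == 0)
            return 0;
        if (total > UINT64_MAX / dims[i])
            return 0; /* multiplication would overflow */
        total *= dims[i];
    }
    return total;
}

/* Hypothetical helper: returns 1 if the raw data is strictly smaller
 * than the stated limit, 0 otherwise. */
int fits_compact(const uint64_t *dims, size_t rank, size_t elem_size)
{
    uint64_t n = dataset_nbytes(dims, rank, elem_size);
    return n != 0 && n < COMPACT_MAX_BYTES;
}
```

Under these assumptions a 100 x 100 array of 4-byte elements (40,000 bytes) qualifies, while a 128 x 128 array of the same type (65,536 bytes) does not.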

 

 

II. File Format Changes

 

If a dataset is compact, the actual data is stored in the header message for layout information.  Two new fields (the raw data buffer and its size) are added to the layout header message.

 

Name: Data Storage - Layout
Type: 0x0008
Length: varies
Status: Required for datasets, may not be repeated
Purpose: Data layout describes how the elements of a multi-dimensional array are arranged in the linear address space of the file. Three types of data layout are supported: compact, contiguous, and chunked.

Format:

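+----------------+----------------+----------------+----------------+
| byte           | byte           | byte           | byte           |
+----------------+----------------+----------------+----------------+
| Version        | Dimensionality | Layout Class   | Reserved       |
+----------------+----------------+----------------+----------------+
| Reserved                                                          |
+-------------------------------------------------------------------+
| Address (for non-compact dataset)                                 |
+-------------------------------------------------------------------+
| Dimension 0                                                       |
+-------------------------------------------------------------------+
| Dimension 1                                                       |
+-------------------------------------------------------------------+
| ...                                                               |
+-------------------------------------------------------------------+
| Data size (for compact dataset)                                   |
+-------------------------------------------------------------------+
| Data (for compact dataset)                                        |
+-------------------------------------------------------------------+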

 

Description:

Field Name            Description

Version            A version number for the layout message. This document describes version two (2).

Dimensionality            An array has a fixed dimensionality. This field specifies the number of dimension fields that appear later in the message.

Layout Class            The layout class specifies how the other fields of the layout message are to be interpreted. A value of zero (0) indicates compact storage, a value of one (1) indicates contiguous storage, and a value of two (2) indicates chunked storage. Other values might be defined in the future.

Address            For contiguous storage, this is the offset of the first byte of raw data for the dataset. This offset may contain the value "HADDR_UNDEF" (-1) to indicate that the storage space has not been allocated. For chunked storage, this is the offset of the B-tree that is used to look up the offsets of the chunks.  For compact storage, this field does not exist.

Dimension 0…n            For contiguous storage or compact storage, the dimensions define the entire byte size of the array, while for chunked storage they define the size of a single chunk.

Data size            This field is present only for compact storage.  It holds the size in bytes of the compact dataset's raw data.

Data            This field is present only for compact storage.  It holds the raw data of the compact dataset.
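The field order above can be illustrated by serializing a compact layout message into a byte buffer. This is only a sketch under stated assumptions, not the library's encoder: the function names encode_compact_layout and put_u32le are hypothetical, and the field widths (a 4-byte second Reserved field, 4-byte little-endian Dimension and Data size fields) are assumptions that this document does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout class values as listed in the field descriptions. */
enum { LAYOUT_COMPACT = 0, LAYOUT_CONTIGUOUS = 1, LAYOUT_CHUNKED = 2 };

/* Store a 32-bit value in little-endian byte order (assumed encoding). */
void put_u32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Hypothetical encoder for a version-2 compact layout message.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t encode_compact_layout(uint8_t *out, size_t out_cap,
                             const uint32_t *dims, uint8_t rank,
                             const void *data, uint32_t data_size)
{
    assert(out != NULL && dims != NULL);
    assert(data != NULL || data_size == 0);

    /* 8 header bytes + dimensions + data size field + raw data */
    size_t need = 8 + 4 * (size_t)rank + 4 + (size_t)data_size;
    if (need > out_cap)
        return 0;

    size_t pos = 0;
    out[pos++] = 2;              /* Version */
    out[pos++] = rank;           /* Dimensionality */
    out[pos++] = LAYOUT_COMPACT; /* Layout Class */
    memset(out + pos, 0, 5);     /* Reserved: 1 byte, then 4 bytes (assumed) */
    pos += 5;

    /* The Address field is omitted for compact storage. */
    for (uint8_t i = 0; i < rank; i++) {
        put_u32le(out + pos, dims[i]); /* Dimension i */
        pos += 4;
    }

    put_u32le(out + pos, data_size);   /* Data size */
    pos += 4;
    if (data_size > 0)
        memcpy(out + pos, data, data_size); /* Data */
    pos += data_size;
    return pos;
}
```

With these assumed widths, a 2 x 3 dataset holding 6 bytes of raw data encodes to 26 bytes: 8 header bytes, two 4-byte dimensions, a 4-byte data size, and the data itself.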

 

 

III. API Function Changes

 

Name: H5Pset_layout

Signature:

herr_t H5Pset_layout(hid_t plist, H5D_layout_t layout )

Purpose:

Sets the type of storage used to store the raw data for a dataset.

Description:

H5Pset_layout sets the type of storage used to store the raw data for a dataset. This function is only valid for dataset creation property lists. Valid parameters for layout are:

H5D_COMPACT

Store raw data and object header contiguously in the file. This should only be used for small amounts of raw data: the data must be smaller than 65,400 bytes, which is just under 64 KB.

H5D_CONTIGUOUS

Store raw data separately from object header in one large chunk in the file.

H5D_CHUNKED

Store raw data separately from object header as chunks of data in separate locations in the file.

Parameters:

hid_t plist

IN: Identifier of property list to modify.

H5D_layout_t layout

IN: Type of storage layout for raw data.

Returns:

Returns a non-negative value if successful; otherwise returns a negative value.

 

 

Name: H5Pget_layout

Signature:

H5D_layout_t H5Pget_layout(hid_t plist)

Purpose:

Returns the layout of the raw data for a dataset.

Description:

H5Pget_layout returns the layout of the raw data for a dataset. This function is only valid for dataset creation property lists. Valid types for layout are:

H5D_COMPACT

Raw data is stored in the object header in the file.

H5D_CONTIGUOUS

Raw data is stored separately from the object header in one large chunk in the file.

H5D_CHUNKED

Raw data is stored separately from the object header in chunks in separate locations in the file.

Parameters:

hid_t plist

IN: Identifier for property list to query.

Returns:

Returns the layout type of a dataset creation property list if successful. Otherwise returns H5D_LAYOUT_ERROR (-1).

 

 

Aug 20, 2002