File Format Changes

 

This document describes the HDF5 file format changes related to the proposed dataset fill value changes.  Please also refer to the proposal and API document for related information.

 

This document focuses on two modified header message formats.  In the layout message, the “address” value can now express unallocated storage space.  In the fill value message, four new fields have been added.  The sections below describe each change in detail.

 

Name: Data Storage – Layout

 

Type: 0x0008

Length: varies

Status: Required for datasets, may not be repeated.

 

Purpose and Description:  Data layout describes how the elements of a multi-dimensional array are arranged in the linear address space of the file.  Two types of data layout are supported:

  1. The array can be stored in one contiguous area of the file. The layout requires that the size of the array be constant and does not permit chunking, compression, checksums, encryption, etc. The message stores the total size of the array; the offset of an element from the beginning of the storage area is computed as in C (row-major order).
  2. The array domain can be regularly decomposed into chunks and each chunk is allocated separately. This layout supports arbitrary element traversals, compression, encryption, and checksums, and the chunks can be distributed across external raw data files (these features are described in other messages). The message stores the size of a chunk instead of the size of the entire array; the size of the entire array can be calculated by traversing the B-tree that stores the chunk addresses.
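
The "computed as in C" rule for contiguous storage can be sketched as follows. This is only an illustration, not code from the HDF5 library: the function name element_offset and its element_size parameter are hypothetical, and it assumes every element occupies the same number of bytes.

```python
def element_offset(index, dims, element_size):
    """Byte offset of one element from the start of contiguous storage.

    Uses C (row-major) ordering: the last dimension varies fastest.
    element_size is an assumed, fixed per-element size in bytes.
    """
    if len(index) != len(dims):
        raise ValueError("index and dims must have the same rank")
    linear = 0
    for i, n in zip(index, dims):
        if not 0 <= i < n:
            raise IndexError("index out of bounds")
        # Fold the coordinates into one linear element number.
        linear = linear * n + i
    return linear * element_size
```

For a 3 x 4 array of 4-byte elements, element (1, 2) is the seventh element in storage order, so its offset is 6 * 4 bytes.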

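+-----------------------+-----------------------+-----------------------+-----------------------+
| byte                  | byte                  | byte                  | byte                  |
+-----------------------+-----------------------+-----------------------+-----------------------+
| Version               | Dimensionality        | Layout Class          | Reserved              |
+-----------------------+-----------------------+-----------------------+-----------------------+
| Reserved                                                                                      |
+-----------------------+-----------------------+-----------------------+-----------------------+
| Address                                                                                       |
+-----------------------+-----------------------+-----------------------+-----------------------+
| Dimension 0 (4 bytes)                                                                         |
+-----------------------+-----------------------+-----------------------+-----------------------+
| Dimension 1 (4 bytes)                                                                         |
+-----------------------+-----------------------+-----------------------+-----------------------+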

 

                        Field Name                                         Description

Version                        A version number for the layout message. This document describes version two.  (A word about backward compatibility:  To minimize risk, the version number is set to two if storage space is not allocated when the dataset is created; it is set to one if storage space is allocated when the dataset is created.)

Dimensionality              An array has a fixed dimensionality. This field specifies the number of dimension size fields later in the message.

Layout Class                The layout class specifies how the other fields of the layout message are to be interpreted. A value of one indicates contiguous storage while a value of two indicates chunked storage. Other values will be defined in the future.

Address                       For contiguous storage, this is the address of the first byte of storage. The address is initialized to HADDR_UNDEF (-1) to indicate that storage space has not been allocated.  For chunked storage, this is the address of the B-tree that is used to look up the addresses of the chunks.

Dimensions                   For contiguous storage the dimensions define the entire size of the array while for chunked storage they define the size of a single chunk.
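
As a rough sketch of how a reader might decode the layout message described above, the following example unpacks the fields in order. Several details are assumptions and are not stated in this document: little-endian byte order, a 4-byte second Reserved field, and an 8-byte address (a real file records its address width elsewhere). The names parse_layout_message and LAYOUT_CLASSES are hypothetical.

```python
import struct

# -1 stored as an unsigned 8-byte address (the 8-byte width is an assumption).
HADDR_UNDEF = 0xFFFFFFFFFFFFFFFF

LAYOUT_CLASSES = {1: "contiguous", 2: "chunked"}


def parse_layout_message(buf):
    """Decode a layout message laid out as in the table above."""
    # Version, Dimensionality, Layout Class, Reserved (1 byte each),
    # followed by a second Reserved field assumed to be 4 bytes wide.
    version, ndims, layout_class, _res1, _res2 = struct.unpack_from("<BBBB4s", buf, 0)
    (address,) = struct.unpack_from("<Q", buf, 8)
    # One 4-byte size per dimension follows the address.
    dims = struct.unpack_from("<%dI" % ndims, buf, 16)
    return {
        "version": version,
        "layout_class": LAYOUT_CLASSES.get(layout_class, "unknown"),
        # None stands for "storage space has not been allocated".
        "address": None if address == HADDR_UNDEF else address,
        "dims": dims,
    }
```

Returning None for an undefined address keeps callers from treating HADDR_UNDEF as a real file offset.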

 

 

Name: Data Storage – Fill Value

 

Type: 0x0005

Length: varies

Status: Optional, may not be repeated.

 

The fill value message stores a single data value (which may be of a compound datatype) and its related properties: space allocation time, fill value write time, and whether the fill value is defined.  Whether the fill value is written to the dataset or returned to the user depends on these properties.  The fill value is interpreted using the same datatype as the dataset.

 

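+-----------------------+-----------------------+-----------------------+-----------------------+
| byte                  | byte                  | byte                  | byte                  |
+-----------------------+-----------------------+-----------------------+-----------------------+
| Version               | Space allocate time   | Fill value write time | Fill value defined    |
+-----------------------+-----------------------+-----------------------+-----------------------+
| Size (4 bytes)                                                                                |
+-----------------------+-----------------------+-----------------------+-----------------------+
| Fill value                                                                                    |
+-----------------------+-----------------------+-----------------------+-----------------------+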

 

                        Field Name                                         Description

Version                        A version number for the fill value message.  This document describes version one.

Space allocate time       When to allocate storage space.  A value of one allocates space as early as possible, when the dataset is created; a value of two allocates space as late as possible, when the user’s data is written to the dataset.

Fill value write time       When to write the fill value to the dataset.  A value of zero means the fill value is never written; a value of one means the fill value is written to the entire dataset once storage space is allocated.

Fill value defined           Whether the fill value is defined.  A value of zero means undefined; a value of one means defined (default or user-defined).  If the fill value is undefined, the “size” field has the value zero and the “fill value” field is not present.

Size (4 bytes)               The size of the fill value field in bytes.  If the fill value is of a compound datatype, this is the size of the whole compound datatype.

Fill value                       The actual fill value.  The bytes of the fill value are interpreted using the same datatype as for the dataset.
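
The field descriptions above can be exercised with a small decoding sketch. It assumes little-endian byte order, which this document does not state, and the name parse_fill_value_message is hypothetical. The fill value bytes are returned raw, because interpreting them requires the dataset's datatype.

```python
import struct


def parse_fill_value_message(buf):
    """Decode a version-one fill value message as described above."""
    # Four 1-byte fields followed by the 4-byte size field.
    version, alloc_time, write_time, defined, size = struct.unpack_from("<BBBBI", buf, 0)
    if not defined:
        # An undefined fill value has size zero and no fill value field.
        if size != 0:
            raise ValueError("undefined fill value must have size zero")
        fill = None
    else:
        fill = bytes(buf[8:8 + size])
        if len(fill) != size:
            raise ValueError("message is shorter than its size field claims")
    return {
        "version": version,
        "space_allocate_time": alloc_time,
        "fill_value_write_time": write_time,
        "defined": bool(defined),
        "fill_value": fill,
    }
```

If the dataset's datatype were, for example, a 4-byte little-endian signed integer, the caller could decode the returned bytes with struct.unpack("<i", fill_value).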