Dimension Scales in HDF5: Preliminary Ideas

Robert E. McGrath
16 May, 2001

1. Introduction and Rationale


HDF4 implements Dimension Scales for Scientific Datasets.  That implementation includes both a standard storage scheme for dimensions and a programming model.  HDF5 has no specification for Dimension Scales, although one has been frequently requested.  This feature is needed both to bring forward HDF4 data and applications, and to support new applications.

Dimension scales in HDF5 have been partly addressed in previous work.  Two experiments have suggested possible approaches to dimension scales based on netCDF.  The "NetCDF-H5 Prototype" explored a fairly complete implementation of netCDF on top of HDF5 [1].  This work proposed a storage scheme and software to implement netCDF's model of dimensions.  A later study, "Experiment with XSL," converted netCDF files to HDF5 files via XML and XSL style sheets [2].  This work used a storage layout for dimensions different from that of [1], and did not address any issues of programming model or compatibility.

The HDF4 to HDF5 Mapping is an official specification for a default representation of HDF4 objects in an HDF5 file, and it includes a prescribed layout for storing the dimension scales of an HDF4 object in an HDF5 file ([3], section 3.1).  It does not, however, suggest a programming model, nor does it consider requirements for using dimensions in HDF5 data created de novo (i.e., not converted from HDF4).

While previous efforts have aimed to simulate HDF4 or netCDF, the fact is that HDF5 is a clean slate. We can implement whatever concepts we want, with whatever programming model and interfaces we want.

Based on the previous work above, it seems certain that these features can be implemented as part of the HDF5 'lite' API, with no changes to the core HDF5 library.  Other software may need to be updated, particularly the h4toh5 converter and other tools.
 

2. Requirements


I believe that a specification for Dimension Scales needs the following pieces:

  1. A conceptual model of dimension scales
  2. A storage model for Dimension Scale objects and their association with datasets
  3. A programming model and API

These components are discussed in section 3, below.

The design should meet the following requirements:

  1. The conceptual model should be a superset of HDF4 (and netCDF)
    1. The concepts supported in HDF4 should be supported by HDF5
    2. HDF5 may provide features not supported by HDF4 or netCDF
  2. The storage model should be simple and efficient
    1. No changes to the HDF5 library should be required
    2. Ideally, the storage design should be similar to what the HDF4 to HDF5 Mapping currently specifies, unless there is a good reason to do something else
  3. The programming model should be a superset of HDF4 (and netCDF)
    1. Users should be able to create and access dimension scale information in ways analogous to HDF4
    2. HDF5 may provide additional features

3. Design Issues

3.1. Conceptual Model

A dimension scale is essentially a means of labeling the axes of a multidimensional array. Clearly, there are many ways to do this, and many interpretations of a given label. For HDF, the goal is to store an adequate representation of the desired labels, along with an appropriate association between each axis and its label(s), if any.
 
Table 1. Some conceptual variations of dimension scales (case: description and comments).
  No scale:  No scale is needed, or else the application has its own means of labeling the axis.
  Start value plus increment:  A linear scale, e.g., the dimension scale values for x are A + Bx, where A and B are stored.
  Arbitrary generating function:  A generalization of the above, where the dimension scale values for x are defined by some function f(x), including log, exp, etc.
  Stored list:  The scale value at each point on dimension x is stored explicitly in a 1D array.  (Note that this can support types other than numbers.)
  Partial and/or multi-list:  Scale values for some points are stored explicitly, with each scale point potentially stored individually.  Possibly more than one scale value per point on the dimension.

Table 1 lists some possible conceptual variations for stored dimension scales. HDF4 supports 'No scale' and 'Stored list'.  The 'Arbitrary generating function' could be difficult to implement unless the functions allowed are constrained (as in 'Start plus increment').  Partial and multi-lists would not be especially difficult to store, but the programming model might be complex.
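
To make the 'Start value plus increment' case concrete, the sketch below (plain C, with made-up names) expands a linear scale A + Bx into the explicit 1-D array that a 'Stored list' scale would hold; only A, B, and the dimension size would need to be stored in the file.

    /* Illustrative only: expand a 'start value plus increment' scale
     * (values A + B*x) into the explicit 1-D array that a stored-list
     * scale would contain.  Names and types are made up for this sketch. */
    #include <stdlib.h>

    double *make_linear_scale(double A, double B, size_t n)
    {
        double *scale = (double *) malloc(n * sizeof(double));
        size_t  x;

        if (scale == NULL)
            return NULL;
        for (x = 0; x < n; x++)
            scale[x] = A + B * (double) x;   /* scale value for index x */
        return scale;
    }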

Whatever the conceptual features supported, the programming model should be as simple and clear as possible, with reasonable default behaviors.  For instance, it should not be necessary to call any API just to avoid using dimension scales, and it should be possible to add a dimension scale to, or delete one from, a dataset at most points in the object's life cycle.  Thus, the selected set of features must be evaluated as a whole package.

Considering Dimension Scales as objects, two properties must be addressed.  First, dimensions may be shared, and the sharing (and 'addressing') might occur at several scopes.  Second, a Dimension Scale may have any of several data types, not necessarily the data type of the data in the array.
 
Table 2. Some scopes of dimension sharing (case: description and notes).
  No sharing:  The dimension is local to the object and not visible to any other object.
  Shared within a local scope (e.g., a group):  The dimension is shared among a local set of objects.
  Shared within a single file:  The dimension is global to the file.
  Shared, possibly in an external file:  The dimension may be in another HDF5 file.
  Shared from any URI:  The dimension scale may be stored on the internet, in any format (e.g., XML).

Table 2 lists some of the ways that a dimension scale might be shared.  Shared dimensions within a file are widely used in netCDF and HDF4, so that data in multiple datasets can be correctly related to each other.  HDF5 could also support local sharing (e.g., within a Group), and schemes for dimension scales external to the file can be imagined.

In any case, it is very important that the user be able to understand and control the sharing of dimensions.  Also, the implementation of the conceptual cases in Table 1 needs to support sharing in a consistent fashion.
 
Table 3. Data types for dimension scales (case: description and comments).
  Sequences of atomic numbers:  Numerical axes.  Might be constrained to be in the range of legal generating functions.
  Any type, including string and compound:  Any user-defined values, including strings.

Table 3 gives two general alternatives for the data type of a dimension scale.  If the scale is stored explicitly, then HDF5 can easily support a dimension with any data type supported by HDF5.  If a generating function is used, then the function will define the data type of its range.  In this case, the dimension scale might realistically be considered to have the type 'FUNCTION', which would be a new feature for HDF5!
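
As a sketch of the 'any type' case, the fragment below writes an explicitly stored scale whose elements are fixed-length strings.  It uses the C API of the day (HDF5 1.4-era signatures); the dataset name and label values are invented for illustration.

    /* Sketch: an explicitly stored dimension scale with a string datatype.
     * HDF5 1.4-era C API; the dataset name and labels are illustrative. */
    #include "hdf5.h"

    static const char labels[4][8] = {"north", "east", "south", "west"};

    void write_string_scale(hid_t file)
    {
        hsize_t dims[1] = {4};
        hid_t   space = H5Screate_simple(1, dims, NULL);
        hid_t   type  = H5Tcopy(H5T_C_S1);
        hid_t   dset;

        H5Tset_size(type, 8);               /* 8-byte fixed-length strings */
        dset = H5Dcreate(file, "compass_scale", type, space, H5P_DEFAULT);
        H5Dwrite(dset, type, H5S_ALL, H5S_ALL, H5P_DEFAULT, labels);

        H5Dclose(dset);
        H5Tclose(type);
        H5Sclose(space);
    }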

3.2. Storage model for a Dimension Scale Object

This section assumes that a Dimension Scale object is defined, which implements some or all of the variations described above.  How should these objects be stored in HDF5, and how should they be associated with the dimensions they label?

Storage

HDF5 offers two primary choices for storing a dimension scale for an object:  as an Attribute of the object or as a separate Dataset pointed to by an Attribute of the object.  Ideally, a dimension scale would be an attribute of a data space, but HDF5 does not support this.  Since the data space is always tightly bound to a dataset, there is no problem attaching the dimensions to the dataset.

For a dimension scale that is stored as explicit values, storing the scale as an HDF5 attribute has several drawbacks.  Dimension scales may be large, they may be 'unlimited' (so they can grow), and dimension scales need to be shared.  HDF5 attributes support none of these features.  Therefore, a dimension scale object will almost certainly be stored as one or more HDF5 objects, referenced by an attribute.
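
A minimal sketch of this layout follows, using the HDF5 1.4-era C API: the scale values live in their own dataset, and an object reference to that dataset is stored in an attribute of the data dataset.  The attribute name "DIMENSION_0" is hypothetical, standing for 'scale for dimension 0'; the actual naming convention is one of the design questions discussed below.

    /* Sketch of the 'separate dataset referenced by an attribute' layout.
     * The attribute name "DIMENSION_0" is hypothetical. */
    #include "hdf5.h"

    void attach_scale_by_reference(hid_t file, hid_t data_dset,
                                   const char *scale_path)
    {
        hobj_ref_t ref;
        hsize_t    one = 1;
        hid_t      aspace, attr;

        /* Build an object reference to the dimension scale dataset. */
        H5Rcreate(&ref, file, scale_path, H5R_OBJECT, -1);

        /* Store the reference in an attribute of the data dataset. */
        aspace = H5Screate_simple(1, &one, NULL);
        attr   = H5Acreate(data_dset, "DIMENSION_0", H5T_STD_REF_OBJ,
                           aspace, H5P_DEFAULT);
        H5Awrite(attr, H5T_STD_REF_OBJ, &ref);

        H5Aclose(attr);
        H5Sclose(aspace);
    }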

For the other possibilities, the storage depends on the feature.  A description of a generating function might be stored as an attribute, either as a string or as a compound datatype, or as a simple shared dataset.  The sharing and growth properties of such a scale are less clear, but are probably the same as for a stored array.  If multiple dimension scales (or piecewise functions) are supported, they could be grouped in an HDF5 Group.

It might be noted that, to replicate netCDF (and HDF4) semantics, a shared, unlimited dimension has a single current size throughout the file: if one dataset using that dimension extends it, all other datasets using the dimension must grow by the same amount.  For this reason, and to support some aspects of the programming model, there will likely need to be some sort of stored table or index to track all the dimensions and their associations.  (See Yeager's prototype for an example of the required data structures [1].)
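
The sketch below illustrates the growth rule, assuming the using datasets have already been looked up (e.g., from such a table) and, for simplicity, that each of them uses the shared dimension as its first axis.  H5Dextend is the dataset-extension call in the HDF5 1.4-era C API.

    /* Sketch: extend a shared, unlimited dimension by growing every
     * dataset that uses it.  Assumes the caller supplies the list of
     * using datasets and that the shared dimension is axis 0 of each. */
    #include "hdf5.h"

    void extend_shared_dimension(hid_t *users, int n_users, hsize_t new_size)
    {
        int i;

        for (i = 0; i < n_users; i++) {
            hsize_t dims[H5S_MAX_RANK];
            hid_t   space = H5Dget_space(users[i]);

            H5Sget_simple_extent_dims(space, dims, NULL);
            dims[0] = new_size;                /* grow the shared dimension */
            H5Dextend(users[i], dims);
            H5Sclose(space);
        }
    }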

Names

It is almost certain that dimension scales will be shared objects.  This means that there must be some way to address them, i.e., they must have names.  The names will be used to associate specific dimension scales with specific dimensions of datasets.
 
 
Table 4. Some variations on names (case: description and comments).
  Default/implicit names:  E.g., based on the object name and dimension, or on the order of creation (a la netCDF).
  Reserved names:  Dimension scales must have certain names, e.g., must all be stored in a particular HDF5 Group, as is done in the HDF4 to HDF5 Mapping ([3]).
  Arbitrary names:  Any dataset can be a Dimension Scale.

Table 4 lists some ways that Dimension scale objects could be named. The naming scheme definitely interacts with the kinds of sharing that must be supported.  Any sharing outside the file introduces serious problems for how to name the dimension scale object.

Properties & Attributes of Dimensions

A Dimension Scale object will be stored as an HDF5 Attribute or Dataset.  Assuming the object is stored as a dataset, it will have some mandatory properties and attributes, and may have additional optional ones.
 
 
Table 5. Properties and attributes of a Dimension Scale (property or attribute: description and comments).
  Name:  See the discussion above.
  CLASS:  "DIMENSION_SCALE".  (Possibly more than one kind of dimension scale might be supported.)
  Data space:  See above.  Note that an explicitly stored dimension scale should be the correct size for the dataspace of the dataset using it; e.g., a scale for a dimension of size 10 should have 10 elements.
  Data type:  See the discussion above.
  Other attributes:  UNITS, SCALE_FACTOR, OFFSET.
  USED_BY:  List of the datasets using this dimension scale. (?)

Table 5 lists some of the properties and attributes of a Dimension Scale object.  In addition to the description of the dimension itself, the dimension scale might have attributes of its own, such as UNITS.  If several kinds of Dimension Scales are supported (explicitly stored, start plus offset, etc.), an attribute would indicate what kind of dimension it is.  Also, there could be attributes that indicate the datasets that use the dimension.
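
As an illustration of how the properties in Table 5 might look in practice, the sketch below marks a dataset as a dimension scale by writing CLASS and UNITS string attributes on it (HDF5 1.4-era C API).  Whether these exact attribute names are adopted is an open design question.

    /* Sketch: write the CLASS and UNITS attributes suggested in Table 5
     * onto a dimension scale dataset.  The attribute names and the UNITS
     * value are illustrative only. */
    #include <string.h>
    #include "hdf5.h"

    static void write_string_attr(hid_t dset, const char *name, const char *value)
    {
        hid_t space = H5Screate(H5S_SCALAR);
        hid_t type  = H5Tcopy(H5T_C_S1);
        hid_t attr;

        H5Tset_size(type, strlen(value) + 1);
        attr = H5Acreate(dset, name, type, space, H5P_DEFAULT);
        H5Awrite(attr, type, value);

        H5Aclose(attr);
        H5Tclose(type);
        H5Sclose(space);
    }

    void mark_as_dimension_scale(hid_t scale_dset)
    {
        write_string_attr(scale_dset, "CLASS", "DIMENSION_SCALE");
        write_string_attr(scale_dset, "UNITS", "degrees_east");  /* example units */
    }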

Miscellaneous

HDF5 Attributes and Array data types have data spaces with dimensions.  Shouldn't these dimensions be allowed to have Dimension Scales?  If so, how can this be implemented?  These cases are not encountered in other formats.
 

3.3. Programming model

Most users of netCDF and HDF4 neither know nor care how dimensions are stored. They are very concerned with the programming model and API, however.  When a user requests "Dimension Scales for HDF5", they usually want an API akin to what HDF4 does.  Of course, we would be wise not to slavishly copy older APIs without considering better ideas.
 
 
 
Table 6. Operations on Dimension Scales (required operation: description and comments).
  Create:  Create a Dimension Scale with the intended properties.
  Destroy:  Delete a Dimension Scale object.
  Attach to a dimension of a dataset:  Associate a Dimension Scale with a particular dimension of a dataset.
  Detach from a dimension of a dataset:  Remove the association between a Dimension Scale and a dimension of a dataset.
  Get the dimension scales for a dataset:  Retrieve the Dimension Scales (if any), in order.
  Iterate through all dimensions in a file (or other scope):  Find all Dimension Scales, in a canonical order (e.g., the order of creation).
  Change size:  Extend a Dimension Scale and all datasets using it.

Table 6 lists some operations that will likely be needed.  In addition to the basic operations to create and attach Dimension Scales, users will need iterators to list the Dimension Scales in a canonical order.
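
To make the table concrete, here is one possible set of C prototypes for these operations, in the spirit of the HDF5 'lite' layer.  None of these functions exist today; the names and signatures are purely illustrative.

    /* Hypothetical prototypes for the operations in Table 6.
     * None of these functions exist; names and signatures are illustrative. */
    #include "hdf5.h"

    /* Create a dimension scale dataset from explicit values. */
    hid_t  H5LTdim_create(hid_t loc, const char *name, hid_t type,
                          hsize_t size, const void *values);

    /* Destroy a dimension scale (and its associations). */
    herr_t H5LTdim_destroy(hid_t loc, const char *name);

    /* Attach or detach a scale to/from dimension 'idx' of a dataset. */
    herr_t H5LTdim_attach(hid_t dset, unsigned idx, hid_t scale);
    herr_t H5LTdim_detach(hid_t dset, unsigned idx);

    /* Retrieve the scale attached to dimension 'idx', if any. */
    hid_t  H5LTdim_get(hid_t dset, unsigned idx);

    /* Iterate over all dimension scales in a file, in creation order. */
    herr_t H5LTdim_iterate(hid_t file,
                           herr_t (*op)(hid_t scale, void *op_data),
                           void *op_data);

    /* Extend a scale and every dataset that uses it. */
    herr_t H5LTdim_set_size(hid_t scale, hsize_t new_size);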
 

4. Summary

This paper suggests the need for a comprehensive design for Dimension Scales in HDF5.  Some requirements are proposed, and design issues listed.

These ideas need to be considered carefully, and a set of features decided.  Then a specification and implementation can be done.

5. References


1. Nancy Yeager, "Implementation of the NetCDF-H5 prototype", August 20, 1999. http://hdf.ncsa.uiuc.edu/HDF5/papers/netcdfh5.html

2. Robert E. McGrath, "Experiment with XSL: translating scientific data", February 21, 2001. http://hdf.ncsa.uiuc.edu/HDF5/XML/nctoh5/writeup.htm

3. Mike Folk, Robert E. McGrath, Kent Yang, "Mapping HDF4 Objects to HDF5 Objects", revised October 2000. http://hdf.ncsa.uiuc.edu/HDF5/papers/h4toh5/