Needed: A Convenience API to Support Dimensions in HDF5

Robert E. McGrath
16 July, 2001

1. Introduction and Rationale

Dimension names and scales are a key feature of HDF4, and users have requested them for HDF5 as well.
These are especially important for users who are moving from HDF4 to HDF5, not least because the h4toh5 converter stores dimension names and dimension scales in conventional attributes and datasets in the HDF5 file. However, there is no API to retrieve or manipulate these HDF5 objects as dimension names and scales. Thus, our software creates objects that users have no convenient or standard way to use.  The neutron scattering community and some NASA users have reported this as a problem.

In addition to converting HDF4 data to HDF5, many users want to bring their programming model forward to HDF5, including the use of dimension names and scales. These users would like something at least similar to HDF4 dimensions. Another reason to provide dimension scale support is that software packages such as VisAD, DODS, Matlab, and IDL work best when dimension information is available.  Without dimension names and scales, these packages cannot use some of their most powerful features.

Dimension scales in HDF5 have been partly addressed in previous work.  Two experiments have suggested possible approaches to dimension scales based on netCDF.  The "NetCDF-H5 Prototype" explored a fairly complete implementation of netCDF on top of HDF5 [1]. This work proposed a storage scheme and software to implement netCDF's model of dimensions.  A later study, "Experiment with XSL," converted netCDF files to HDF5 files via XML and XSL style sheets [2]. This latter experiment used a storage layout for dimensions different from that of [1], and did not address any issues of programming model or compatibility.

The HDF4 to HDF5 Mapping is an official specification for a default representation of HDF4 objects in an HDF5 file. It specifies how dimension names and scales from an HDF4 object are stored in an HDF5 file: dimension scales are stored as one-dimensional datasets, and the names and scales are associated with a dataset through conventional attributes. These attributes hold a list of strings for the names and a list of object references that point to the dimension-scale datasets ([3], section 3.1). This specification has been implemented by the h4toh5 utility and library, and it is already in use by important users.
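To make the storage pattern concrete, the sketch below shows how the object-reference part of such a convention could be written with the standard HDF5 C library (1.4-era signatures).  The attribute name "DIM_REFS" and the fixed-size buffer are placeholders for illustration, not the exact convention defined in [3].

    #include "hdf5.h"

    /* Sketch: attach to 'dataset' an attribute holding one object
     * reference per dimension, each pointing at a 1D dimension-scale
     * dataset identified by its path in 'file'. */
    static void point_to_scales(hid_t file, hid_t dataset,
                                const char *scale_paths[], int ndims)
    {
        hsize_t    n = (hsize_t) ndims;
        hobj_ref_t refs[32];             /* assume ndims <= 32 in this sketch */
        hid_t      space, attr;
        int        i;

        for (i = 0; i < ndims; i++)
            H5Rcreate(&refs[i], file, scale_paths[i], H5R_OBJECT, -1);

        space = H5Screate_simple(1, &n, NULL);
        attr  = H5Acreate(dataset, "DIM_REFS", H5T_STD_REF_OBJ, space,
                          H5P_DEFAULT);
        H5Awrite(attr, H5T_STD_REF_OBJ, refs);

        H5Aclose(attr);
        H5Sclose(space);
    }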

Our brainstorming sessions floated a number of approaches to dimension scales, such as a facility to define 'generating functions' for dimension scales.  These ideas seem to blend into the experimental 'transformation and units' activities.  These approaches may be interesting in the long run, but it appears they require changes to the HDF5 library and/or format (which likely could not happen until 2002 at the earliest). Also, there hasn't been any consideration of how to support the dimension scales already being created by the h4toh5 utility.

It is important that we provide our users with some basic support for dimension scales in HDF5 as soon as possible.  This support should be compatible with the h4toh5 utility that is already in the hands of users.  Based on the earlier work described above, it seems likely that these features can be initially implemented as part of the HDF5 'convenience' suite, with no changes to the core HDF5 library.  This could be done immediately, and could be in the hands of users this year.

2. Suggested Product

The main requirements are:
  1. Early release
  2. Compatibility with files produced by h4toh5
  3. API functions similar to the features of HDF4
    1. ability to attach a name to a dimension of a dataset, and to retrieve a list of the names of dimensions
    2. ability to attach a one-dimensional array of values to a dimension of a dataset, and to retrieve the scales for the dimensions of a dataset.

I would suggest that this be implemented immediately as part of the convenience library.  The basic functions should be something like:
 
 
Sketch of a minimal API for dimension scales:

  hid_t H5Ccreate_dimscale(hsize_t size, char *name)
      Create a 1D dataset, marked as a dimension scale, with the name 'name'.

  H5Cset_dim(int dimindex, hid_t dataset, hid_t dimscale)
      Attach dimension scale 'dimscale' to 'dataset', associated with dimension number 'dimindex'.

  H5Cset_name(int dimindex, hid_t dataset, char *dname)
      Attach dimension name 'dname' to 'dataset', associated with dimension number 'dimindex'.

  hid_t[] H5Cget_dim_scales(hid_t dataset)
      Get a list of the dimension scales, in the order of the dimensions of the dataset.  Some convention is needed to represent dimensions with no scale defined.

  char * H5Cget_dim_name(hid_t dimindex)
      Get the name of dimension 'dimindex'.
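For illustration only, a program using the proposed functions might look like the sketch below.  The H5C* calls do not exist yet, the file and dataset names are invented, and the standard HDF5 calls assume the current (1.4-era) signatures.

    #include "hdf5.h"

    int main(void)
    {
        hid_t file, dataset, time_scale;
        hid_t *scales;

        file    = H5Fopen("example.h5", H5F_ACC_RDWR, H5P_DEFAULT);
        dataset = H5Dopen(file, "/temperature");  /* say, a 2D time x station array */

        /* Create a scale for dimension 0 and attach it, along with names
         * for both dimensions. */
        time_scale = H5Ccreate_dimscale(100, "time");
        H5Cset_dim(0, dataset, time_scale);
        H5Cset_name(0, dataset, "time");
        H5Cset_name(1, dataset, "station");

        /* Later, retrieve the scales in dimension order; the exact return
         * convention (array vs. caller-supplied buffer) is still open. */
        scales = H5Cget_dim_scales(dataset);

        H5Dclose(dataset);
        H5Fclose(file);
        return 0;
    }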

Note that these functions are intended to work with the datasets created by the h4toh5 utility.  We may define a more general storage model later, but it is important that this API handle HDF4_DIMSCALES and the storage conventions of the current HDF4 to HDF5 mapping.

Dimension scales can also have attributes, and we may want to define standard attributes beyond the name, e.g., offset, scale, units, and format.  If these are defined, we can provide get/set methods for them.
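As a hedged example, a 'units' string could be attached to a dimension-scale dataset with the standard attribute calls shown below; the attribute name and the get/set convention are only suggestions, since no standard has been defined yet.

    #include <string.h>
    #include "hdf5.h"

    /* Sketch: write a 'units' attribute on a dimension-scale dataset
     * (1.4-era signatures assumed). */
    static void set_scale_units(hid_t dimscale, const char *units)
    {
        hid_t strtype = H5Tcopy(H5T_C_S1);
        hid_t space, attr;

        H5Tset_size(strtype, strlen(units) + 1);
        space = H5Screate(H5S_SCALAR);
        attr  = H5Acreate(dimscale, "units", strtype, space, H5P_DEFAULT);
        H5Awrite(attr, strtype, units);

        H5Aclose(attr);
        H5Sclose(space);
        H5Tclose(strtype);
    }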

3. Features that are difficult to support without library or format changes

There are some simple and obvious features that will be difficult to support with a convenience API on top of the current storage model.

Global order of dimensions, a la netCDF

Many users are accustomed to the netCDF concept of dimensions that are global to the file and that can be manipulated as a set.  For instance, dimensions can be retrieved in order of creation, and each dimension has a global index.

This feature could be supported using an approach similar to the HDF5 netCDF prototype [1].  If something like this is adopted, we will need to update the HDF4 to HDF5 mapping to use it.
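One possible layout, sketched purely for illustration (and not necessarily what [1] did), is to keep all dimension-scale datasets in a single group and record each dimension's creation index in a small attribute, so a global ordered list can be rebuilt by reading those indices.  The group name '/dims', the attribute name 'dim_index', and the use of double-valued scales are all invented for this sketch.

    #include <stdio.h>
    #include "hdf5.h"

    /* Sketch: create a globally registered dimension-scale dataset.
     * Assumes the group "/dims" has already been created with H5Gcreate. */
    static hid_t create_global_dim(hid_t file, const char *name,
                                   hsize_t size, int index)
    {
        char  path[256];
        hid_t space, dset, ispace, attr;

        sprintf(path, "/dims/%s", name);
        space = H5Screate_simple(1, &size, NULL);
        dset  = H5Dcreate(file, path, H5T_NATIVE_DOUBLE, space, H5P_DEFAULT);

        /* Record the creation index so the global order can be recovered. */
        ispace = H5Screate(H5S_SCALAR);
        attr   = H5Acreate(dset, "dim_index", H5T_NATIVE_INT, ispace,
                           H5P_DEFAULT);
        H5Awrite(attr, H5T_NATIVE_INT, &index);

        H5Aclose(attr);
        H5Sclose(ispace);
        H5Sclose(space);
        return dset;
    }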

Management of Shared Dimensions

The h4toh5 conventions can adequately represent shared dimensions.  However, they currently have no way to handle 'shared names'.  In addition, since the association of dimension names and scales is an attribute of each dataset, when the API deletes a dimension it will need some way to delete the references to that dimension in every dataset that might be using it. This could be done by adding a table of which datasets are using which dimensions, but such a table has not been specified.
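One unspecified possibility is for the convenience library to keep a simple bookkeeping table such as the sketch below, so that deleting a dimension can also remove the now-stale references; the structure and its fixed sizes are purely illustrative.

    #include "hdf5.h"

    /* Sketch: for each dimension-scale dataset, remember which datasets
     * reference it, so their attributes can be updated when the
     * dimension is deleted. */
    typedef struct dim_use_entry {
        hid_t dimscale;     /* the dimension-scale dataset            */
        hid_t users[64];    /* datasets whose attributes point at it  */
        int   nusers;       /* number of entries in 'users'           */
    } dim_use_entry;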

Unlimited Dimensions

While both the dataspace and the dimension scale dataset can be UNLIMITED, i.e., expandable, there is no way to keep them coordinated without library support. That is, if the dimension is extended, there is no way to automatically extend the dimension scale dataset that is assigned to it.
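The sketch below shows the manual coordination a user would have to do today, assuming the dataset and its scale were both created with an UNLIMITED (expandable) first dimension; nothing in the library extends the scale automatically.

    #include "hdf5.h"

    /* Sketch: extend dimension 0 of a dataset and, separately, the 1D
     * dimension-scale dataset attached to it (1.4-era H5Dextend). */
    static void extend_dim0(hid_t dataset, hid_t dimscale, hsize_t new_size)
    {
        hsize_t dims[H5S_MAX_RANK];
        hsize_t maxdims[H5S_MAX_RANK];
        hid_t   space = H5Dget_space(dataset);

        H5Sget_simple_extent_dims(space, dims, maxdims);
        dims[0] = new_size;

        H5Dextend(dataset, dims);        /* extend the data dataset ...     */
        H5Dextend(dimscale, &new_size);  /* ... and its scale, by hand      */

        H5Sclose(space);
    }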

Furthermore, users may expect the magical effect that expansion of a dimension expands the dataspaces of any dataset using that dimension.  This is extremely difficult to provide without library support.

4. Summary

I strongly recommend we give a high priority to providing a basic API to manage dimension names and scales, a la HDF4, in HDF5.  This can be done as part of the 'convenience' library work.

Our users need this now, so we should do what we can as soon as possible. We cannot do everything we might want, so we will simply have to do our best and document what can't be done.

References

1. Nancy Yeager, "Implementation of the NetCDF-H5 prototype", August 20, 1999. http://hdf.ncsa.uiuc.edu/HDF5/papers/netcdfh5.html

2. Robert E. McGrath, "Experiment with XSL: translating scientific data", February 21, 2001. http://hdf.ncsa.uiuc.edu/HDF5/XML/nctoh5/writeup.htm

3. Mike Folk, Robert E. McGrath, and Kent Yang, "Mapping HDF4 Objects to HDF5 Objects", revised October 2000. http://hdf.ncsa.uiuc.edu/HDF5/papers/h4toh5/