HDF4 implements Dimension Scales for Scientific Datasets.
This implementation includes a standard storage scheme for dimensions
as well as a programming model. HDF5 has no specification for Dimension
Scales, but one has been frequently requested. This feature is needed
both to bring forward HDF4 data and applications and to support new applications.
Dimension scales in HDF5 have been partly addressed in previous work. Two experiments have suggested possible approaches to dimension scales based on netCDF. The "NetCDF-H5 Prototype" explored a fairly complete implementation of netCDF on top of HDF5 [1]. This work proposed a storage scheme and software to implement netCDF's model of dimensions. A later study, "Experiment with XSL," converted files to HDF5 via XML and XSL style sheets [2]. This work used a storage layout for dimensions different from that of [1], and did not address any issues of programming model or compatibility.
The HDF4 to HDF5 Mapping is an official specification for a default representation of HDF4 objects in an HDF5 file, including a standard way to store the dimension scales of an HDF4 object in an HDF5 file ([3], section 3.1). That specification did not suggest a programming model, and did not consider requirements for using dimensions in newly created HDF5 files (i.e., files not converted from HDF4).
While previous efforts have aimed to simulate HDF4 or netCDF, the fact is that HDF5 is a clean slate. We can implement whatever concepts we want, with whatever programming model and interfaces we want.
Based on the previous work above, it seems certain that these features
can be implemented as part of the HDF5 'lite' API, with no changes to
the core HDF5 library. Other software may need to be updated, particularly
the h4toh5 converter and other tools.
I believe that a specification for Dimension Scales needs several pieces. The design should meet the following requirements:
A dimension scale is essentially a means of labeling the axes of a multidimensional
array. Clearly, there are many ways to do this, and many interpretations
of a given label. For HDF, the goal is to store an adequate representation
of the desired labels, along with an appropriate association between each
axis and its label(s), if any.
| Case | Description and Comments |
|------|--------------------------|
| No scale | No scale is needed, or the application has its own means of labeling the axis. |
| Start value plus increment | A linear scale, e.g., the dimension scale values for x are A + Bx, where A and B are stored. |
| Arbitrary generating function | A generalization of the above, where the dimension scale values for x are defined by some function f(x), including log, exp, etc. |
| Stored list | The scale value at each point on dimension x is stored explicitly in a 1D array. (Note that this can support types other than numbers.) |
| Partial and/or multi-list | Scale values for some points are stored explicitly, with each scale point potentially stored individually. Possibly more than one scale value per point on the dimension. |
Table 1 lists some possible conceptual variations for stored dimension scales. HDF4 supports 'No scale' and 'Stored list'. The 'Arbitrary generating function' could be difficult to implement unless the allowed functions are constrained (as in 'Start value plus increment'). Partial and multi-lists would not be especially difficult to store, but the programming model might be complex.
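The difference between a generating function and a stored list can be illustrated with a small sketch. This is plain Python, not any HDF5 API; the function names are hypothetical and exist only to make the Table 1 cases concrete: a 'Start value plus increment' scale is just f(x) = A + Bx evaluated on demand, while a 'Stored list' materializes the same values.

```python
# Illustrative sketch of two Table 1 cases; not an HDF5 API.

def linear_scale(start, increment):
    """'Start value plus increment': scale value at x is start + increment * x."""
    return lambda x: start + increment * x

def materialize(f, length):
    """Turn a generating function into a 'Stored list' of the given length."""
    return [f(x) for x in range(length)]

# A scale labeling a dimension of size 5, starting at 100.0 in steps of 2.5:
f = linear_scale(100.0, 2.5)
stored = materialize(f, 5)   # [100.0, 102.5, 105.0, 107.5, 110.0]
```

Note that the generating function stores only two numbers regardless of the dimension's size, which matters for large or unlimited dimensions.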
Whatever conceptual features are supported, the programming model should be as simple and clear as possible, with reasonable default behaviors. For instance, an application that does not use dimension scales should not need to call any API, and it should be possible to add or delete a dimension scale at most points in a dataset's life cycle. Thus, the selected set of features must be evaluated as a whole package.
Considering Dimension Scales as objects, two properties must be
considered. First, dimensions may potentially be shared,
and the sharing (and 'addressing') might have several scopes. Second,
a Dimension Scale may have any of many data types, not necessarily the data type
of the data in the array.
| Case | Description and Notes |
|------|-----------------------|
| No sharing | Dimension is local to the object and not visible to any other object. |
| Shared within a local scope, e.g., a group | Dimension is shared among a local set of objects. |
| Shared within a single file | Dimension is global to the file. |
| Shared, possibly in an external file | Dimension may be in another HDF5 file. |
| Shared, from any URI | Dimension scale may be stored on the internet, in any format (e.g., XML). |
Table 2 lists some of the ways that a dimension scale might be shared. Shared dimensions within a file are widely used in netCDF and HDF4, so that data in multiple datasets can be correctly related to each other. HDF5 would support local sharing (e.g., within a Group), and schemes for dimension scales external to the file can be imagined.
In any case, it is very important that the user be able to understand
and control the sharing of dimensions. Also, the implementation of the conceptual
cases in Table 1 needs to support sharing in a consistent fashion.
| Case | Description and Comments |
|------|--------------------------|
| Sequences of atomic numbers | Numerical axes. Might be constrained to the range of legal generating functions. |
| Any type, including string and compound | Any user-defined values, including strings. |
Table 3 gives two general alternatives for the data type of a dimension scale. If the scale is stored explicitly, then HDF5 can easily support a dimension with any data type supported by HDF5. If a generating function is used, then the function will define the data type of its range. In this case, the dimension scale might realistically be considered to have the type 'FUNCTION', which would be a new feature for HDF5!
3.2. Storage model for a Dimension Scale Object
This section assumes that a Dimension Scale object is defined, which implements some or all of the variations described above. How should these objects be stored in HDF5, and how should they be associated with the dimensions they label?
Storage
HDF5 offers two primary choices for storing a dimension scale for an object: as an Attribute of the object or as a separate Dataset pointed to by an Attribute of the object. Ideally, a dimension scale would be an attribute of a data space, but HDF5 does not support this. Since the data space is always tightly bound to a dataset, there is no problem attaching the dimensions to the dataset.
For a dimension scale that is stored as explicit values, storing the scale as an HDF5 attribute has several drawbacks: dimension scales may be large, may be 'unlimited' (so they can grow), and need to be shared. HDF5 attributes cannot support these features. Therefore, a dimension scale object will almost certainly be stored as one or more HDF5 objects, referenced by an attribute.
For the other possibilities, the storage depends on the feature. A description of a generating function might be stored as an attribute, either as a string or as a compound datatype, or as a simple shared dataset. The sharing and growth properties of such a scale are less clear, but are probably the same as for a stored array. If multiple dimension scales (or piecewise functions) are supported, they could be grouped in an HDF5 Group.
It might be noted that, to replicate netCDF semantics, a shared, unlimited dimension has a single current size throughout the file. If one dataset using that dimension extends it, all other datasets using the dimension must grow by the same amount. For this reason, and to support some aspects of the programming model, there will likely be some sort of stored table or index to track all the dimensions and their associations. (See Yeager's prototype for an example of the required data structures [1].)
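The "single current size" rule for a shared, unlimited dimension can be modeled with a toy index structure. This is pure Python with hypothetical names, not a proposal for the actual storage layout (which, as discussed above, would use HDF5 objects, attributes, and a stored table or index); it only illustrates the propagation semantics.

```python
# Toy model of a shared, unlimited dimension: extending it through any
# dataset must grow every dataset that uses it (netCDF-style semantics).
# All class and function names here are hypothetical.

class SharedDimension:
    def __init__(self, name, size):
        self.name, self.size = name, size
        self.used_by = []          # datasets whose shape depends on this dim

class Dataset:
    def __init__(self, name, dims):
        self.name, self.dims = name, dims
        for d in dims:
            d.used_by.append(self)
    @property
    def shape(self):
        # Shape is derived from the current sizes of the shared dimensions,
        # so one extension is seen by every dataset using the dimension.
        return tuple(d.size for d in self.dims)

def extend(dim, new_size):
    """Extend a shared dimension; unlimited dimensions only grow."""
    if new_size < dim.size:
        raise ValueError("unlimited dimensions only grow")
    dim.size = new_size

time = SharedDimension("time", 10)
a = Dataset("temperature", [time])
b = Dataset("pressure", [time])
extend(time, 12)               # both datasets now have shape (12,)
```

The `used_by` list plays the role of the stored table or index mentioned above: without it, an implementation could not find the other datasets that must grow.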
Names
It is almost certain that dimension scales will be shared objects.
This means there must be some way to address them, i.e., they must
have names. The names will be used to associate specific dimension
scales with specific dimensions of datasets.
| Case | Description and Comments |
|------|--------------------------|
| Default/implicit names | E.g., based on the object name and dimension, or on the order of creation (a la netCDF). |
| Reserved names | Dimension scales must have certain names, e.g., must all be stored in a particular HDF5 Group, as is done in the HDF4 to HDF5 Mapping ([3]). |
| Arbitrary names | Any dataset can be a Dimension Scale. |
Table 4 lists some ways that Dimension scale objects could be named. The naming scheme definitely interacts with the kinds of sharing that must be supported. Any sharing outside the file introduces serious problems for how to name the dimension scale object.
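The first two rows of Table 4 can be made concrete with a tiny sketch. Both conventions below are hypothetical illustrations, not names taken from any specification: one derives an implicit name from the dataset path and dimension index, the other places every scale under a single reserved group (in the spirit of the HDF4 to HDF5 Mapping, though the group name used here is invented).

```python
# Hypothetical naming conventions for dimension scale objects.
# Neither convention is prescribed by HDF5; they only illustrate Table 4.

def implicit_name(dataset_path, dim_index):
    """Default name derived from the dataset path and the dimension index."""
    return "%s_dim%d" % (dataset_path, dim_index)

def reserved_name(scale_name, reserved_group="/DIMENSION_SCALES"):
    """All scales live under one reserved group (the group name is invented)."""
    return "%s/%s" % (reserved_group, scale_name)

implicit_name("/data/temperature", 0)   # "/data/temperature_dim0"
reserved_name("time")                   # "/DIMENSION_SCALES/time"
```

Implicit names require no user bookkeeping but make sharing across datasets awkward; a reserved group makes all scales easy to enumerate but constrains the file layout.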
Properties & Attributes of Dimensions
A Dimension Scale object will be stored as an HDF5 Attribute or Dataset.
Assuming the object is stored as a dataset, it will have some mandatory
properties and attributes, and may have additional optional ones.
| Property or attribute | Description and Comments |
|-----------------------|--------------------------|
| Name | See the discussion above. |
| CLASS | "DIMENSION_SCALE" (possibly more than one kind of dimension scale might be supported). |
| Data space | See above. Note that an explicitly stored dimension scale should be the "correct size" for the dataspace of the dataset using it; e.g., a scale for a dimension of size 10 should have 10 elements. |
| Data type | See the discussion above. |
| Other attributes | UNITS, SCALE_FACTOR, OFFSET. |
| USED_BY | List of datasets using this dimension scale. (?) |
Table 5 lists some of the properties and attributes of a Dimension Scale object. In addition to the description of the dimension itself, the dimension scale might have attributes of its own, such as UNITS. If several kinds of Dimension Scales are supported (explicitly stored, start plus increment, etc.), an attribute would indicate what kind of dimension scale it is. Also, there could be attributes that indicate the datasets that use the dimension.
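One consistency rule implied by Table 5, that an explicitly stored scale must have as many elements as the dimension it labels, could be checked along the following lines. The attribute names (CLASS, UNITS, USED_BY) follow Table 5, but the functions and the dictionary representation are purely illustrative, not an HDF5 API.

```python
# Sketch of the Table 5 size-consistency check; not an HDF5 API.
# A scale is modeled as a dict whose keys mirror the Table 5 attributes.

def make_scale(values, units=None):
    return {"CLASS": "DIMENSION_SCALE", "values": list(values),
            "UNITS": units, "USED_BY": []}

def attach(scale, dataset_shape, dim_index):
    """Attach a scale to one dimension of a dataset, enforcing the size rule."""
    if scale["CLASS"] != "DIMENSION_SCALE":
        raise TypeError("not a dimension scale object")
    if len(scale["values"]) != dataset_shape[dim_index]:
        raise ValueError("scale size %d != dimension size %d"
                         % (len(scale["values"]), dataset_shape[dim_index]))
    scale["USED_BY"].append(dim_index)

# A 10-element latitude scale attached to dimension 0 of a (10, 20) dataset:
lat = make_scale([-90 + 18 * i for i in range(10)], units="degrees_north")
attach(lat, (10, 20), 0)
```

An unlimited dimension complicates this rule: either the scale must grow with the dimension, or the check must be relaxed for scales defined by a generating function.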
Miscellaneous
HDF5 Attributes and Array data types have data spaces with dimensions.
Shouldn't these dimensions be allowed to have Dimension Scales? If
so, how can this be implemented? These cases are not encountered
in other formats.
| Required operation | Description and Comments |
|--------------------|--------------------------|
| create | Create a Dimension Scale with the intended properties. |
| destroy | Delete a Dimension Scale. |
| attach to dimension of a dataset | Associate a Dimension Scale with a particular dimension of a dataset. |
| detach from dimension of a dataset | Remove the association between a Dimension Scale and a dimension of a dataset. |
| get dimension scales for dataset | Retrieve the Dimension Scales (if any), in order. |
| iterate through all dimension scales in file (or other scope) | Find all Dimension Scales, in a canonical order (e.g., the order of creation). |
| change size | Extend a Dimension Scale and all datasets using it. |
Table 6 lists some operations that will likely be needed. In addition
to the basic operations to create and attach Dimension Scales, users will
need iterators to list the Dimension Scales in a canonical order.
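One possible shape for the operations in Table 6 is sketched below in Python rather than C, for brevity. Every name is hypothetical, and the eventual HDF5 'lite' interface could differ in every detail; the point is that a single file-wide registry (the stored table or index discussed earlier) suffices to support attach, detach, ordered retrieval, and canonical iteration.

```python
# Hypothetical programming model for Table 6; not an HDF5 API.
# A file-wide registry tracks scales and their attachments.

class ScaleRegistry:
    def __init__(self):
        self.scales = {}            # name -> scale values, in creation order
        self.attachments = {}       # (dataset, dim_index) -> scale name

    def create(self, name, values):
        self.scales[name] = list(values)

    def attach(self, name, dataset, dim_index):
        self.attachments[(dataset, dim_index)] = name

    def detach(self, dataset, dim_index):
        self.attachments.pop((dataset, dim_index), None)

    def get_scales(self, dataset, rank):
        """Scale names for each dimension of a dataset, in dimension order."""
        return [self.attachments.get((dataset, i)) for i in range(rank)]

    def iterate(self):
        """All scale names in a canonical (creation) order."""
        return list(self.scales)

reg = ScaleRegistry()
reg.create("time", [0, 1, 2])
reg.create("lat", [-90, 0, 90])
reg.attach("time", "temperature", 0)
reg.attach("lat", "temperature", 1)
reg.get_scales("temperature", 2)    # ["time", "lat"]
```

Note the default behaviors: an unlabeled dimension simply reports no scale, so a dataset that never uses dimensions never touches the registry.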
These ideas need to be considered carefully, and a set of features decided upon. Then a specification and implementation can be done.
1. Nancy Yeager, "Implementation of the NetCDF-H5 Prototype", August 20, 1999. http://hdf.ncsa.uiuc.edu/HDF5/papers/netcdfh5.html
2. Robert E. McGrath, "Experiment with XSL: Translating Scientific Data", February 21, 2001. http://hdf.ncsa.uiuc.edu/HDF5/XML/nctoh5/writeup.htm
3. Mike Folk, Robert E. McGrath, Kent Yang, "Mapping HDF4 Objects to HDF5 Objects", revised October 2000. http://hdf.ncsa.uiuc.edu/HDF5/papers/h4toh5/