HDF5 Dimension Scale Proposal No. 3
Version 2
Mike Folk
This proposal is an attempt to combine the earlier dimension scale proposals [1] and [3]. It also draws on discussions from [2] and [4].
This is not an easy task, largely because the two proposals have different emphases and sometimes conflicting requirements. For example,
· The first proposal stresses the need for compatibility with HDF4 and netCDF. The second covers a much wider range of applications, including support for different coordinate systems.
· The first proposal recommends that no changes be required to the HDF5 storage model or library, and that the design should be similar to what the HDF4 to HDF5 mapping specifies. The second recommends significant change to the HDF5 storage model; it does not address library changes.
In this third proposal, we try to compromise between the various views, including the expression of many in the HDF group that the basic mechanisms for handling dimension scales should be in the format and library, and our own view that the meaning of dimension scales should be left up to applications almost entirely, and that the relationships between dimension scales and datasets should be loosely specified.
Here is a summary of the recommendations in the new proposal:
1. Conceptual model
a. Do not include coordinate systems at this time.
b. Support the following types of dimension scale: no scale, array of any dimension, and simple functions at least of the form A + Bx.
c. Place no restrictions on the datatypes of scale values.
d. Place no restrictions on the number of scales that can be assigned to a dimension.
e. Store dimension scales as HDF5 datasets.
f. In the proposed model, support the following structures and relationships:
i. A dimension scale is an object that is assigned to a dimension of a dataset.
ii. A dimension scale can have at most one primary name.
iii. A dimension scale may be assigned to more than one dimension in more than one dataset.
iv. Every dimension of a dataset may have one or more dimension scales assigned to it. If more than one scale is assigned to a dimension, each scale is identified by an index value.
v. A dimension scale has all of the properties of an HDF5 dataset.
vi. There are no restrictions on the size, shape, or datatype of a dimension scale.
g. Make the following new functions available for dealing with dimension scales:
i. Convert dataset to scale (D) – convert dataset D to a dimension scale.
ii. Attach scale (D, S, i) – attach dimension scale S to the ith dimension of D.
iii. Detach scale (D, i, j) – detach the jth scale from the ith dimension of D.
iv. Get number of scales (D, i) – get the number of scales assigned to the ith dimension of D.
v. Get OID of scale (D, i, j) – get the OID of the jth scale assigned to the ith dimension of dataset D.
vi. Get info (S) – get information about dimension scale S (existence, size, name, etc.).
vii. The operations available for datasets are also available for dimension scales.
h. Do not automatically extend dimension scales when dataset dimensions are extended.
i. Do not automatically delete dimension scales in any circumstances involving dataset deletion.
2. Dimension scale datasets
a. Make read/write behavior the same for dimension scales as it is for normal datasets.
b. Make all dimension scales public.
c. Use a CLASS attribute to specify that a dataset is to be interpreted as a dimension scale.
d. Use an attribute, rather than a header message, for storing the CLASS information.
e. We tentatively recommend a header message for storing a dimension scale name, but would like to hear other opinions.
f. We recommend against requiring that dimension scale names be unique within a file.
3. Information connecting datasets with dimension scales
a. Store a reference to a dimension scale in each dataset that uses it.
b. Allow dimension scale names to be stored optionally, but consider not enforcing consistency between the name stored in the dataset and the name in the scale itself.
4. Shared dimensions
a. Do not maintain information about shared dimensions in the library.
5. Expanding raw data options
a. Extend the dataset model to allow a new storage option whereby datasets can be represented by a function, or formula.
b. Study the possibility of allowing formulas to be used for attributes.
Proposals [1] and [3] differ in their conceptual models. The following proposal includes (and excludes) features of both.
Our study of dimension scale use cases has revealed an enormous variety of ways that dimension scales can be used. We recognize the importance of having a model that will be easy to understand and use for the vast majority of applications. It is our sense that those applications will need either no scale, a single 1-D array of floats or integers, or a simple function that provides a scale and offset.
At the same time, we want to place as few restrictions as possible on other uses of dimension scales. For instance, we don’t want to require dimension scales to be 1-D arrays, or to allow only one scale per dimension.
So our goal is to provide a model that serves the needs of two communities. We want to keep the dimension scale model conceptually simple for the majority of applications, but also to place as few restrictions as possible on how dimension scales are interpreted and used. With this approach, it becomes the responsibility of applications to make sure that dimension scales satisfy the constraints of the model that they are assuming, such as constraints on the size of dimension scales and the valid range of indices.
Perhaps the biggest difference between [1] and [3] is the requirement in [3] that coordinate system support be available. We argued in [4] that it was premature to support coordinate systems in HDF5 at this time, and it was our sense that the reviewers agreed. We have also since discovered that the coordinate system design proposed in [3] is not the same as the dimension space defined in a netCDF file, and so would not seem to be usable in the netCDF implementation.[1] For these reasons, it seems premature to support coordinate systems at this time.
Recommendation: do not include coordinate systems at this time.
There seems to be good agreement that the model should accommodate scales that consist of a stored 1-D list of values, certain simple functions, and “no scale.” Higher dimensional arrays are more problematic, but we recommend them as well:

No scale. Frequently no scale is needed, so none should be required. In some of these cases, an axis label may still be needed, and should be available.

1-D array. Both fixed length and extendable arrays should be available. We recommend that HDF5 not require the size of a scale to conform to the size of the corresponding dimension, so that the number of scale values could be less than, equal to, or greater than the corresponding dimension size.

Simple function. At a minimum, a linear scale of the form A + Bx should be available. Beyond this, the initial scope is TBD, but it should probably be restricted to linear functions.

Higher dimensional arrays. Proposal [3] makes a good case for including arrays with dimension greater than 1, and we recommend these. The recommendations for 1-D arrays as to size and extendibility would seem to apply here as well.

Recommendation: support the following types of dimension scale: no scale, array of any dimension, and simple functions at least of the form A + Bx.
There is also the issue of whether there should be restrictions on the datatypes of scale values. It seems reasonable not to restrict applications in this regard.
Recommendation: place no restrictions on the datatypes of scale values.
A number of use cases have been proposed in which more than one scale is needed for a given dimension. If this can be done without overly complicating the model, it seems to be a valuable feature, and hence is recommended.
Recommendation: place no restrictions on the number of scales that can be assigned to a dimension.
It is proposed that dimension scales be stored as datasets. This approach would seem to satisfy the data model described above, and has the advantages of simplifying the API and concentrating much of the code on one common structure rather than two. Details of the special characteristics and operations for dimension scale datasets are covered in the next several sections.
Recommendation: store dimension scales as HDF5 datasets.
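For illustration, here is a minimal C sketch of creating such a dimension scale as an ordinary HDF5 dataset, assuming 1.8-style HDF5 C API signatures; the dataset name "/time_scale" and the scale values are placeholders:

#include "hdf5.h"

/* Minimal sketch: a dimension scale stored as an ordinary 1-D HDF5
   dataset.  What marks it as a scale is discussed in later sections. */
int create_scale_dataset(hid_t file_id)
{
    hsize_t dims[1] = {100};
    double  values[100];

    for (int i = 0; i < 100; i++)
        values[i] = 10.0 + 0.5 * i;   /* e.g. a linear scale A + B*x */

    hid_t  space_id = H5Screate_simple(1, dims, NULL);
    hid_t  dset_id  = H5Dcreate(file_id, "/time_scale", H5T_NATIVE_DOUBLE,
                                space_id, H5P_DEFAULT, H5P_DEFAULT,
                                H5P_DEFAULT);
    herr_t status   = H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                               H5P_DEFAULT, values);

    H5Dclose(dset_id);
    H5Sclose(space_id);
    return (status < 0) ? -1 : 0;
}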
Recommendation: in the proposed model, support the following structures and relationships:
· A dimension scale is an object that is assigned to a dimension of a dataset.
· A dimension scale can have at most one primary name.
· A dimension scale may be assigned to more than one dimension in more than one dataset.
· Every dimension of a dataset may have one or more dimension scales assigned to it. If more than one scale is assigned to a dimension, each scale is identified by an index value.
· A dimension scale has all of the properties of an HDF5 dataset.
· There are no restrictions on the size, shape, or datatype of a dimension scale.
Comments:
· On the recommendation to restrict dimension scales to having one primary name. netCDF requires this, and it seems to be a fairly common assumption of applications. If an application needs to assign more than one name to a dimension scale, it can use HDF5 attributes to do so, but this concept will not be dealt with by the API. See section 3.4 for further discussion of this issue.
· Allowing two or more scales to be assigned to a dimension. The index value of a given dimension scale will not be persistent when deletions and additions are performed. The exact behavior in this case needs to be determined.
· On allowing scales that are not 1-D. No distinction will be made in the API between the simple case (one 1-D array per dimension) and the general case (any number of arrays of any dimension) described above. It will be left to the application to manage this.
· Assigning different dimension scale combinations, depending on coordinate system (examples 7 and 10 (maybe) in [3]). Because we don’t include coordinate systems in the proposed model, the model does not provide all of the information needed in examples 7 and 10 in [3]. In these cases, the coordinate system in [3] specifies which of several scales go with each dimension. The model proposed here would require the application to keep track of this information.
· Allowing attributes and array datatypes to have dimension scales. We know of no request for this feature, which we believe would complicate the model for users, so we do not at this time feel that attributes and array datatypes should support dimension scales.
Recommendation: make the following new functions available for dealing with dimension scales (a hypothetical C sketch of these operations follows the list):
· Convert dataset to scale (D) – convert dataset D to a dimension scale.
· Attach scale (D, S, i) – attach dimension scale S to the ith dimension of D.
· Detach scale (D, i, j) – detach the jth dimension scale from the ith dimension of D.
· Get number of scales (D, i) – get the number of scales assigned to the ith dimension of D.
· Get OID of scale (D, i, j) – get the OID of the jth scale assigned to the ith dimension of dataset D.
· Get info (S) – get information about dimension scale S (existence, size, name, etc.).
· The operations available for datasets are also available for dimension scales.
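To make the proposed operations concrete, the following hypothetical C prototypes sketch what such an API might look like. All names, signatures, and the H5DS_scale_info_t struct are illustrative only; this is not an existing HDF5 API:

#include "hdf5.h"

/* Hypothetical info struct returned by the "get info" operation. */
typedef struct {
    hsize_t size;          /* number of scale values             */
    char    name[256];     /* primary name, if any               */
} H5DS_scale_info_t;

herr_t H5DSconvert_to_scale(hid_t did);               /* dataset -> scale  */
herr_t H5DSattach_scale(hid_t did, hid_t sid,
                        unsigned i);                  /* attach S to dim i */
herr_t H5DSdetach_scale(hid_t did, unsigned i,
                        unsigned j);                  /* detach jth scale  */
int    H5DSget_num_scales(hid_t did, unsigned i);     /* scales on dim i   */
herr_t H5DSget_scale_oid(hid_t did, unsigned i,
                         unsigned j, hobj_ref_t *oid);/* OID of jth scale  */
herr_t H5DSget_scale_info(hid_t sid,
                          H5DS_scale_info_t *info);   /* size, name, etc.  */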
In [1] there are three additional operations listed: destroy scale, change size of scale, and iterate through scales. The destroy and change-size operations can be done with existing HDF5 dataset operations; on the other hand, it could be argued that dimension scale-specific versions would be more natural for users. We are not recommending them, but are open to arguments to the contrary.
The iterate operation becomes difficult with the implementation that we are proposing, and seems to us to be best left to the applications. We are, of course, open to opposing views as to whether these operations should be included.
(We have not yet developed a programming model for operating on dimension scales, nor have we tested these functions by showing how they would apply in various use cases.)
Operations not recommended. Because dimension scales add meaning to datasets, it is reasonable to look for ways to maintain the proper relationships between datasets and their corresponding dimension scales. Two operations that might be desired are (1) automatically extending dimension scales when dataset dimensions are extended, and (2) automatically deleting dimension scales. We recommend against supporting these operations in the library, instead letting applications enforce them according to their needs. A discussion of each follows.
Automatically extending dataset dimensions. When a dimension of a dataset is extended, should the library automatically extend the corresponding dimension scale, or should this be left to the application? Since a dimension scale can be shared among many datasets, this raises a number of issues that are difficult to address in a general way. For instance, which dimension scale should be extended when only one dataset is extended, and what values are to be added? We have seen no compelling reason to implement an automatic extension of dimension scales when dataset dimensions are extended, so we suggest letting applications be responsible for this operation.
Recommendation: do not automatically extend dimension scales when dataset dimensions are extended.
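Under this recommendation, an application that extends a dataset would extend the corresponding scale itself. A minimal C sketch, assuming a 2-D dataset with a 1-D scale on its first dimension, and assuming both were created extendible (chunked, with unlimited maximum dimensions):

#include "hdf5.h"

/* Sketch: the application, not the library, keeps a dataset and its
   scale in sync when the dataset grows. */
herr_t extend_with_scale(hid_t dset_id, hid_t scale_id,
                         hsize_t new_len, hsize_t other_dim)
{
    hsize_t dset_dims[2]  = {new_len, other_dim};
    hsize_t scale_dims[1] = {new_len};

    if (H5Dset_extent(dset_id, dset_dims) < 0)
        return -1;
    /* The application must also decide which values to write into the
       newly added region of the scale. */
    return H5Dset_extent(scale_id, scale_dims);
}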
Automatically deleting dimension scales. Should a dimension scale be deleted when all datasets that use it have been deleted? This is another case where different applications might have different requirements, so a general policy would be difficult to devise. Furthermore, enforcing a deletion policy, even a simple one, adds complexity to the library and could also affect performance. Deletion policies seem best left to applications.
Recommendation: do not automatically delete dimension scales in any circumstances involving dataset deletion.
In this section we recommend characteristics of datasets that are to be treated as dimension scales.
Should the read/write behavior for dimension scales be different from that of other datasets? There may be some reasons for this, such as how fill values are treated and how dataset extension is dealt with, but we are not convinced that the benefits of such special behavior would offset the advantages of keeping the behavior the same for dimension scales and for datasets that are not dimension scales.
Recommendation: make read/write behavior the same for dimension scales as it is for normal datasets.
Should we have both public and private dimension scales? It might be useful in some cases to be able to hide dimension scales from the top level, but this creates a more complicated model for users: sometimes all dimension scales would be visible, sometimes some would be visible and some not, and sometimes none would be visible.
Recommendation: make all dimension scales public.
(We recognize that most of the HDF staff recommended otherwise, so this is probably a controversial recommendation.)
How do we avoid confusing dimension scales with other datasets? One disadvantage of using datasets for dimension scales is that dimension scales might be mistaken for something other than what they are. To lessen the confusion, we could specify a dimension scale class attribute (e.g. CLASS = “DIMENSION_SCALE”), or this information could be carried in a header message.
Recommendation: use a CLASS attribute to specify that a dataset is to be interpreted as a dimension scale.
Should an attribute be used for storing the CLASS information? Since a similar approach is used to identify images and tables, we recommend that an attribute be used. This also makes the information apparent to higher level views of datasets, such as those provided by HDFView.

Recommendation: use an attribute for storing the CLASS information.
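As a sketch of this convention, the following C fragment attaches a CLASS attribute with the value “DIMENSION_SCALE” suggested above to a dataset (1.8-style HDF5 signatures assumed):

#include <string.h>
#include "hdf5.h"

/* Sketch: mark a dataset as a dimension scale with a CLASS attribute,
   following the convention already used for images and tables. */
herr_t set_scale_class(hid_t dset_id)
{
    const char *value = "DIMENSION_SCALE";

    hid_t space_id = H5Screate(H5S_SCALAR);
    hid_t type_id  = H5Tcopy(H5T_C_S1);
    H5Tset_size(type_id, strlen(value) + 1);   /* fixed-length string */

    hid_t  attr_id = H5Acreate(dset_id, "CLASS", type_id, space_id,
                               H5P_DEFAULT, H5P_DEFAULT);
    herr_t status  = H5Awrite(attr_id, type_id, value);

    H5Aclose(attr_id);
    H5Tclose(type_id);
    H5Sclose(space_id);
    return status;
}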
Dimension scales are often referred to by name, so we recommend that dimension scales be allowed to have names. Since some applications do not wish to apply names to dimension scales, we recommend that dimension scale names be optional.

We also recommend that a dimension scale have at most one name. If a dimension scale has a single name, bookkeeping involving the scale name may be significantly simplified, as would the model that applications must deal with. On the other hand, some applications may want to use different names to refer to the same dimension scale, but we do not consider this a capability that the HDF5 library needs to provide. (See further discussion of this in section 4.)
How is a name specified? Three options seem reasonable: (1) the last link in the pathname, (2) an attribute, (3) a header message.

We tentatively recommend a header message for storing a dimension scale name, but would like to hear other opinions.
Should dimension scale names be unique among dimension scales within a file?
We have seen a number of cases in which applications need more than one dimension scale with the same name. We have also seen applications where the opposite is true: dimension scale names are assumed to be unique within a file. One way to address this is for the library not to enforce uniqueness, and leave it to applications to enforce a policy of uniqueness when they need it. We recommend this approach.
We recommend against requiring that dimension scale names be unique within a file.
What new information is stored in the dataset about its dimension scales? The following have been suggested at one time or another:
· Reference to dimension scale
· Name(s) of dimension scale
· Units for dimension scale
· Current and maximum sizes of dimension scale
· Mapping between dimension and dimension scale
Reference to dimension scale
A dataset needs some way to identify its dimension scales. A dataset reference provides an unambiguous identifier and is very compatible with the HDF5 programming model, so we recommend it.
We recommend that a reference to a dimension scale be stored in a dataset.
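A minimal C sketch of this approach follows. The attribute name "DIMENSION_REF" and the scale path "/time_scale" are placeholders, and a full design would need to store one reference (or a list of references) per dimension:

#include "hdf5.h"

/* Sketch: store an object reference to a dimension scale in the
   referring dataset. */
herr_t store_scale_reference(hid_t file_id, hid_t dset_id)
{
    hobj_ref_t ref;
    if (H5Rcreate(&ref, file_id, "/time_scale", H5R_OBJECT, -1) < 0)
        return -1;

    hid_t  space_id = H5Screate(H5S_SCALAR);
    hid_t  attr_id  = H5Acreate(dset_id, "DIMENSION_REF", H5T_STD_REF_OBJ,
                                space_id, H5P_DEFAULT, H5P_DEFAULT);
    herr_t status   = H5Awrite(attr_id, H5T_STD_REF_OBJ, &ref);

    H5Aclose(attr_id);
    H5Sclose(space_id);
    return status;
}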
Storing dimension scale name in original dataset.
It has been asserted that it would be useful to allow dimension scales to have a local name – a name that is stored in a dataset together with the reference to a dimension scale. This would allow easy look-up of the dimension scale name, it would allow a dataset to name a dimension without pointing to a dimension scale, and it would also allow applications to assign local names that differ from a dimension scale’s own name.
Regarding the first reason (easy look-up), we do not know how important this would be. Without it, a dimension scale’s header must be read in order to find the scale’s name. If this extra look-up is a problem, applications could, of course, provide their own solutions in a variety of ways.
Regarding the second reason (axis name without attached scale), this seems to be a fundamental need, and is cited in example #5 in [2].
Regarding the third reason (alternate names), we have seen no compelling example of this requirement. (We welcome examples showing why a different local name would be good.)
The arguments against using local names have to do with complexity and bookkeeping requirements. The use of a different local name could make the model more confusing for users, and would probably require an extra query function for both the local name and the real name. The bookkeeping requirement would seem to be minor if there is no requirement that the local name be the same as the real name.
Recommendation: allow dimension scale names to be stored optionally, but consider not enforcing consistency between the name stored in the dataset and the name in the scale itself.
Units

Support for units has been requested, but this would seem to be beyond the scope of the current task, so it is not recommended.
Current and max size
The main argument for including these is quick look-up. The main arguments for omitting them are the complexity and potential performance costs of keeping dataset dimensions synchronized with their corresponding dimension scales, absent any compelling argument for the value of such synchronization. We are not recommending that this information be tracked at this time.
One-to-many mapping
When there are fewer values in a dimension scale than in the corresponding dimension, it is useful to have a mapping between the two. This is used by HDF-EOS to map geolocation information to dimensions. On the other hand, the way that mappings are defined can be very idiosyncratic, and it would seem to be very challenging to provide a mapping model that satisfied a large number of cases. Hence, we are not recommending that mappings be included in the model.
Given the design described above, datasets can share dimensions. The following additional capabilities would seem to be useful.
1. When a dimension scale is deleted, remove the reference to the dimension scale in all datasets that refer to it.
2. Determine how many datasets are attached to a given dimension scale.
3. Determine what datasets are attached to a given dimension scale.
These capabilities can be provided in several ways:
a) Back pointers. If every dimension scale contained a list of back pointers to all datasets that referenced it, then it would be relatively easy to open all of these datasets and remove the references, as well as to answer questions #2 and #3. This would require the library to update the back pointer list every time a link was made.
b) Alternatively, such lists could be maintained in a separate table. Such a table could contain all information about a number of dimension scales, which might provide a convenient way for applications to gain information about a set of dimension scales. For instance, this table might correspond to the coordinate variable definitions in a netCDF file.
c) If no such list were available, an HDF5 function could be available to search all datasets to determine which ones referenced a given dimension scale. This would be straightforward, but in some cases could be very time consuming.
d) Finally, it could be argued that these capabilities are so unlikely to be needed that HDF5 need not provide a solution. When a dimension scale is deleted, dangling references would occur, but that is already possible for any dataset that uses object references. And if this were a problem, an application could always implement any of the three solutions on its own, although this would require more knowledge of HDF5 than might be available.
We recommend against supporting these three additional capabilities in the library.
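As an illustration of option (c) above, an application could scan a file itself using the existing HDF5 iteration API. A rough sketch, reusing the hypothetical "DIMENSION_REF" attribute from the earlier example and scanning only the root group (a real version would recurse into subgroups and handle several references per dataset):

#include <string.h>
#include "hdf5.h"

/* Sketch of option (c): count the datasets whose "DIMENSION_REF"
   attribute points at a given scale. */
typedef struct { hobj_ref_t target; int count; } scan_t;

static herr_t visit(hid_t loc, const char *name, const H5L_info_t *info,
                    void *op_data)
{
    scan_t *scan = (scan_t *)op_data;
    (void)info;                           /* unused                    */
    hid_t dset = H5Dopen(loc, name, H5P_DEFAULT);
    if (dset < 0)
        return 0;                         /* not a dataset: skip it    */
    if (H5Aexists(dset, "DIMENSION_REF") > 0) {
        hobj_ref_t ref;
        hid_t attr = H5Aopen(dset, "DIMENSION_REF", H5P_DEFAULT);
        if (H5Aread(attr, H5T_STD_REF_OBJ, &ref) >= 0 &&
            memcmp(&ref, &scan->target, sizeof ref) == 0)
            scan->count++;                /* this dataset uses the scale */
        H5Aclose(attr);
    }
    H5Dclose(dset);
    return 0;                             /* 0 = continue iterating    */
}

int count_datasets_using_scale(hid_t file_id, hobj_ref_t scale_ref)
{
    scan_t scan = {scale_ref, 0};
    H5Literate(file_id, H5_INDEX_NAME, H5_ITER_NATIVE, NULL, visit, &scan);
    return scan.count;
}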
One dataset feature has been recommended that is not currently available: datasets represented by functions of their indices, which we tentatively call formula datasets.
More formally, consider the following definitions:

D – a dataset array.
r – the rank of D.
dim_i – the size of the ith dimension of D.
v – an index vector for D: a vector of r integers, v = (i_1, i_2, …, i_r), with i_j ≤ dim_j for each j; i_j refers to the jth element of v. The components of v identify the coordinates of an element in D.

If D is an r-dimensional standard dataset, D(v) is the element of D whose coordinates are (i_1, i_2, …, i_r).

If D is an r-dimensional formula dataset, there is a function f(v) defined for D, and D(v) is f(i_1, i_2, …, i_r).

What formulas should be allowed? The design proposed in [3] has three options: linear, logarithmic, and other. We propose that the design should accommodate expansion, but that the first implementation be restricted to linear expressions only, that is, expressions of the form

f(v) = a_1 i_1 + a_2 i_2 + … + a_r i_r + k, where the a_j and k are constants.
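A sketch of how a library or application might evaluate one element of such a linear formula dataset, given the stored coefficients (how the coefficients would be stored in the file is left open here):

#include "hdf5.h"

/* Sketch: evaluate f(v) = a_1*i_1 + ... + a_r*i_r + k for one element
   of a hypothetical linear formula dataset of rank r. */
double formula_element(int r, const double *a, double k, const hsize_t *v)
{
    double val = k;
    for (int j = 0; j < r; j++)
        val += a[j] * (double)v[j];
    return val;
}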
Recommendation: extend the dataset model to allow a new storage option whereby datasets can be represented by a function, or formula.
Further study would seem to be advisable before proceeding with this option.
If formulas are permitted for datasets, it may also be valuable to enable attributes to have the “formula” option.
Recommendation: study the possibility of allowing formulas to be used for attributes.
1. McGrath, Robert E. “Dimension Scales in HDF5: Preliminary Ideas.” May 2001. http://hdf.ncsa.uiuc.edu/RFC/ARCHIVE/DimScales/H5dimscales.htm.
2. McGrath, Robert E. “Needed: A convenience API to Support Dimensions in HDF5.” July 2001. http://hdf.ncsa.uiuc.edu/RFC/ARCHIVE/DimScales/H5dims.htm.
3. Koziol, Quincey. “Coordinate Systems in HDF5.” A set of slides.
4. Folk, Mike. “Should Dimension Scales be basic HDF5 constructs or higher level constructs?” May 2004. http://hdf.ncsa.uiuc.edu/RFC/ARCHIVE/DimScales/How_H5dimscales.htm.
[1] The current HDF5 proposal requires a different coordinate system for each different combination of dimensions. This complicates bookkeeping quite a bit, and could lead to confusion. This may be fixable, but that will take time.