Should Dimension Scales be basic HDF5 constructs
or higher level constructs?
Mike Folk
(with input from Bob McGrath)
4th draft
(Changes in this draft: Comments by staff added at the end.)
We have been trying for some time to determine how dimension scales should be supported in HDF5. Bob wrote some documents about this in 2002, and Quincey has been working on a requirements and design document for more than a year. Their approaches and conclusions seem, to me, to be quite different. Here is my interpretation of the general approaches and conclusions of these two studies, so far.
Approach A. This approach is described in Bob’s documents at http://hdf.ncsa.uiuc.edu/RFC/ARCHIVE/DimScales/. These papers describe a number of different types of scales, and suggest an implementation and API similar to HDF4, but with some enhancements. Unlike with HDF4, they would be implemented as special versions of datasets, similar to the way that images in HDF5 are special versions of datasets. Any special use of dimension scales, such as sharing of scales among different datasets or treatment of dimension scales in the context of a particular coordinate system, would be handled by a higher level application, such as netCDF, SAF, or IBM’s Data Explorer. Bob proposes a limited implementation that might later be revised, and could be adapted in a variety of ways to different kinds of implementations.
Approach B. This is described by Quincey’s documents. It is much broader and more general, and goes beyond the initial idea of dimension scales. Whereas a simple one-dimensional dimension scale can be used to locate values in some instances, it is not sufficient in others, such as a curvilinear grid; supporting such cases requires multidimensional structures. This more ambitious objective, perhaps together with other discoveries, has led to the idea that “dimension scales” need to be supported in the context of coordinate systems. Having identified a number of use cases, Quincey’s documents show how these use cases can be handled by describing coordinate systems and their relationships to the structures that contain location information. They also seem (to me) to take the position that these constructs should be closely tied to dataspaces, and that the HDF5 data model would define the structures and operations available for instantiating coordinate systems and corresponding dimension scales.
We have identified a number of general questions that each approach deals with, and that help us compare the two approaches. The following summarizes the two approaches with respect to these questions.

a) How should coordinate systems be handled?
Approach A (Bob): Not covered. Leaves that up to the application.
Approach B (Quincey): Might support coordinate systems, and if so, would enforce connections between dimension scales and coordinate systems. Would allow multiple coordinate systems to be attached to a dataspace.

b) Do we need dimension scale support in HDF5?
Approach A (Bob): Yes, they are needed in order to (a) support netCDF, and (b) provide a way for HDF4 files to be converted fully to HDF5.
Approach B (Quincey): Yes. Same reasons. Plus there are many other uses of dimension scales that users would find useful.

c) What structures should be available for dimension scales?
Approach A (Bob): 1-D datasets. A few new dataset types, such as a generated dataset.
Approach B (Quincey): Same, but also multi-D, to accommodate curvilinear spaces.

d) How are dimension scales to be related to the HDF5 data model and implementation?
Approach A (Bob): Not part of the HDF5 data model. Instead, a high level API implements simple dimension scales. The goal is modest: to accommodate the HDF4 and netCDF models. Other applications may also find it useful.
Approach B (Quincey): Becomes an integral part of the HDF5 data model. Extends the dataspace and dataset model to include dimension scales, and possibly coordinate systems. Shared dataspaces support this concept. Supports netCDF and HDF4, but much more.

e) Need for dimensions to be global to all variables.
Approach A (Bob): Could be supported using an approach similar to the HDF5 netCDF prototype.
Approach B (Quincey): Handled by having shared dataspaces and shared dimensions.

f) How should shared dimension scales be handled?
Approach A (Bob): Uses the h4toh5 conventions. However, there is currently no way to handle 'shared names'. Also, since attributes are used to maintain the relationships, the deletion API has to take care of this. It could be done by adding a table of which datasets are using which dimensions, but this has not been specified.
Approach B (Quincey): Sharing is handled through the dataspace and dimension sharing mechanism. (I need to study this more to understand how it works.) 'Shared names' are automatic because an 'axis' has a name. Not sure how this will work in terms of the API – probably the same as Approach A.

g) How are unlimited dimensions coupled to scales?
Approach A (Bob): No automatic coupling between dataset extension and dimension extension. The library would have to support this in some way.
Approach B (Quincey): Extensions are automatically handled by the model. This is explained in Example #6. (I’m not clear about how this happens, and what role the API plays.)
At issue are the following seven questions:
a. How should coordinate systems be handled?
b. Do we need dimension scale support in HDF5?
c. What structures should be available for dimension scales?
d. How are dimension scales to be related to the HDF5 data model and implementation?
e. Need for dimensions to be global to all variables.
f. How should shared dimension scales be handled?
g. How are unlimited dimensions coupled to scales?
Quincey suggests: in order to avoid the term “coordinate system” and its possible meanings, why not use “location” or “location system”?
a) How should coordinate systems be handled?
Approach A: Not covered. Leaves that up to the application.
Approach B: Might support coordinate systems, and if so, would enforce connections between dimension scales and coordinate systems. Would allow multiple coordinate systems to be attached to a dataspace. Quincey: possibly drop the “coordinate system” part of the model and just use the “axis” and “scale” parts as a more powerful version of HDF4/netCDF dimensions.
The idea of including a coordinate system raises a number of questions.
Should a file format describe coordinate systems? Since scientific applications often assume a strong connection between dimension scales and coordinate systems, it can be argued that scientific formats should model that connection. Indeed, coordinate systems sometimes are represented in file formats, either implicitly or explicitly. For example, the SAF data model allows for explicit descriptions of coordinate systems. In the FITS format there are implicit assumptions about the coordinate space that dimension scales refer to. However, these formats represent more domain-specific data models than HDF5.
How should dimension scales be associated with coordinate systems? Each domain has its own view of what coordinate systems mean, and of the kinds of information that are needed to give spatial information about values based on their coordinate systems. We have a very poor understanding of the range of ways that this is done. We can prescribe certain common ways to do it, but then we leave out those applications that have their own ways to do it.
Do all dimension scales even need a corresponding coordinate system? Although dimension scales are often understood in the context of a coordinate system, this is not always the case. For instance, a nominal scale (e.g. a list of category names) need not refer to any coordinate system.
How difficult would a comprehensive coordinate system model be? We do not know. In the documents that we have so far, we have not seen anything close to a general model of coordinate systems. We have some examples, and our implementation supposedly addresses the needs of those examples, but no case has been made that these examples are comprehensive.
Would our implementation be the “right” implementation for all applications? We do not know. We have developed some use cases involving coordinate systems, but these use cases may just be the tip of the iceberg. Except for a few examples, such as HDF-EOS, SAF and netCDF, we have not examined other implementations of coordinate systems to determine what data structures and what coordinate system metadata users find necessary to achieve the information and performance goals that they need.
Are we the right people to develop a coordinate system model? Although we are acquainted with users who use coordinate systems, we have not demonstrated that we understand fully what their information and performance requirements are, nor that we understand the mathematical requirements of their coordinate system usage. We do not currently have the knowledge to design a comprehensive coordinate system model.
Are coordinate systems within the scope of HDF5? Most coordinate systems are about physical phenomena, which is beyond the scope of what HDF5 is supposed to cover. This is not to say that HDF5 should never incorporate information about physical phenomena, but letting the scope of HDF5 creep in this way needs to be done with extreme caution. Adding coordinate systems based on the small amount of research that we have done is not advisable.
Do our users need coordinate systems in the basic HDF5 data model? Our users have done well without coordinate system support in the base library. It is not something that has been requested often. That said, there are many users who need to represent coordinate system information in some form, such as HDF-EOS users, and in that case HDF-EOS provides the necessary model and API. It seems reasonable, at least for now, to let applications such as SAF and netCDF do a good job supporting those coordinate systems.
My recommendation: Given the complexity of the problem and our lack of understanding of the topic, we are not ready to support coordinate systems in HDF5 at any level.
Bob comments: I would note that if dimension scales are managed by a high level library, it is very natural to then create one or more high level libraries that do coordinates. In general, I agree with Mike's conclusion that we are not ready to support coordinate systems in HDF5.
b) Do we need dimension scale support in HDF5?
Approach A (Bob): Yes, they are needed in order to (a) support netCDF, and (b) provide a way for HDF4 files to be converted fully to HDF5.
Approach B (Quincey): Yes. Same reasons. Plus there are many other uses of dimension scales that users would find useful.
If coordinate systems are not recommended, why should dimension scales be? Like coordinate systems, dimension scales are more than domain-independent file structures, and if coordinate systems don’t belong in HDF5, perhaps neither do dimension scales.
The role of HDF4 is particularly relevant here. HDF4-to-HDF5 conversion needs some way to convert dimension scales. Some HDF4 users seem unwilling to move to HDF5 because it does not have dimension scales, which they use heavily in HDF4. Beyond HDF4-to-HDF5 conversion, users want to carry forward their programming model and the general structure of their applications. Dimension scales are the last major feature of HDF4 that has no support at all in HDF5.
The netCDF project is also relevant. Early on, we assumed that a high level HDF5 implementation of dimension scales, similar to the image API, would be used for the netCDF implementation. Thus far this hasn’t happened, and indications are that it is not necessary. Probably because no implementation was available, the first netCDF prototype (based on netCDF 3) implements dimension scales without their being supported at any level within HDF5.
HDF-EOS5 also needs dimension scales. Unfortunately, it may be too late for HDF-EOS5 to change. Also, DODS, VisAD, and other software need dimensions to work reasonably.
In conclusion, it seems that we still do need dimension scale support in HDF5.
c) What structures should be available for dimension scales?
Approach A: 1-D datasets. A few new dataset types, such as a generated dataset.
Approach B: Same, but also multi-D, to accommodate curvilinear spaces.
In Approach A, dimension scales are 1-D datasets. In Approach B, they could have higher dimensions, as would be used to describe curvilinear structures. However, there is nothing in the design of Approach A that precludes 2-D or higher datasets, so this restriction is not essential. Except for the curvilinear case, the approaches seem to be close to agreement on this topic – the types of dimension scales that should be available.
Both do a good job of listing the structures that our users have identified. A good case is also made that new structures that support dimension scales, such as generating functions, are useful for datasets as well. Indeed, taken out of the context of coordinate systems, dimension scales can be thought of as data sets with possible restrictions on them that constrain how they are to be interpreted and used.
Bob also comments: “I remain lukewarm on "generating functions". Essentially, this feature requires storing a description of an algorithm, and I don't think we have a good grasp of this problem.”
d) How are dimension scales to be related to the HDF5 data model and implementation?
Approach A (Bob): Not part of the HDF5 data model. Instead, a high level API implements simple dimension scales. The goal is modest: to accommodate the HDF4 and netCDF models. Other applications may also find it useful.
Approach B (Quincey): Becomes an integral part of the HDF5 data model. Extends the dataspace and dataset model to include dimension scales, and possibly coordinate systems. Shared dataspaces support this concept. Supports netCDF (what about HDF4?), but much more.
Initially we tried to come up with a simple answer to this question, but we now see that there may be some aspects of dimension scales that belong in the data model and some that do not. We cover those in the later discussion; here, let us examine the question in a more general sense.
Are dimension scales like the HDF5 “images”? HDF5 supports raster images, which, it could be argued, are also beyond the scope of HDF5. Images are supported as higher level objects, with an “image” API. From the beginning of HDF5’s development, it was assumed that HDF5 would support images and similar specialized objects in this way. The goal of the HDF5 image model was to develop an HDF5 standard for simple types of images, and the purpose of the API is to enforce the standard and make it easy for users to read and write images using the standard. No attempt was made to make the image model comprehensive or rigorous, but just to provide something that we knew many users would find convenient.
Like images, it seems possible that a simple dimension scale model could be defined and implemented that would still be generally useful. And also as with images, perhaps we do not need a comprehensive model or implementation of dimension scales. If an application needs to treat dimension scales differently from the way we do, that should be acceptable, and our implementation should not get in the way of that.
We have identified a number of criteria for comparing the two different approaches (dimension scales in the base format vs. in a higher level library). The following summarizes the two approaches with respect to these criteria.

Visibility outside the dataset context
Base format:
· Pro: If implemented as new object types, dimension scales are less likely to be visible and to be confused with datasets by a naïve user.
Higher level:
· Con: Could be confused by a naïve user with other datasets that have other purposes.

Performance
Base format:
· Pro: Optimizations can be made that apply to all uses.
· Pro: Back pointer management will be fast. (QK claim.)
· Bob comments: In general, the performance claims are not convincing, since there is limited data (especially, there is no data on the usage of the features).
· Con: Complex relationships have to be maintained, even when scales are not needed.
Higher level:
· Pro: If necessary, application-specific implementations can be done to improve performance.

Appropriateness of model
Base format:
· Pro: Fits the use cases that have been identified. (The swath case seems unconvincing.)
· Pro: Users for whom it is the right model will like having it be the one and only model.
· Con: We don’t know what other requirements will occur, and how adaptable the model will be to them.
Higher level:
· Pro: Fits the netCDF and HDF4 use cases, which were the original goal.
· Pro: For applications for which the existing model is inappropriate, there will be less confusion.
· Pro: An application can replace it with another model or implementation if other requirements occur.

Maintainability
Base format:
· Pro: As part of the base code, should be readily maintainable.
· Pro: Don’t have to maintain yet another HL library.
· Con: Incorporation of many conventions increases code complexity.
· Bob comments: The complexity claims are a concern, especially because they impact the schedule. But mainly we are considering where to place the complexity, and therefore the risk. As Mike points out, the core library is much higher risk.
· Con: Hard to change or abandon for a different or extended model.
Higher level:
· Pro: As a separate, stand-alone library, should be readily maintainable.
· Con: Have to maintain yet another HL library.
· Con (Quincey): May have to maintain (or transition) both high-level and base format versions…
· Pro: The first implementation should be a prototype, and a prototype built on top of the library can be revised more easily and over a longer time period.
· Pro: Easier to extend or abandon.
· Pro: Does not have to incorporate so many conventions because it doesn’t have to conform to dataspace requirements. (Not sure about this one.) Quincey: ?

Ease of implementation
Base format:
· Con: Must interoperate correctly with dataspace and dataset innards.
Higher level:
· Pro: Can be implemented without touching library innards.
Bob comments: In my view, there are several crucial features that are awkward to support in a high level library, namely items e, f, and g below.
e) Need for dimensions to be global to all variables.
Approach A (Bob): Could be supported using an approach similar to the HDF5 netCDF prototype.
Approach B (Quincey): Handled by having shared dataspaces and shared dimensions.
Bob comments: There is only one way to do this, with some kind of global state that is stored in the file. The decision is whether the core format and library do it, or a high level profile. IMO, it is far better to have the library manage this name space, for reliability and transparency to the user. But the high level solution should be OK, and is obviously lower risk.
f) How should shared dimension scales be handled?
Approach A (Bob): Uses the h4toh5 conventions. However, there is currently no way to handle 'shared names'. Also, since attributes are used to maintain the relationships, the deletion API has to take care of this. It could be done by adding a table of which datasets are using which dimensions, but this has not been specified.
Approach B (Quincey): Sharing is handled through the dataspace and dimension sharing mechanism. (I need to study this more to understand how it works.) 'Shared names' are automatic because an 'axis' has a name. Not sure how this will work in terms of the API – probably the same as Approach A.
Bob comments that the "high level" memo (Approach A) notes that "shared names" can't be supported (easily) by attributes. This is the case where you need to change the name of a dimension (e.g., from "laptitude" to "latitude"). If the dimensions are scattered about as attributes of datasets, this operation requires a global search and replace across all the datasets.
Mike’s new comments: I agree with Bob, and the table option seems like a good possibility.
A case against shared dimension scales in the base library:

Regarding the idea of sharing of objects in the format generally: Although the idea of “shared” objects, such as dataspaces and datatypes, is already part of the HDF5 design, I am not sure this will ultimately prove to have been an important feature in HDF5. Currently only datatype sharing is supported, and I have not noticed that it is used very much. It adds complexity to the format and library, and that complexity may not be worth the benefits.

Regarding the sharing of dimensions in the format: The same concerns apply. The relationships and structures that were used to provide general support for the various kinds of sharing (Quincey: what does this mean?)[1] described in Approach B were (to me) confusing and complicated, and I wonder if the benefits will be worth the added complexity. Separate, higher level implementations of these different kinds of sharing would likely be less complicated because they would individually focus on just certain kinds of sharing. The netCDF case illustrates this.
One could argue that separate high level implementations that supported different rules for sharing would be less compatible. For this reason, it could be argued that Approach B would be better because it would provide a single universal solution. Yes and no. It would provide a universal model that would support all desired rules for sharing (though this has not been proven), but then the individual sharing rules would still need to be enforced by some code somewhere.

One could also argue that separate high level implementations result in duplication of effort. This is a good point, and we should try to head off unnecessary duplication where possible by providing a small number (one or two) of libraries that provide most of what people need.
Other possibilities: It would be interesting to see what would happen if we did not try to satisfy all types of sharability. For instance, would it simplify matters appreciably if we assumed that all shared dimensions had to have precisely the same dimension sizes and dimension scale sizes?
g) How are unlimited dimensions coupled to scales?
Approach A (Bob): No automatic coupling between dataset extension and dimension extension. The library would have to support this in some way.
Approach B (Quincey): Extensions are automatically handled by the model. (I’m not clear about how this happens, and what role the API plays.)
Bob comments: Given a table of the dimensions, this can be implemented by the high level library. But it is probably more reliable and likely more efficient to do this at a lower level. If nothing else, the operation can be more precise, and can avoid wasted space.
Questions e, f, and g:
(e) need for shared dimensions to be global
(f) how should shared dimensions be handled?
(g) how are unlimited dimensions coupled to scales?
Bob comments: These features are going to be awkward to implement in a high level library. On the other hand, the risk and cost of putting them in the core library may be more important than the quality of the implementation of this feature. Therefore, I vote "I'm not sure".
Mike’s new comments: From Quincey’s survey of users, it seems that the library would need to support these three options in a variety of different ways, individually and in combination. The design that was presented can, perhaps, handle them all, but it is complex, would probably confuse applications, and may be difficult to implement correctly and maintain.

If these were implemented through higher level interfaces, each of those interfaces could implement the specific interpretation that it wanted. For instance, Nexus does not want dimensions to be shared, but netCDF does. Each could implement its own model. This is what was done with HDF4.

My vote, at this time, is not to implement all of these capabilities in the library, but to implement a high-level library that would facilitate the implementation by others of the different views that might be desired.
As to what HDF5 dimension scales should do, some criteria have been identified as important for dimension scales:
· They should support the netCDF and HDF4 data models.
· We need to produce an implementation soon, and it may already be too late to affect the netCDF implementation.
In order to determine how we should support dimension scales, there are still some questions to be addressed, including the following.
· How should dimension scales be associated with datasets?
· How should dimension scales be associated with one another?
· How should dimension scale sharing be supported?
· How should dimension scales be treated in dynamic situations, in which the corresponding data aggregates change? How should this apply to the associations among dimension scales, and to shared dimension scales?
Quincey has done a good job examining these questions, based on user input. He has developed use cases showing how different applications use dimension scales. I think we would all agree that, whatever model we implement for dimension scales in HDF5, it would be good if applications could build their own dimension scale implementations on top of this. At question, then, is how much we can reasonably provide in support of dimension scales, and how much should be left to applications.
· What are the applications that would build upon HDF5 dimension scales?
· If we develop and implement a model for associating dimension scales with datasets, and for sharing dimension scales, how do we know that this will be comprehensive enough for most future uses of the implementation?
· Given the complexity of the model, does it present a solution that users will find intuitive and easy to understand and use?
How should dimension scales be associated with datasets? Based on some user input, Quincey has done a good job of listing examples showing how dimension scales and datasets might be related. He has presented a UML-based model that accommodates those examples. In considering these results, I have the following concerns.
· We still haven’t really articulated what the model is. Quincey’s examples imply a model, but it is unclear what the underlying model actually is.
· From what we have seen so far, a user would find the model (if there were one) to be quite complicated. This may be due to the fact that we have only seen examples. One can’t be sure without first seeing a general model, or at least an API.
· The model is not necessarily comprehensive. It seems to speak to most of the examples that are given, but the universe of possible uses is much larger than these. Users are going to want to do things in ways that do not match the model, which may mean that the underlying software would have to be expanded (making it more complex).
· The model attempts to include coordinate systems. Each domain has its own view of what coordinate systems mean, and of the kinds of information that are needed to give spatial information about values based on their coordinate systems. We have a very poor understanding of the range of ways that this is done. We can prescribe certain common ways to do it, but then we leave out those applications that have their own ways to do it.
· In view of these concerns, the risk of changing the base library and format is high. The original plan was to implement a high level library, then, after some experience, possibly migrate some or all of the structures and functions into the format and base library. We could make a bad decision with a high level library, but the cost of a bad decision at this level is much lower.
Principle of simplicity: avoid unnecessary features and capabilities.
Principle of humility: recognize what others might be better at than you.
Principle of specialization: do a small number of things well and don’t try to do lots of things.
Principle of prototyping: the first version is always wrong in some fundamental way, and hence should be done in a way that the cost of re-doing it is low.
Several felt that coordinate systems should be supported in some way.
Some felt that it should be left to the application.
It was pointed out that John Caron (netCDF) is hoping something would be “pushed down to the disk.”
There seemed to be consensus that we do, though Elena is not convinced there is a real need.
Quincey pointed out that general functions would be really complex, but something like…
Unlimited dimensions and multidimensional scales were both recommended.
Should not be “either-or.”
Visibility outside the dataset context
Some don’t see as a problem.
QK says it could be confusing to user.
AC says be careful about pushing into name space. Would put in as basic object.
RM says quality would be higher if pushed further down.
PC says it would be less confusing.
EP says public ones would be visible (not sure what this means).
Performance
EP: Could be good or bad.
Maintenance.
RM says it’s easier to make a change to a separate module.
AC says when low level changes, high level gets out of synch and then needs changing.
AC says there’s a different standard and user commitment. If the same standard is applied, many cons disappear, so don’t implement the whole enchilada at the beginning.
PC feels it would be the same effort either way.
Ease of implementation.
PC: it’s just as hard to implement this at a high level as at a low level, and hence we should go for the base format.
PC: Forces people to use the HL libraries.
EP: Not difficult. API efficiency could be an issue.
PN: wrote HL dim scales already.
KY: should be easy to implement.
EP: it depends on what you implement.
KY: there should still be a little coordinate system.
AC: How about interoperability with other programming languages?
Will netCDF use it? Yes, if done soon enough.
KY (via email): I don't like the way H4toH5 handles dimensional scales when doing conversion. Here are some points from my recollections of doing the implementation; I hope this gives you some help.
1. For h4toh5, I have to link with at least two libraries, HDF4 and HDF5. If compression is turned on, I may also need external libraries such as zlib, jpeg, and szlib. It took me quite some time, with a lot of struggle and limited help from other developers, to learn how to configure this software. I think that's why people were very glad when h4h5tools was pulled out of the HDF5 library. Adding another high-level library on top of HDF5 for the 4-to-5 conversion will make things worse and harder to maintain.
2. Another thing I don't like is that I used object references to build links between HDF5 datasets and dimensional datasets. It is not straightforward to figure this out, and the content of the dimensional attribute doesn't make sense and sometimes gets changed accordingly.
3. The problem is not just with a high level library. Even if these are added to the base library, maintenance would be harder.
The coordinate system design that is proposed is not the same as the dimension space that is defined in a netCDF file. The current HDF5 proposal requires a different coordinate system for each different combination of dimensions. This complicates bookkeeping quite a bit, and could lead to confusion. This may be fixable, but that will take time.
There are also other concerns about coordinate systems, as described above.
We recommend against including coordinate systems at this time.
The following types of dimension scales are recommended:
1-D array scale. Stored as a 1-D array. Both fixed length and extendable arrays should be available.
Simple function. Stored as a function. Initial scope TBD, but should probably be linear.
Higher dimensional arrays. That is, arrays with dimension greater than 1. Interpretation can vary, depending on the number of dimensions. Recommendation: provide the link, but don’t offer specific methods for these arrays.
It is proposed that dimension scales be stored as datasets. This approach would seem to satisfy the data model described above, and has the advantages of simplifying the API and concentrating much of the code on one common structure, rather than two. Details of the special characteristics and operations for dimension scale datasets are covered in the next several sections.
We recommend that dimension scales be stored as HDF5 datasets.
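To make this concrete, here is a minimal sketch of creating a 1-D dataset that could serve as a dimension scale. The file name, dataset path, and values are hypothetical, and the calls use the present-day HDF5 1.8-style C API; nothing here is specific to the proposal beyond the idea that a scale is just a dataset.

    #include "hdf5.h"

    /* Sketch: create a 1-D dataset that will serve as a dimension scale. */
    int create_scale_example(void)
    {
        hid_t   file, space, dset;
        hsize_t dims[1] = {360};            /* e.g. 360 longitude values    */
        double  lon[360];
        int     i;

        for (i = 0; i < 360; i++)
            lon[i] = -179.5 + i;            /* fill with sample coordinates */

        file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        space = H5Screate_simple(1, dims, NULL);
        dset  = H5Dcreate2(file, "/lon_scale", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, lon);

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }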
We present the basic model for dimension scales in this section.
In the discussion that follows, the following notation will be used:
D – a dataset array.
r – the rank of D.
dim_i – the size of the i-th dimension of D.
v – an index vector for D. v is a vector of r integers, where v = (i_1, i_2, …, i_r), i_j ≤ dim_j. i_j refers to the j-th element of v. The components of v identify the coordinates of an element in D. That is, D(v) is the element of D whose coordinates are (i_1, i_2, …, i_r).
S_i – a 1-D array scale or simple function corresponding to the i-th dimension of D.
S_i(j) – the j-th element of array S_i if S_i is an array. If S_i is a function, S_i(j) is the value of that function at j, where j is a non-negative integer.
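As a small illustration (mine, not part of the notation above): under the common one-scale-per-dimension interpretation, a rank-2 dataset D with scales S_1 (say, latitude) and S_2 (say, longitude) locates each of its elements as

    \[ D(v),\quad v = (i_1, i_2) \;\longmapsto\; \bigl( S_1(i_1),\ S_2(i_2) \bigr). \]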
We want to provide a model that is easy to understand and use for the majority of applications, which we feel are likely to assign at most one dimension scale per dimension, and are likely to use only 1-D dimension scales.
At the same time, we want to place as few restrictions as possible on other uses of dimension scales. Hence, we don’t want to require dimension scales to be 1-D arrays, or to allow only one scale per dimension. That is, any number of dimension scales of any size or shape can be assigned to a given dimension.
In the interest of simplicity, we are recommending the following restrictions:
· Dimension scale names are unique. If an application needs to assign more than one name to a dimension scale, it can use HDF5 attributes to do so, but this concept will not be dealt with by the API.
· In the case where more than one scale is assigned to a dimension, the index value of a given scale will not be persistent when deletions and additions are performed. (Exact behavior is TBD.)
· No distinction will be made in the API between the simple case (one 1-D array per dimension) and the general case (any number of arrays of any dimension) described above. It will be left to the application to manage this.
· Because we don’t include coordinate systems in the proposed model, it does not provide all of the information needed in examples 7 and 10 (maybe) in [2]. In these cases, the coordinate system in [2] specifies which of several scales go with each dimension. The model proposed here would require the application to keep track of this information.
With this approach, it is the responsibility of applications to make sure that dimension scales satisfy the constraints of the model that they are assuming, such as constraints on the size of dimension scales and valid range of indices.
We recommend that the following functions be provided in the dataset API:
· Convert to scale (D) – convert dataset D to a dimension scale.
· Attach scale (D, S, i) – attach dimension scale S to the ith dimension of D.
· Detach scale (D, i, j) – detach the jth dimension scale from the ith dimension of D.
· Get number of scales (D, i) – get the number of scales assigned to the ith dimension of D.
· Get OID (D, i, j) – get the OID for the jth scale assigned to the ith dimension of D.
· Get info (S) – get info about dimension scale S (existence, size, name, etc.)
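For concreteness only, a possible C rendering of these operations is sketched below. The H5DSC prefix, function names, and argument types are invented for illustration and are not a settled API; they simply mirror the operations listed above.

    #include <stddef.h>
    #include "hdf5.h"

    /* Hypothetical prototypes mirroring the proposed operations.          */
    /* "did" is an ordinary dataset; "dsid" is a dimension scale dataset.  */

    herr_t H5DSCconvert_to_scale(hid_t dsid);                    /* Convert to scale   */
    herr_t H5DSCattach_scale(hid_t did, hid_t dsid, unsigned i); /* Attach S to dim i  */
    herr_t H5DSCdetach_scale(hid_t did, unsigned i, unsigned j); /* Detach j-th scale  */
    int    H5DSCget_num_scales(hid_t did, unsigned i);           /* Number of scales   */
    herr_t H5DSCget_oid(hid_t did, unsigned i, unsigned j,
                        hid_t *dsid_out);                        /* OID of j-th scale  */
    herr_t H5DSCget_info(hid_t dsid, char *name, size_t namelen,
                         hsize_t *size);                         /* Name, size, etc.   */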
Operations not recommended. Because dimension scales add meaning to datasets, it is reasonable to look for ways to maintain the proper relationships between datasets and their corresponding dimension scales. Two operations that have been suggested involve (1) automatically extending dataset dimensions, and (2) automatically deleting dimension scales. We recommend against supporting these operations in the library; applications can enforce them according to their needs. A discussion of this follows.
Automatically extending dataset dimensions. When a dimension of a dataset is extended, should the library automatically extend the corresponding dimension scale, or should this be left to the application? Since a dimension scale can be shared among many datasets, this raises a number of issues that are difficult to address in a general way. For instance, which dimension scale should be extended when only one dataset is extended, and what values are to be added? We have seen no compelling reason to implement an automatic extension of dimension scales when dataset dimensions are extended. This operation can be carried out by applications.
We recommend against automatic extension of dimension scales when dataset dimensions are extended.
Automatically deleting dimension scales. Should a dimension scale be deleted when all datasets that use it have been deleted? This is another case where different applications might have different requirements, so a general policy would be difficult to devise. Furthermore, enforcing a deletion policy, even a simple one, adds complexity to the library, and could also affect performance. Deletion policies seem best left to applications.
We recommend against automatic deletion of dimension scales in any circumstances involving dataset deletion.
Should the read/write behavior for dimension scales be different from that of other datasets? There may be some reasons for this, such as how fill values are treated and how dataset extension is dealt with, but we are not aware of any case where the benefits of special behavior would offset the advantages of keeping the behavior the same for dimension scales as for datasets that are not dimension scales.
We recommend that read/write behavior be the same for dimension scales as it is for normal datasets, with no special cases.
Should we have both public and private dimension scales?
It might be useful in some cases to be able to hide dimension scales from the top level, but this creates a more complicated model for users. Sometimes all dimension scales would be visible, sometimes some would be visible and some not, and sometimes none would be visible.
We recommend that all dimension scales be public. (We recognize that most of the HDF staff recommended otherwise, so this is probably a controversial recommendation.)
One disadvantage of using datasets for dimension scales is that dimension scales might be confused as being something other than what they are. To lessen the confusion, we could specify a dimension scale class attribute (e.g. CLASS = “DIMENSION_SCALE”), or it could be a header message.
We recommend using a CLASS attribute to specify that a dataset is to be interpreted as a dimension scale. We recommend an attribute for this because it is information that should be apparent to a higher level view.
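A minimal sketch of writing such an attribute follows. The attribute name and value (CLASS = "DIMENSION_SCALE") come from the recommendation above; the helper function and the HDF5 1.8-style attribute calls are illustrative only.

    #include <string.h>
    #include "hdf5.h"

    /* Tag an existing dataset (dset) as a dimension scale by attaching a
     * scalar string attribute CLASS = "DIMENSION_SCALE".                  */
    static herr_t tag_as_dimension_scale(hid_t dset)
    {
        const char *value = "DIMENSION_SCALE";
        hid_t  space = H5Screate(H5S_SCALAR);
        hid_t  type  = H5Tcopy(H5T_C_S1);
        hid_t  attr;
        herr_t status;

        H5Tset_size(type, strlen(value) + 1);   /* fixed-length C string */
        attr   = H5Acreate2(dset, "CLASS", type, space, H5P_DEFAULT, H5P_DEFAULT);
        status = H5Awrite(attr, type, value);

        H5Aclose(attr);
        H5Tclose(type);
        H5Sclose(space);
        return status;
    }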
Dimension scales are often referred to by name. We recommend that dimension scales have names.
Should dimension scale names be unambiguous? That is, should a given dimension scale have only one name? If a dimension scale has one name, bookkeeping involving the scale name may be significantly simplified, as would the model that applications must deal with. On the other hand, some applications may want to use different names to refer to the same dimension scale, but we do not consider this a capability that the HDF5 library needs to provide. (See discussion of this in section 4.6.)
We recommend that each dimension scale have at most one unambiguous name.[2]
If uniqueness is the rule, how is a name specified? Three options seem reasonable: (1) the last link in the pathname, (2) an attribute, (3) a header message.
We tentatively recommend a header message, but would like to hear other opinions.
Should dimension scale names be unique among dimension scales within a file? We have seen a number of cases in which applications need more than one dimension scale with the same name. We have also seen applications where the opposite is true: dimension scale names are assumed to be unique within a file. One way to address this is for the library not to enforce uniqueness, and to leave it to applications to enforce a policy of uniqueness when they need it.
We recommend that dimension scale names not be required to be unique within a file.
What new information about a dimension scale is stored in the dataset? The following have been suggested at one time or another:
· Reference to dimension scale
· Name(s) of dimension scale
· Units for dimension scale
· Current and maximum sizes of dimension scale
· Mapping between dimension and dimension scale
We recommend that only a reference to a dimension scale be stored in a dataset. A discussion follows of the reasons for and against omitting the other information.
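As a sketch of what “only a reference” could look like, the fragment below stores a standard HDF5 object reference to a scale in an attribute of the referring dataset. The attribute name DIMSCALE_REF_0 and the scale path are hypothetical conventions, not part of this proposal.

    #include "hdf5.h"

    /* Store, on dataset "did", an object reference pointing to the dimension
     * scale dataset at scale_path (e.g. "/lon_scale"), for dimension 0.     */
    static herr_t store_scale_reference(hid_t file, hid_t did, const char *scale_path)
    {
        hobj_ref_t ref;
        hid_t      space, attr;
        herr_t     status;

        status = H5Rcreate(&ref, file, scale_path, H5R_OBJECT, -1);
        if (status < 0)
            return status;

        space  = H5Screate(H5S_SCALAR);
        attr   = H5Acreate2(did, "DIMSCALE_REF_0", H5T_STD_REF_OBJ, space,
                            H5P_DEFAULT, H5P_DEFAULT);
        status = H5Awrite(attr, H5T_STD_REF_OBJ, &ref);

        H5Aclose(attr);
        H5Sclose(space);
        return status;
    }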
Storing dimension scale name in dataset. It has been asserted that it would be useful to allow dimension scales to have a local name – a name that is stored in a dataset together with the reference to a dimension scale. This would allow easy look-up of the dimension scale name, and would also allow applications to assign local names that differ from a dimension scale’s own name.
Regarding the first reason (easy look-up), we do not know how important this would be. Without it, a dimension scale’s header must be read in order to find the dimension scale’s name. If this extra look-up is a problem, then applications could, of course, provide their own solutions to this problem in a variety of ways. The question is whether it is enough of a problem that the library should store dimension names locally.
Regarding the second reason (alternate names), we have seen no compelling example of this requirement. (We welcome examples showing why a different local name would be good.)
The arguments against using local names have to do with complexity and bookkeeping requirements. The use of a different local name could make the model more confusing for users, and would probably require an extra query function for both the local name and the real name. The bookkeeping requirement would seem to be minor if there is no requirement that the local name be the same as the real name, but could increase if it is necessary to ensure that the two names are the same.
Units. We feel that the same arguments apply as for the name.
Current and max size. The main argument for including these is quick look-up. The main arguments for omitting them are the complexity and potential performance costs of keeping dataset dimensions synchronized with their corresponding dimension scales, without any compelling argument for the value of such synchronization.
One-to-many mapping. When there are fewer values in a dimension scale than in the corresponding dimension, it is useful to have a mapping between the two. This is used by HDF-EOS to map geolocation information to dimensions. On the other hand, the way that mappings are defined can be very idiosyncratic, and it would seem to be very challenging to provide a mapping model that satisfied a large number of cases.
Given the design described in Section 4.4, datasets can share dimensions. The following additional capabilities would seem to be useful.
1. When a dimension scale is deleted, remove the reference to the dimension scale in all datasets that refer to it.
2. Determine how many datasets are attached to a given dimension scale.
3. Determine what datasets are attached to a given dimension scale.
These capabilities can be provided in several ways:
a) Back pointers. If every dimension scale contained a list of back pointers to all datasets that referenced it, then it would be relatively easy to open all of these datasets and remove the references, as well as to answer questions #2 and #3. This would require the library to update the back pointer list every time a link was made.
b) Alternatively, such lists could be maintained in a separate table. Such a table could contain all information about a number of dimension scales, which might provide a convenient way for applications to gain information about a set of dimension scales. For instance, this table might correspond to the coordinate variable definitions in a netCDF file.
c) If no such list were available, an HDF5 function could be available to search all datasets to determine which ones referenced a given dimension scale. This would be straightforward, but in some cases could be very time consuming.
d) Finally, it could be argued that these capabilities are so unlikely to occur that HDF5 need not provide a solution. When a dimension is deleted, dangling references would occur, but that is already possible for any dataset that uses object references. And if this were a problem, an application could always implement any of the three solutions on its own, although this would require more knowledge of HDF5 than might be available.
We recommend against supporting these three additional capabilities in the library.
This is independent of dimension scales, but supports one desired feature that isn’t currently available – formula-generated values.
Two primary types: large and small. Each of these currently has two subtypes.
Large – separate from header
· Chunked – same as current chunked
· Contiguous – same as current contiguous
Small – included in header
· Compact – same as contiguous, but data contained in header
· Formula – formula stored in header, used to generate data
Suggest changing the names from “large” vs. “small” to something different. These terms are used because they usually distinguish the two classes, but not always; e.g. chunked datasets could actually be very small, and formula datasets could be large.
The layout class describes how much is actually stored, as compared to the size of the dataset. Fill values are returned when a request is made for data that hasn’t been stored yet. We may want to use a different term here, as this was confusing to me. Not sure what, yet, but we can think about it.
Chunked. Any changes from current implementation?
Contiguous. Any changes from current implementation?
Compact. Any changes from current implementation?
Formula. This is new and needs to be scoped. The proposed design has three options. We propose that the design should accommodate expansion, but that the first implementation be linear expressions only. At least, we can put that in the RFC and see what the response is.
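To make the “linear expressions only” idea concrete, here is a rough sketch of how a linear formula layout might generate values on read. The stored parameters, their names, and the evaluation rule are purely illustrative; no formula format has been specified.

    #include <stddef.h>

    /* Hypothetical linear "formula" layout: instead of storing element values,
     * the header stores (origin, increment) and values are generated on read:
     *     value[i] = origin + increment * i
     */
    typedef struct {
        double origin;      /* value of element 0             */
        double increment;   /* step between adjacent elements */
    } linear_formula_t;

    /* Fill buf[0..count-1] with elements start .. start+count-1 of the scale. */
    static void read_linear_formula(const linear_formula_t *f,
                                    size_t start, size_t count, double *buf)
    {
        size_t i;
        for (i = 0; i < count; i++)
            buf[i] = f->origin + f->increment * (double)(start + i);
    }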
As with datasets, extend attributes by adding “formula” option.
Old stuff
Our assumption is that the great majority of applications will use this representation. The structures should be sufficiently constrained that their meaning is easily comprehended and similar to current models, such as netCDF, CDF and HDF4. The API should be easy to use, with few options, and requiring only simple querying.
We also propose that the simple model allow at most one dimension scale per dataset dimension. We place no restrictions on the size of dimension scales – they may be larger than or smaller than the corresponding dimensions of the dataset. More formally:
· For each dimension of D, there can be at most one dimension scale.
· A dimension scale may be either a 1-D array or a simple function scale.
The following functions are available in the dataset API:
· Convert a dataset to a dimension scale.
· Attach dimension scale Si to the ith dimension of D.
· Detach dimension scale Si from the ith dimension of D.
· Get OID (D, i) – get the OID of Si.
· Get info (S) – get info about dimension scale S (existence, size, name, etc.)
Notes:
· We place no explicit meaning on how these operations are to be interpreted, including valid ranges for indices or assumptions about how elements in dimension scales correspond to those in a dataset. An application such as netCDF will impose constraints, such as on the size of dimension scales and valid range of indices.
· We need to determine how the simple API responds when it encounters dimensions that do not adhere to the simple case.
· How do we handle the case where some dimensions adhere to the simple case and some to the general case?
In the general case, we want to provide few restrictions on the meaning or use of dimension scales. In this case, dimension scales can be any size or shape, there may be any number of dimension scales attached to a given dimension, and no assumptions are made regarding the mappings between dimension scales and their corresponding dataset dimensions. More formally:
Let D be a dataset of rank r. The following conditions apply:
· Any number of dimension scales can be attached to a given dimension of D.
· For each dimension of D, any valid dataset can be defined as a dimension scale.
The following new functions are recommended:
· Get number of scales (D, i) – get the number of scales assigned to the ith dimension of D.
· Get OID (D, i, j) – get the OID for the jth scale assigned to the ith dimension of D.
Notes:
· The API to support this generality may be complex. For instance, allowing multidimensional dimension scales may require the user to query as to the shape of a dimension scale.
· This approach may present difficulties in different language implementations.
· Because we don’t include coordinate systems in the proposed model, it does not provide all of the information available in examples 7 and 10 (maybe) in [2]. In these cases, the coordinate system in [2] specifies which of several scales go with each dimension. The model proposed here would require the application to keep track of this information.
Goal: Allow the general case, but eliminate the need for a separate API.
In this approach, we merge the two APIs, essentially adding the three general-case query functions to the simple case: Get number of scales, Get OID, and Get info. The simple routines would return an error for those cases where the dimension scale did not satisfy the constraints given.
[1] It means there are a number of different combinations of dimensions and dimension scales that can be shared. Here are five examples: (1) two dataspaces may share exactly the same dimensions in exactly the same way; (2) two dataspaces may share the same dimensions but in different orders; (3) two dataspaces may share some dimensions but not others; (4) a given shared dimension scale may map 1-1 to the corresponding array in one dataset, but 10-1 in another and 2-1 in another; and (5) a shared dimension scale may have an actual length that differs from the corresponding dimension in one dataset but not in another.
[2] This feature could also be made available to all datasets, as was proposed in the original HDF5 design.