Strawhorse: NetCDF to HDF5 mapping


Robert E. McGrath
NCSA
July 10, 2003

Introduction


This note sketches some preliminary ideas on how netCDF4 should be mapped to HDF5. This is intended as a model for the kind of abstract mapping I believe we need to create, and as a starting place.  It is not intended as a final statement on any design issue.

In earlier work, a basic design was proposed [Yeager 99]. This note follows the earlier work, with some important modifications.

The overall goal is to define a standard layout for netCDF objects in HDF5. The underlying assumptions are:
  1. The netCDF4 profile will implement extensions of the netCDF3 programming model, storing and retrieving data from HDF5 objects.
  2. In some cases, there may be additional HDF5 objects in the file, i.e., objects not created or managed by the netCDF4 layer. Software that conforms to the netCDF4 profile should ignore these objects.
  3. The netCDF4 layer may need to store and retrieve objects besides the user-defined data objects, e.g., tables to index all the netCDF variables in the file.

General layout of the HDF5 File

At the global level, one approach is to segregate all the 'netCDF' objects, i.e., the objects defined by the netCDF profile and managed by the netCDF layer, in one or a few conventional directories (HDF5 groups). The rule would be:
If an application wants to create additional (non-netCDF) objects, it is free to do so, but should place them outside the netCDF groups.

As an initial suggestion, the netCDF objects should be stored under two HDF5 groups: /netCDF and /netCDF.meta.
So, when the calling program creates a variable, it will be stored in an HDF5 Dataset under /netCDF. If the netCDF library needs to store a table, e.g., of all variables in the file, it will be stored in an HDF5 Dataset under /netCDF.meta.
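The two-group layout can be sketched as a pair of path helpers. This is only an illustration: the group names come from the text above, but the function names and the rule that names may not contain '/' are assumptions, not part of any proposed specification.

```python
import posixpath

# Group names taken from the text above; the helper names are hypothetical.
USER_GROUP = "/netCDF"        # user-defined variables
META_GROUP = "/netCDF.meta"   # library-managed tables and indexes


def _check_name(name):
    # Assumption: a netCDF name maps to a single HDF5 link name, so no '/'.
    if not name or "/" in name:
        raise ValueError("name must be non-empty and contain no '/'")


def variable_path(name):
    """Return the HDF5 path for a user-defined netCDF variable."""
    _check_name(name)
    return posixpath.join(USER_GROUP, name)


def metadata_path(name):
    """Return the HDF5 path for a library-internal object."""
    _check_name(name)
    return posixpath.join(META_GROUP, name)


def is_netcdf_object(path):
    """True if an HDF5 path lies under one of the two netCDF groups."""
    return path.startswith((USER_GROUP + "/", META_GROUP + "/"))
```

A conforming reader could use a test like is_netcdf_object() to skip anything an application has stored elsewhere in the file.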

[Figure: netCDF4 diagram]

Mapping of netCDF objects to HDF5 objects

Fundamentally, a netCDF variable can be stored as an HDF5 dataset. Table 1 shows the main concepts.

Table 1.

netCDF          HDF5
Variable        Dataset
  name          ? (TBD)
  Dimensions    Dimensions (TBD), Dataspace
  Datatype      Corresponding HDF5 Datatype
  Data          Data

Several issues are open:

1. A convention must be established for naming

HDF5 Datasets are identified by path names, so it would be possible to store a variable named 'foo' as an HDF5 Dataset named '/netCDF/foo'. If the variable is renamed, the dataset would be renamed.

The semantics of netCDF variable names may require auxiliary data structures, e.g., to keep track of the order in which variables are defined, and to cycle through them in that order.
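One possible shape for such an auxiliary structure is an ordered index kept beside the datasets. The sketch below is in-memory only and the class name VariableIndex is hypothetical; it relies on the netCDF3 C interface convention that variable IDs are small integers assigned in definition order. How the index would be persisted (for example as a table under /netCDF.meta) is left open.

```python
class VariableIndex:
    """Hypothetical index of netCDF variables in definition order."""

    def __init__(self, group="/netCDF"):
        self._group = group
        self._names = []  # list position doubles as the variable id

    def define(self, name):
        """Record a new variable and return its id."""
        if name in self._names:
            raise ValueError(f"variable {name!r} is already defined")
        self._names.append(name)
        return len(self._names) - 1

    def rename(self, old, new):
        """Rename a variable; its id and position are unchanged."""
        if new in self._names:
            raise ValueError(f"variable {new!r} is already defined")
        varid = self._names.index(old)  # raises ValueError if unknown
        self._names[varid] = new
        return varid

    def path(self, varid):
        """HDF5 path of the dataset that holds this variable."""
        return f"{self._group}/{self._names[varid]}"

    def __iter__(self):
        # Iterates in definition order, independent of HDF5 link order.
        return iter(self._names)
```

Keeping the order in a separate index means a rename changes the HDF5 path but not the position of the variable when an application cycles through variables.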

2. Dimension scales

HDF5 has not completed a design for dimension scales. The ultimate implementation will support netCDF dimensions.

However, as in the case of variables, the netCDF semantics of dimension names may require additional data structures.

3. Datatypes

HDF5 datatypes are a superset of netCDF data types, so it will be necessary to define which HDF5 datatypes netCDF4 will use.

HDF5 can store the data in native format or a standard layout. netCDF3 specifies a single storage format. It will be necessary to specify what netCDF4 should do.
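To make the choice concrete, the sketch below lists one candidate mapping from the six netCDF3 external types to HDF5 predefined datatype names, once for a standard big-endian layout (the byte order used by the netCDF3 file format) and once for native storage. The table and the function name are assumptions for discussion, not a decision; in particular, NC_CHAR could instead be stored as an HDF5 string type.

```python
# netCDF3 type -> (standard big-endian HDF5 type, native HDF5 type).
# The HDF5 names are predefined datatype identifiers; the pairing itself
# is only a suggestion.
NC_TO_HDF5 = {
    "NC_BYTE":   ("H5T_STD_I8BE",   "H5T_NATIVE_SCHAR"),
    # Assumption: characters stored as unsigned bytes, not H5T_C_S1 strings.
    "NC_CHAR":   ("H5T_STD_U8BE",   "H5T_NATIVE_UCHAR"),
    "NC_SHORT":  ("H5T_STD_I16BE",  "H5T_NATIVE_SHORT"),
    # Note: the width of H5T_NATIVE_INT is platform dependent.
    "NC_INT":    ("H5T_STD_I32BE",  "H5T_NATIVE_INT"),
    "NC_FLOAT":  ("H5T_IEEE_F32BE", "H5T_NATIVE_FLOAT"),
    "NC_DOUBLE": ("H5T_IEEE_F64BE", "H5T_NATIVE_DOUBLE"),
}


def hdf5_type_name(nc_type, native=False):
    """Return the HDF5 datatype name suggested for a netCDF3 type."""
    try:
        standard, native_name = NC_TO_HDF5[nc_type]
    except KeyError:
        raise ValueError(f"unknown netCDF type: {nc_type!r}") from None
    return native_name if native else standard
```

With a fixed standard layout a file reads the same on every platform, while native storage avoids conversion on the machine that wrote the data; the table keeps both options visible until the specification picks one.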

4. Data

HDF5 has a rather different storage model than netCDF, so some of the semantics will need to be defined. In particular, the desired storage for unlimited dimensions needs to be defined.

To support unlimited dimensions, the HDF5 dataset must be chunked. The specification will need to define a default chunking strategy.
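As a starting point for discussion, one simple default is sketched below: use a chunk length of 1 along each unlimited dimension (one record per chunk), the full extent along fixed dimensions, and then halve the slowest-varying fixed dimensions until the chunk fits a target size. The function name, the use of None to mark an unlimited dimension, and the 64 KiB target are all assumptions, not tuned recommendations.

```python
import math


def default_chunk_shape(shape, itemsize, target_bytes=64 * 1024):
    """Suggest a chunk shape for a dataset.

    shape: dimension lengths, with None marking an unlimited dimension.
    itemsize: size of one element in bytes.
    """
    # One record per chunk along unlimited dimensions, full extent elsewhere.
    chunk = [1 if n is None else n for n in shape]

    # Halve dimensions, slowest-varying first, until the chunk is small enough.
    for axis in range(len(chunk)):
        while chunk[axis] > 1 and math.prod(chunk) * itemsize > target_bytes:
            chunk[axis] = (chunk[axis] + 1) // 2
    return tuple(chunk)
```

For a 4-byte variable with dimensions (unlimited, 180, 360), this yields a chunk of (1, 45, 360), which is 64,800 bytes. Appending a record then touches a small, fixed number of chunks.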

Data Access Operations

Reading and writing data from/to a variable maps to equivalent HDF5 operations for datasets. In general, there is a natural correspondence, although the APIs and programming models are different.

Table 2.

netCDF                                      HDF5
put_var..., etc. for partial write, read    H5Dwrite, etc., partial write/read, selections
_FillValue, nc_set_fill                     Dataset creation properties
nc_set_fill (fill behavior)                 Dataset creation properties
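The correspondence for partial access can be illustrated with the index arithmetic alone. The netCDF strided calls take (start, count, stride), while an HDF5 hyperslab selection (H5Sselect_hyperslab) takes (start, stride, count, block). The sketch below does not call either library; the function names are hypothetical and it only rearranges and checks the arguments.

```python
def hyperslab_args(start, count, stride=None):
    """Translate netCDF (start, count, stride) into hyperslab parameters."""
    rank = len(start)
    if stride is None:
        stride = (1,) * rank  # netCDF default: contiguous access
    if len(count) != rank or len(stride) != rank:
        raise ValueError("start, count and stride must have the same rank")
    if any(s < 0 for s in start) or any(c < 0 for c in count):
        raise ValueError("start and count must be non-negative")
    if any(st < 1 for st in stride):
        raise ValueError("stride must be at least 1")
    return {
        "start": tuple(start),
        "stride": tuple(stride),
        "count": tuple(count),
        "block": (1,) * rank,  # single elements, as in netCDF
    }


def required_extent(args):
    """Smallest dataset extent that contains the whole selection."""
    return tuple(
        s + (c - 1) * st + 1 if c > 0 else 0
        for s, st, c in zip(args["start"], args["stride"], args["count"])
    )
```

The second helper matters for unlimited dimensions: before a write, the library would compare the required extent with the current dataset size and extend the dataset if the selection reaches past the end.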

HDF5 supports alternative storage methods (AKA 'file drivers'), e.g., memory files, split files, etc. These will work transparently for netCDF4 if they are specified.

Attributes


NetCDF attributes map naturally to HDF5 attributes. HDF5 has no reserved attributes (e.g., for units), so we may want to add the netCDF conventions to HDF5 as part of the spec.

As in the case of Variables/Datasets, HDF5 supports many more datatypes than netCDF.

Programming model

The netCDF programming model may require storing additional attributes and/or indexes. Whatever persistent global state (global to the library or to the file/dataset) is needed will have to be mapped to some HDF5 object.

Managing IDs

NetCDF and HDF5 both use IDs to reference open objects, but the models are not identical. Also, netCDF has functions to access objects by ID and to discover the IDs of objects. It may be necessary or desirable to create indexes of objects, and it may be necessary to store these in the file. So, even though the IDs and indexes are not part of the stored file per se, there may be some stored objects for netCDF4 library internal data.

'define mode'

HDF5 has no equivalent of the netCDF 'define mode'.  It will be necessary to determine what the behavior of the library must be.
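One candidate behavior, offered only as a sketch, is for the netCDF4 layer to emulate the netCDF3 rules itself: definitions are accepted only in define mode and data access only outside it, with creation of the HDF5 objects deferred until define mode ends. The class below is a hypothetical in-memory model; no HDF5 calls are made, and the lists and dict stand in for objects in the file.

```python
class DefineModeFile:
    """Hypothetical model of netCDF 'define mode' enforced by the library."""

    def __init__(self):
        self.in_define_mode = True  # a newly created netCDF file starts here
        self.pending = []           # variables defined but not yet created
        self.created = []           # stands in for existing HDF5 datasets
        self.data = {}

    def def_var(self, name):
        if not self.in_define_mode:
            raise RuntimeError("not in define mode")
        self.pending.append(name)

    def enddef(self):
        # Point at which the HDF5 datasets would actually be created.
        self.created.extend(self.pending)
        self.pending.clear()
        self.in_define_mode = False

    def redef(self):
        self.in_define_mode = True

    def put_var(self, name, values):
        if self.in_define_mode:
            raise RuntimeError("operation not allowed in define mode")
        if name not in self.created:
            raise KeyError(name)
        self.data[name] = list(values)
```

An alternative would be to create each HDF5 object immediately and treat define mode as a flag only, since HDF5 itself allows objects to be added at any time; which of the two the library should do is exactly the open question.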

Other

This note is incomplete.

This note has not considered extensions to netCDF4.

References


[Yeager 99] Nancy Yeager, Design of NetCDF-H5 Prototype, May 1999. http://hdf.ncsa.uiuc.edu/apps/netcdfh5/design.html