A DTD for HDF5: Design Notes

Robert E. McGrath
March 29, 2000

Introduction

We have developed an XML DTD to describe the contents of an HDF5 file.[1] This DTD specifies a standard for using XML to describe the structure and contents of a single HDF5 file. The DTD can be used in a variety of ways, by standard software and by application-specific software that leverages standard XML features.

This document discusses some of the key features of the HDF5 DTD, and some of the design decisions that were discussed during its development.

The HDF5 data model is somewhat complex, with a great deal of flexibility and expressive power. The DTD is intended to be able to describe almost any HDF5 file, and to describe most of the details of the file. For this reason, the HDF5 DTD is more complicated than some similar DTDs, such as XDF[7] and netCDF[8].

To some extent, the DTD is redundant with the previously published "Data Definition Language" (DDL) and the accompanying "h5dump" and related tools.[2] XML descriptions will contain information similar or identical to that of the dumper DDL. The important difference is that XML is machine readable, though not necessarily human readable, and is a standard format that can be exchanged with standard tools and other XML languages.

Requirements and Use Cases

We wanted the DTD to be useful for a variety of purposes. For this reason, we considered a number of "Use Cases". This analysis showed that there are indeed many different uses for XML, which have different requirements. Our DTD is intended to support as many of these uses as possible.

Use Cases

The Use Cases are described in the document "Suggested Use Cases for XML with HDF-5". [3] Briefly, the use cases are:

These different uses for XML require different information in the XML. For instance, an XML catalog record is intended to be a description of the dataset and its location. This record should have all the attributes, and a pointer to the dataset at a data service, but the data should not be included. By contrast, an XML based validation tool needs to have a complete description of the file, including the data (if present).

The HDF5 DTD is designed to support many uses. In some cases, there are alternative descriptions provided, e.g., data in the file can be represented by a pointer to the original file or by a description of the data values themselves--or both. The DTD defines a formal and machine-verifiable syntax which is rigorously enforced by validating XML tools. This guarantees that the producer and consumer can exchange the description. It cannot guarantee that a particular XML description is sufficient for a given use.

Main Components of the HDF5 DTD

The DTD is intended to describe the structure and contents of an HDF5 file. For the most part, it closely follows the HDF5 data model [4], which defines the shape and data types of datasets and attributes. These descriptions are similar to other general descriptions of scientific data [5, 6, 7, 8, 11], although HDF5 is more general than some.

An important feature of the HDF5 data model is the Group structure, which allows the HDF5 file to be structured as a rooted directed graph. XML descriptions are restricted to trees, so we needed to handle specially the cases where objects have more than one connection, i.e., when the graph is not a tree.

The XML standard does not define numeric types, nor representations for arrays, tables, etc. In the case where we need to describe actual data values (the value of an attribute, or values of an array), we need to define a mark up language. There is no current standard to do this, so we were guided by the best practices we could find. Still, this is an area where our DTD will likely have to evolve in the future.

Description of Datasets (Dataspace and Datatypes, and Attributes)

The HDF5 data model provides a complete and well defined description for most kinds of scientific data. The DTD follows the HDF5 model in a simple and clear way. An HDF5 Dataset object is described by an XML <Dataset> element; each <Dataset> has a <Dataspace> and a <Datatype>, as in HDF5.

There were two challenging details. First, HDF5 has a very elaborate model of types, including arbitrary "compound datatypes" (i.e., structured records with heterogeneous components) as well as a completely general model of number representation. Expressing this in XML was easy, if somewhat elaborate. It should be noted that we made some seemingly arbitrary decisions about how to express the attributes of a datatype: sometimes an XML element is used and sometimes an XML attribute is used.

The second detail was how to handle the data values of an attribute or dataset. This is discussed below.

Description of the Structure (Groups)

An HDF5 file is a rooted directed graph, with at least one Group, "/". Some files are very simple, containing a few datasets, all in the root group. Other files have elaborate grouping structures, organizing the objects as a tree or graph. Objects can be shared, i.e., they can be members of more than one group. In this case, the graph is not a tree, because some objects have more than one parent. It is also possible for Groups to directly or indirectly contain an ancestor. In other words, the graph can have a loop in it.
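The three situations described above (a tree, a graph with shared objects, and a graph with a loop) can be told apart with a few lines of code. The following is a minimal sketch, assuming the group membership has already been read into a dictionary that maps each group name to the names of its members; the dictionary layout, the object names, and the helper functions are illustrative assumptions, not part of HDF5 or the DTD.

```python
def shared_objects(members):
    """Return the objects that are members of more than one group."""
    counts = {}
    for children in members.values():
        for child in children:
            counts[child] = counts.get(child, 0) + 1
    return sorted(name for name, n in counts.items() if n > 1)


def has_loop(members, root="/"):
    """Return True if a group directly or indirectly contains an ancestor."""
    on_path, done = set(), set()

    def visit(node):
        if node in on_path:      # reached a group that is still being expanded
            return True
        if node in done:         # already fully explored, no loop below it
            return False
        on_path.add(node)
        found = any(visit(child) for child in members.get(node, []))
        on_path.discard(node)
        done.add(node)
        return found

    return visit(root)


# Hypothetical files: group name -> member names.
tree = {"/": ["a", "b"], "a": ["d1"], "b": ["d2"]}
shared = {"/": ["a", "b"], "a": ["d1"], "b": ["d1"]}   # d1 has two parents
looped = {"/": ["a"], "a": ["b"], "b": ["a"]}          # b contains its ancestor a
```

Only the first case can be written as plain nested XML without some additional mechanism for the second occurrence of an object.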

XML descriptions are trees, with exactly one root, and objects nested in their parent. XML has no concept of elements which have more than one owner. This forced us to define our own approach to describing the graph structure of the HDF5 file.

First, there is an issue of what is the desired relationship between HDF5 objects and XML elements/objects. It is clear that XML is general enough to describe almost any structure. For example, the "Resource Description Framework" (RDF) can represent complex semantic networks.[10] So the issue is not a lack of expressive power in XML.

The issue is that standard XML software, e.g., SAX parsers and the DOM, naturally creates objects (data structures) which correspond to the elements of the XML description. To the degree that the objects of HDF5 can be mapped to elements of XML, general-purpose XML-based software will be presented with an approximation of the semantics of the HDF5 objects, simply from the XML itself. In other words, the HDF5 objects are mapped naturally to XML elements, and general-purpose XML tools will understand the structure of the HDF5 file.

In this approach, the difficult problem is how to represent group membership. For a simple HDF5 file in which the objects are structured as a tree, the objects can be represented as elements, and members of a group can be nested in a <Group> element. The XML nesting directly expresses the HDF5 membership in a natural way. But what should be done to represent a more general graph, e.g., where a dataset is a member of two different groups?

One possibility is to represent the structure of the file in a general set notation, with a set of nodes and a set of edges. Each dataset and group is a "node", and the membership is represented as "edges". (There are many variants of this basic approach.) Software can read these two sets and construct the graph. This sort of representation is very natural for many algorithms that manipulate graphs, and can be easily transformed into different data structures. However, standard XML software would have no notion of the meaning of the edges and nodes, nor any clue to the structure of the file.
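To make the set notation concrete, here is a minimal sketch of the node-and-edge alternative, assuming the two sets have already been extracted from some flat description. The object names and the helper functions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical flat description: every object is a node, membership is an edge.
nodes = {"/": "Group", "/g1": "Group", "/g2": "Group", "/d1": "Dataset"}
edges = [("/", "/g1"), ("/", "/g2"), ("/g1", "/d1"), ("/g2", "/d1")]


def build_graph(nodes, edges):
    """Return a mapping parent -> sorted children, rejecting unknown nodes."""
    children = defaultdict(list)
    for parent, child in edges:
        if parent not in nodes or child not in nodes:
            raise ValueError(f"edge refers to unknown node: {parent!r} -> {child!r}")
        children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


def parents_of(edges, name):
    """Return every group that lists `name` as a member."""
    return sorted(parent for parent, child in edges if child == name)


graph = build_graph(nodes, edges)
```

The shared dataset needs no special treatment here, since it simply appears in two edges. The cost is that the file structure is visible only after the application has assembled the graph.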

These approaches can be combined, nesting the objects as in XML, with a special "link" or cross reference to represent a second occurrence of the same object. This hybrid approach has the advantage that in simple cases the structure of the XML closely follows the structure of the HDF5 file, while capturing the complex cases when needed.

After considering each alternative in detail, a hybrid approach was chosen. For HDF5 objects that may be shared (Groups, Datasets, Named Datatypes) the XML element is defined to be either a description of the object or a "pointer" to an element that describes the object. A shared object should be described in exactly one element, and all other instances should point to that element.

It should be noted that the XML parser can verify that the "pointer" points to a valid XML element, but not that it points to the correct element, nor that there is only one description of a given HDF5 object. These rules must be enforced by the applications that create and use the XML description.
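The hybrid approach, and the checks it leaves to the application, might be sketched as follows. The element and attribute names used here (File, DatasetPtr, id, ref) are assumptions made for illustration and are not taken from the DTD. Python's xml.etree.ElementTree is a non-validating parser, so even the pointer-target check that a validating parser would perform is done in code in this sketch.

```python
import xml.etree.ElementTree as ET

# Illustrative document: d1 is described once under g1 and pointed to from g2.
DOC = """
<File>
  <Group id="root" name="/">
    <Group id="g1" name="g1">
      <Dataset id="d1" name="d1"/>
    </Group>
    <Group id="g2" name="g2">
      <DatasetPtr ref="d1"/>
    </Group>
  </Group>
</File>
"""


def membership(xml_text):
    """Return sorted (group id, member id) pairs, resolving pointer elements."""
    root = ET.fromstring(xml_text)

    # Each shared object must be described by exactly one element.
    described = {}
    for element in root.iter():
        obj_id = element.get("id")
        if obj_id is None:
            continue
        if obj_id in described:
            raise ValueError(f"object described more than once: {obj_id}")
        described[obj_id] = element

    edges = []

    def walk(group):
        for child in group:
            if child.tag.endswith("Ptr"):
                target = described.get(child.get("ref"))
                if target is None:
                    raise ValueError(f"dangling pointer: {child.get('ref')}")
                if target.tag != child.tag[:-3]:
                    raise ValueError(f"pointer targets wrong kind: {target.tag}")
                edges.append((group.get("id"), target.get("id")))
            else:
                edges.append((group.get("id"), child.get("id")))
                if child.tag == "Group":
                    walk(child)   # pointers are not followed, so loops terminate

    walk(root.find("Group"))
    return sorted(edges)
```

Whether a pointer refers to the intended object, rather than merely to an existing one of the right kind, still cannot be checked from the XML alone.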

The Data Values

While representing metadata with XML was fairly straightforward, it was less obvious what should be done with the data values. For different purposes, it may be better to:

A second design choice is whether to mark up the data elements or include data as a single block of undelimited text. For example, the values of a two-dimensional array could be included either as a single block of values, or tagged with XML elements for each row, or tagged individually for each row and column.
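The three levels of markup mentioned above can be compared side by side. This is a minimal sketch, assuming hypothetical element names (Data, Row, Value) that are not taken from the DTD.

```python
import xml.etree.ElementTree as ET


def as_block(rows):
    """All values in one undelimited block of text."""
    data = ET.Element("Data")
    data.text = " ".join(str(v) for row in rows for v in row)
    return data


def as_rows(rows):
    """One element per row, values as text inside it."""
    data = ET.Element("Data")
    for row in rows:
        ET.SubElement(data, "Row").text = " ".join(str(v) for v in row)
    return data


def as_cells(rows):
    """One element per row and one element per value."""
    data = ET.Element("Data")
    for row in rows:
        row_el = ET.SubElement(data, "Row")
        for v in row:
            ET.SubElement(row_el, "Value").text = str(v)
    return data


values = [[1, 2, 3], [4, 5, 6]]
block_xml = ET.tostring(as_block(values), encoding="unicode")
rows_xml = ET.tostring(as_rows(values), encoding="unicode")
cells_xml = ET.tostring(as_cells(values), encoding="unicode")
```

In the block form the shape of the array cannot be recovered from the values alone and has to come from the accompanying metadata. The fully tagged form carries the shape in the markup but is considerably larger.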

Examination of existing practice shows that there is no outstanding agreement on these issues. This is not surprising, since the choice depends on the requirements of the intended use. Interesting examples of related work include:

We wanted to support as many variations as possible, so our design allows many representations (including omission) for data values.

One point to note about the HDF5 DTD: many of the other approaches (e.g., XDF) include substantial metadata about the shape and type of arrays. This information is provided in great detail by the HDF5 metadata, so our markup of the data values is less elaborate than some other DTDs. On the other hand, certain facts such as the order of the dimensions and elements in the XML description must still be included, because the XML is not required to be laid out in the order that the HDF5 file specified.

References

[1] HDF5 DTD
http://hdf.ncsa.uiuc.edu/HDF5/XML

[2] DDL in BNF for HDF5
http://hdf.ncsa.uiuc.edu/HDF5/doc/ddl.html

[3] Suggested Use Cases for XML with HDF-5
http://hdf.ncsa.uiuc.edu/HDF5/XML/UseCases/use-cases.html

[4] HDF5 Abstract Data Model
http://hdf.ncsa.uiuc.edu/HDF5/ADM_990506/

[5] VisAD
http://www.ssec.wisc.edu/~billh/visad.html

[6] XSIL: Extensible Scientific Interchange Language
http://www.cacr.caltech.edu/SDA/xsil/

[7] XDF (eXtensible Data Format)
http://tarantella.gsfc.nasa.gov/xml/

[8] netCDF
http://hdf.ncsa.uiuc.edu/HDF5/XML/NetCDF/netcdf.dtd

[9] XML-data
http://www.w3.org/TR/1998/NOTE-XML-data/

[10] Resource Description Framework (RDF)
http://www.w3.org/RDF/

[11] Scientific Data Management (SDM)
http://www-xdiv.lanl.gov/XCI/PROJECTS/SDM