Robert E. McGrath
March 29, 2000
This document describes some of the key features of the HDF5 DTD and some of the design decisions made during its development.
The HDF5 data model is somewhat complex, with a great deal of flexibility and expressive power. The DTD is intended to be able to describe almost any HDF5 file, and to describe most of the details of the file. For this reason, the HDF5 DTD is more complicated than some similar DTDs, such as XDF[7] and netCDF[8].
The DTD is to some extent redundant with the previously published "Data Definition Language" (DDL) and the accompanying "h5dump" and related tools.[2] XML descriptions will contain information similar or identical to that of the dumper DDL. The important difference is that the XML is machine readable, though not necessarily human readable, and that XML is a standard format that can be exchanged with standard tools and other XML languages.
Use Cases
The Use Cases are described in the document "Suggested Use Cases for XML with HDF-5". [3] Briefly, the use cases are:
The HDF5 DTD is designed to support many uses. In some cases, alternative descriptions are provided, e.g., data in the file can be represented by a pointer to the original file, by a description of the data values themselves, or by both. The DTD defines a formal and machine verifiable syntax which is rigorously enforced by validating XML tools. This guarantees that the producer and consumer can exchange the description. It cannot guarantee that a particular XML description is sufficient for a given use.
An important feature of the HDF5 data model is the Group structure, which allows the HDF5 file to be structured as a rooted directed graph. XML descriptions are restricted to be a tree, so we needed to specially handle cases where objects might have more than one connection, i.e., when the graph is not a tree.
The XML standard does not define numeric types, nor representations for arrays, tables, etc. In the case where we need to describe actual data values (the value of an attribute, or values of an array), we need to define a mark up language. There is no current standard to do this, so we were guided by the best practices we could find. Still, this is an area where our DTD will likely have to evolve in the future.
Description of Datasets (Dataspace and Datatypes, and Attributes)
The HDF5 data model provides a complete and well defined description for most kinds of scientific data. The DTD follows the HDF5 model in a simple and clear way. An HDF5 Dataset object is described by an XML <Dataset> element, and each <Dataset> has a <Dataspace> and a <Datatype>, as in HDF5.
There were two challenging details. First, HDF5 has a very elaborate model of types, including arbitrary "compound datatypes" (i.e., structured records with heterogeneous components) as well as a completely general model of number representation. Expressing this in XML was easy, if somewhat elaborate. It should be noted that we made some seemingly arbitrary decisions about how to express the attributes of a datatype: sometimes an XML element is used and sometimes an XML attribute is used.
The second detail was how to handle the data values of an attribute or dataset. This is discussed below.
Description of the Structure (Groups)
An HDF5 file is a rooted directed graph, with at least one Group, "/". Some files are very simple, containing a few datasets, all in the root group. Other files have elaborate grouping structures, organizing the objects as a tree or graph. Objects can be shared, i.e., they can be members of more than one group. In this case, the graph is not a tree, because some objects have more than one parent. It is also possible for Groups to directly or indirectly contain an ancestor. In other words, the graph can have a loop in it.
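The three cases described above (a tree, a graph with a shared object, and a graph with a loop) can be made concrete with a small sketch. The dictionary layout and the function name classify are illustrative assumptions made for this example; they are not part of HDF5 or of the DTD.

```python
def classify(members, root="/"):
    """Classify a group structure as 'tree', 'shared', or 'loop'.

    `members` maps each group name to the list of objects it contains.
    Names that are not keys of `members` are treated as leaves (e.g. datasets).
    This layout is a hypothetical stand-in for an HDF5 file's group structure.
    """
    # Count how many groups each object is a member of.
    parent_count = {}
    for children in members.values():
        for child in children:
            parent_count[child] = parent_count.get(child, 0) + 1

    # Depth-first search from the root: reaching a node that is still on
    # the current path means a group contains one of its own ancestors.
    on_path, done = set(), set()

    def has_loop(node):
        if node in on_path:
            return True
        if node in done:
            return False
        on_path.add(node)
        found = any(has_loop(child) for child in members.get(node, []))
        on_path.discard(node)
        done.add(node)
        return found

    if has_loop(root):
        return "loop"
    if any(count > 1 for count in parent_count.values()):
        return "shared"
    return "tree"
```

Only the first case can be written as plain nested XML elements; the other two need some additional mechanism.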
XML descriptions are trees, with exactly one root, and objects nested in their parent. XML has no concept of elements which have more than one owner. This forced us to define our own approach to describing the graph structure of the HDF5 file.
First, there is an issue of what is the desired relationship between HDF5 objects and XML elements/objects. It is clear that XML is general enough to describe almost any structure. For example, the "Resource Description Framework" (RDF) can represent complex semantic networks.[10] So the issue is not a lack of expressive power in XML.
The issue is that standard XML software, e.g., SAX parsers and the DOM, naturally creates objects (data structures) which correspond to the elements of the XML description. To the degree that the objects of HDF5 can be mapped to elements of XML, general purpose XML-based software will be presented with an approximation of the semantics of the HDF5 objects, simply from the XML itself. In other words, if the HDF5 objects are mapped naturally to XML elements, general purpose XML tools will understand the structure of the HDF5 file.
In this approach, the difficult problem is how to represent group membership. For a simple HDF5 file in which the objects are structured as a tree, the objects can be represented as elements, and members of a group can be nested in a <Group> element. The XML nesting directly expresses the HDF5 membership in a natural way. But what should be done to represent a more general graph, e.g., where a dataset is a member of two different groups?
One possibility is to represent the structure of the file in a general set notation, with a set of nodes and a set of edges. Each dataset and group is a "node", and the membership is represented as "edges". (There are many variants of this basic approach.) Software can read these two sets and construct the graph. This sort of representation is very natural for many algorithms that manipulate graphs, and can be easily transformed into different data structures. However, standard XML software would have no notion of the meaning of the edges and nodes, nor any clue to the structure of the file.
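A minimal sketch of the node and edge set notation follows. The element and attribute names (File, Nodes, Node, Edges, Edge, id, kind, from, to) are invented for this illustration and do not come from the HDF5 DTD. Note that the XML itself is flat: the group structure only appears after application code has read both sets.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a dataset "d1" that is a member of two groups.
DOC = """
<File>
  <Nodes>
    <Node id="root" kind="Group"/>
    <Node id="g1" kind="Group"/>
    <Node id="g2" kind="Group"/>
    <Node id="d1" kind="Dataset"/>
  </Nodes>
  <Edges>
    <Edge from="root" to="g1"/>
    <Edge from="root" to="g2"/>
    <Edge from="g1" to="d1"/>
    <Edge from="g2" to="d1"/>
  </Edges>
</File>
"""

def build_graph(xml_text):
    """Return (kinds, members) built from the node set and the edge set."""
    root = ET.fromstring(xml_text)
    # Node set: identifier -> kind of HDF5 object.
    kinds = {node.get("id"): node.get("kind") for node in root.iter("Node")}
    # Edge set: each edge records that "to" is a member of "from".
    members = {node_id: [] for node_id in kinds}
    for edge in root.iter("Edge"):
        members[edge.get("from")].append(edge.get("to"))
    return kinds, members

kinds, members = build_graph(DOC)
```

A generic XML tool walking this document sees only two sibling lists, which is the drawback noted above.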
These approaches can be combined, nesting the objects as in XML, with a special "link" or cross reference to represent a second occurrence of the same object. This hybrid approach has the advantage that in simple cases the structure of the XML closely follows the structure of the HDF5 file, while capturing the complex cases when needed.
After considering each alternative in detail, a hybrid approach was chosen. For HDF5 objects that may be shared (Groups, Datasets, Named Datatypes) the XML element is defined to be either a description of the object or a "pointer" to an element that describes the object. A shared object should be described in exactly one element, and all other instances should point to that element.
It should be noted that the XML parser can verify that the "pointer" points to a valid XML element, but not that it points to the correct element, nor that there is only one description of a given HDF5 object. These rules must be enforced by the applications that create and use the XML description.
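The following sketch illustrates the kind of check that is left to applications. The markup is hypothetical: the names DatasetPtr, GroupPtr, NamedDatatype, id, and ref are assumptions made for this example and may differ from the names used in the DTD. The function check_pointers is likewise invented here. It tests whether each pointer resolves to a description of the matching kind. It cannot tell whether two descriptions with different identifiers refer to the same HDF5 object, since that requires knowledge of the file that the XML does not carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names for objects that may be shared.
SHARED = ("Group", "Dataset", "NamedDatatype")

# Dataset "d1" is described once, under g1, and pointed to from g2.
GOOD = """
<Group Name="/" id="root">
  <Group Name="g1" id="g1">
    <Dataset Name="d1" id="d1"/>
  </Group>
  <Group Name="g2" id="g2">
    <DatasetPtr ref="d1"/>
  </Group>
</Group>
"""

def check_pointers(xml_text):
    """Return a list of problems found among descriptions and pointers."""
    root = ET.fromstring(xml_text)
    described = {}  # identifier -> tag of the describing element
    problems = []
    for elem in root.iter():
        if elem.tag in SHARED:
            xid = elem.get("id")
            if xid is None:
                problems.append(f"{elem.tag} has no id")
            elif xid in described:
                problems.append(f"id {xid} is described more than once")
            else:
                described[xid] = elem.tag
    for elem in root.iter():
        if elem.tag.endswith("Ptr"):
            ref = elem.get("ref")
            expected = elem.tag[: -len("Ptr")]
            target = described.get(ref)
            if target is None:
                problems.append(f"{elem.tag} refers to unknown id {ref}")
            elif target != expected:
                problems.append(f"{elem.tag} {ref} points to a {target}")
    return problems
```

Calling check_pointers(GOOD) returns an empty list. A document in which a GroupPtr refers to the identifier of a Dataset could still be accepted by an XML parser if the identifier exists, but it would be reported here.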
The Data Values
While representing metadata with XML was fairly straightforward, it was less obvious what should be done with the data values. For different purposes, it may be better to:
Examination of existing practice shows that there is no outstanding agreement on these issues. This is not surprising, since the choice depends on the requirements of the intended use. Interesting examples of related work include:
We wanted to support as many variations as possible, so our design allows many representations (including omission) for data values.
One point to note about the HDF5 DTD: many of the other approaches (e.g., XDF) include substantial metadata about the shape and type of arrays. This information is provided in great detail by the HDF5 metadata, so our markup of the data values is less elaborate than some other DTDs. On the other hand, certain facts such as the order of the dimensions and elements in the XML description must still be included, because the XML is not required to be laid out in the order that the HDF5 file specified.
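To see why the element order has to be stated, consider a flat list of values written out for a two-dimensional array. The sketch below is only an illustration: the function flat_index and its fastest parameter are hypothetical and do not correspond to anything in the DTD. It computes where a given array element sits in the flat list under two different layouts.

```python
def flat_index(coords, dims, fastest="last"):
    """Offset of the element at `coords` in a flat value list.

    `dims` is the array shape. With fastest="last" the last dimension
    varies fastest (row-major layout); with fastest="first" the first
    dimension varies fastest (column-major layout).
    """
    pairs = list(zip(coords, dims))
    if fastest == "first":
        pairs.reverse()
    offset = 0
    for coord, dim in pairs:
        offset = offset * dim + coord
    return offset

values = [1, 2, 3, 4, 5, 6]   # six values for an array of shape (2, 3)
dims = (2, 3)

row_major = values[flat_index((1, 0), dims, fastest="last")]
col_major = values[flat_index((1, 0), dims, fastest="first")]
```

The same six values give a different element at position (1, 0) depending on the layout, so a reader of the XML needs to be told which one was used.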
[2] DDL in BNF for HDF5. http://hdf.ncsa.uiuc.edu/HDF5/doc/ddl.html
[3] Suggested Use Cases for XML with HDF-5. http://hdf.ncsa.uiuc.edu/HDF5/XML/UseCases/use-cases.html
[4] HDF5 Abstract Data Model. http://hdf.ncsa.uiuc.edu/HDF5/ADM_990506/
[5] VisAD. http://www.ssec.wisc.edu/~billh/visad.html
[6] XSIL: Extensible Scientific Interchange Language. http://www.cacr.caltech.edu/SDA/xsil/
[7] XDF (eXtensible Data Format). http://tarantella.gsfc.nasa.gov/xml/
[8] netCDF. http://hdf.ncsa.uiuc.edu/HDF5/XML/NetCDF/netcdf.dtd
[9] XML-data. http://www.w3.org/TR/1998/NOTE-XML-data/
[10] Resource Description Framework (RDF). http://www.w3.org/RDF/
[11] Scientific Data Management (SDM). http://www-xdiv.lanl.gov/XCI/PROJECTS/SDM