Standard HDF5 file XML DTD: HDF5-File.dtd

The XML DTD for HDF5: Design Notes

June 12, 2000
Feedback to: hdfdev@ncsa.uiuc.edu

1. Introduction

The XML "Document Type Definition" (DTD) [17] for HDF5 defines a markup language for describing the contents of an HDF5 file.[1] It specifies a standard way to use XML to describe the structure and contents of a single HDF5 file. The DTD can be used in a variety of ways, by standard software and by application-specific software that builds on standard XML features. It will also allow descriptions of HDF5 files to be used with, and translated to, other similar XML markup languages.

This document discusses some of the key features of the HDF5 DTD, and some of the design decisions that were considered during its development.

The HDF5 data model is somewhat complex, with a great deal of flexibility and expressive power. The DTD is intended to be able to describe almost any HDF5 file, and to describe most of the details of the file. For these reasons, the HDF5 DTD is more complicated than some similar DTDs, such as XDF[7] and netCDF[8].

The DTD is to some extent redundant with the previously published "Data Definition Language" and the accompanying "h5dump" and related tools.[2] XML descriptions will contain information similar or identical to that in the dumper DDL. The important difference is that the XML is machine readable, though not necessarily human readable, and that XML is a standard format that can be exchanged with standard tools and other XML languages.

The DTD defines a formal and machine verifiable syntax which is rigorously enforced by validating XML tools. This guarantees that the producer and consumer can exchange the description. The rules of XML guarantee that the description is syntactically correct and follows the grammar defined in the DTD. However, XML cannot assure that a particular XML description is a correct description of the HDF5 file, or even that it follows all the semantic rules of HDF5. For example, validation can assure that every Dataset element belongs to at least one enclosing Group element, but cannot assure that the Dataset is in the correct Group, or that the Dataset has the correct name, type, etc. The overall correctness of the XML description must be assured by the tools that generate the XML.
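
As a minimal sketch of this distinction, the Python fragment below parses a description that is well-formed XML but breaks an HDF5 rule: two members of one group share a name. The element names follow this document; the "Name" attribute and the function name are assumptions made for illustration, not part of the DTD. Python's ElementTree does not validate against a DTD, so the sketch only illustrates the kind of semantic check that is left to applications.

```python
import xml.etree.ElementTree as ET

# Hypothetical description: the "Name" attribute is an assumption of this sketch.
DESCRIPTION = """
<Group Name="/">
  <Dataset Name="temperature"/>
  <Dataset Name="temperature"/>
</Group>
"""

def duplicate_member_names(group):
    """Return the names used by more than one direct member of a group element."""
    seen, duplicates = set(), set()
    for member in group:
        name = member.get("Name")
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return sorted(duplicates)

# Parsing succeeds: the text is well-formed XML.
root = ET.fromstring(DESCRIPTION)

# The semantic problem is only found by an application-level check.
problems = duplicate_member_names(root)
print(problems)
```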

2. Requirements and Use Cases

An important goal is for the DTD to be useful for a variety of purposes. For this reason, we considered a number of "Use Cases". This analysis showed that there are indeed many different uses for XML, which have different requirements. Our DTD is intended to support as many of these uses as possible. In this section, seven use cases are described and discussed.

2.1 Case 1: Viewing Structure and Contents of HDF5 File Using a Web Browser

XML descriptions of HDF5 files will be readable by standard Web browsers. Some standard Web browsers will be able to display the XML directly, and many servers will be able to generate HTML from XML on the fly. We will also be able to construct style sheets to control the rendering of information about HDF5 files.

This use of XML will make it easy to get at least a general view of the contents of an HDF5 file without any special software.

2.2 Case 2: XML as a Catalog Record

One use of XML will be for catalog records, e.g., at a NASA DAAC or similar archive. The contents of HDF-5 files will be described in XML records which will be stored in a database or otherwise served through search services. The XML will be delivered to clients or proxies without the original HDF-5 file. The client will use the records to locate and obtain the datasets they want.

In this use, the XML is likely to be used separately from the original HDF-5 file, and any 'pointers' to the file must have complete URLs and other information, in order to locate the actual dataset.

In general, the purpose of these records is not to deliver all the data, nor to reconstruct the contents of the file from the XML. However, the content of the attributes is likely to be vital. Also, there may be a desire for the records to be comparatively compact, as you might be searching thousands of candidate datasets, and hence might receive thousands of XML descriptions.

It is difficult to know what kinds of searches will be required. However, it seems likely that details such as format version and storage strategy are less likely to be of great interest, compared to searches on the attributes.

2.3 Case 3: XML as an Intermediate Form for Programs

Another use is XML as a machine readable description of HDF-5 files while manipulating the file itself. An HDF5 editor is an example of this kind of use. Here, the XML and HDF-5 file are both accessible to the program.

It is difficult to generalize what applications might intend to do. For purposes of this use case, assume the following:

The XML description of the file is used because the application wants to use standard XML tools and interfaces to manipulate the objects. For instance, standard packages will read XML into DOM objects, which provides not only standard data structures for the XML objects, but also standard interfaces for manipulating the tree (insert, delete, etc.). There are already standard editors for XML trees.

The notion of this use case is that you would build tools for HDF-5 by extending such standard XML functions. The main trick will be to keep the XML and the HDF-5 correlated. When an XML object is created or changed, the tool will perform the equivalent operation on the file.
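
The idea can be sketched with Python's standard DOM implementation (xml.dom.minidom). Every change to the XML tree is paired with a record of the equivalent file operation. The "Name" attribute, the helper functions, and the "pending_ops" list are hypothetical; a real tool would call the HDF5 library where this sketch only appends to a list.

```python
from xml.dom import minidom

# Hypothetical description; the "Name" attribute is an assumption.
doc = minidom.parseString('<Group Name="/"><Dataset Name="a"/></Group>')
root = doc.documentElement

# Stand-in for the HDF5 file: operations a tool would replay against it.
pending_ops = []

def add_dataset(group, name):
    """Insert a Dataset element and record the matching file operation."""
    node = doc.createElement("Dataset")
    node.setAttribute("Name", name)
    group.appendChild(node)
    pending_ops.append(("create_dataset", name))
    return node

def remove_member(group, node):
    """Delete a member element and record the matching file operation."""
    group.removeChild(node)
    pending_ops.append(("delete", node.getAttribute("Name")))

first = root.getElementsByTagName("Dataset")[0]
add_dataset(root, "b")
remove_member(root, first)

names = [n.getAttribute("Name") for n in root.getElementsByTagName("Dataset")]
print(names, pending_ops)
```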

Thus, in this case, it is very important that the XML closely follow the structure of the HDF-5 file, and that this correspondence be maintained as either one changes. This contrasts with Case 2 (a catalog), where the XML might be generated once and possibly used many times without ever accessing the file. Also, we definitely want the XML to point to the objects in the file, whether the data is in the XML or not.

Depending on the application, many details of the file may be needed, certainly including things like storage strategy. However, since the file is available, these things can be obtained from the file rather than XML. This means that much of the detail could be optional for the XML.

2.4 Case 4: Generation, Validation, and Reconstruction of HDF-5

Another use of XML is as a tool for validating, comparing, or generating HDF-5 files. We have proposed tools for checking, correcting, and diff-ing HDF-5 files, which might use XML as a canonical description of the file. Similarly, an 'h5gen' utility might use XML as the template to create HDF-5 files.

These applications need to be able to represent essentially everything about the HDF-5 file. In the case of a validator or diff-er, even boot block information is important.

Also, it may well be the case that the data must be included in the XML, either because the HDF-5 file is not available, or because it must be arranged in a canonical form for comparison, e.g., to confirm that two files have the same contents.

While it is necessary for everything (or "everything important") to be in the XML, it is not necessary that the XML representation itself follows all of the rules of HDF-5. For instance, it is not required that the XML objects are in the same order as the HDF-5 objects (if such can even be determined), or that storage offsets in the HDF5 file are faithfully represented in the XML.
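
One way a diff-er might tolerate differences in object order is to reduce each description to a canonical form before comparing. The following Python sketch does this by sorting sibling elements. The function name and the "Name" attribute are assumptions; note that a real tool would have to preserve order wherever it carries meaning, for example among elements that hold data values.

```python
import xml.etree.ElementTree as ET

def canonical(element):
    """Return a nested tuple that ignores the order of sibling elements."""
    children = sorted(canonical(child) for child in element)
    attributes = tuple(sorted(element.attrib.items()))
    text = (element.text or "").strip()
    return (element.tag, attributes, text, tuple(children))

# Hypothetical descriptions of the same file, with members in different order.
first = '<Group Name="/"><Dataset Name="a"/><Dataset Name="b"/></Group>'
second = '<Group Name="/"><Dataset Name="b"/><Dataset Name="a"/></Group>'
third = '<Group Name="/"><Dataset Name="a"/><Dataset Name="c"/></Group>'

same = canonical(ET.fromstring(first)) == canonical(ET.fromstring(second))
different = canonical(ET.fromstring(first)) != canonical(ET.fromstring(third))
print(same, different)
```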

2.5 Case 5: XML as Intermediate to Other Formal Languages and File Formats

XML is ideally suited for automatic transformation into various formal languages, either directly or via additional XML languages. For example, an XML description of an HDF5 file could be transformed into ODL.[13] Similarly, XML can be transformed to other XML languages, such as XDF[7].

XML may also be a good intermediate language for translating between file formats. For example, the XML description of HDF5 could be transformed into the XML description for netCDF, and then the data could be written as netCDF[8].

It is likely that there will be "hub" languages, such as XDF, that are very general languages for data. Translating from HDF5-XML to XDF will lose information, but will then make the data translatable to any other format that can be mapped to XDF. Similarly, data could be imported to HDF5 from any format that can be translated to XDF, albeit with some loss of information.

It should also be noted that an XML description of HDF5 could be used to transform or translate individual objects from a file. For example, an HDF5 file might contain several datasets, one of which can be mapped to an OGIS gridded map. In this case, software could read the XML, locate the datasets that can be handled, and translate them to OGIS GML.[16] In this way, similar kinds of data can be made to work together regardless of storage format, and without requiring that the entire file be limited to a particular kind or format of data. This would be a very powerful tool for sharing data.

2.6 Case 6: Store XML in Archive or in Dataset as Machine Readable Documentation

The XML description of an HDF5 file is a promising candidate to be a machine readable format to be stored in archives. The XML would likely be interpretable in the future, and could be mapped to whatever technology is available.

In this scenario, the XML should contain sufficient information to access and translate the data if necessary.

One variation of this theme is to store the descriptions of the files in a repository, while the files may reside in some storage media. Or the XML might be stored in the file itself, as a machine readable table of contents.

2.7 Case 7: Templates, Skeleton Files, etc.

XML can be used as a medium for creating templates or skeletons for HDF5 files. For example, the skeleton of a data product could be defined in XML, and read by software to produce the file and then fill in the specific values. This is a very useful tool for standardization. This is very similar to how the HCR tools for HDF-EOS worked.[12]

It might also be possible to have XML templates for parts of HDF5 files, which can be composed to form datasets. For instance, there could be a library of XML templates for storing gridded data of various kinds, which would be coordinated with software to efficiently store and retrieve the data. A user could compose a data product by selecting appropriate templates to construct the dataset. This could also provide code modules to create and read the dataset.

2.8 Implications

These different use cases for XML require different (and sometimes conflicting) information in the XML. For instance, an XML catalog record is intended to be a description of the dataset and its location. This record should be compact, and should have all the attributes, and a pointer to the dataset at a data service, but the data should not be included. By contrast, an XML based validation tool needs to have a complete description of the file, including the data (if present).

The HDF5 DTD is designed to support many uses. In some cases, there are alternative descriptions provided, e.g., data in the file can be represented by a pointer to the original file or by a description of the data values themselves--or both.

3. Main Components of the HDF5 DTD

The HDF5 DTD is intended to describe the structure and contents of an HDF5 file. For the most part, the DTD closely follows the HDF5 data model, as described in [2, 3, 4]. The HDF5 data model defines the shape and data types of datasets and attributes. These descriptions are similar to other general descriptions of scientific data [5, 6, 7, 8, 11], although HDF5 is more general than some of these. The description of the HDF5 objects is discussed in Section 3.1.

An important feature of the HDF5 data model is the Group structure, which allows the HDF5 file to be structured as a rooted directed graph, analogous to a Unix file system. In the HDF5 file, objects may be shared, and it is possible for an object to be a parent of its own ancestor (i.e., the graph may have loops). In other words, the structure of the HDF5 file is not limited to a tree. In contrast, XML descriptions are restricted to trees, so it was necessary to map the directed graph of HDF5 onto a tree of XML elements. This is discussed in Section 3.2.

The XML standard does not define numeric types, nor representations for arrays, tables, etc. In the case where it is necessary to describe actual data values (the value of an attribute, or values of an array), there is no current standard to follow, so we were guided by the best practices we could find. Still, this is an area where our DTD must evolve in the future. These issues are discussed in Section 3.3.

Finally, the DTD needs to support the ability to describe an HDF5 file in detail. This description must be able to include storage properties, compression properties, and the like. The DTD defines optional elements for this information. These are described in Section 3.4.

3.1 Description of Datasets (Dataspace and Datatypes, and Attributes)

The HDF5 data model provides a complete and well defined description for most kinds of scientific data. The DTD follows the HDF5 model in a simple and clear way. An HDF5 Dataset object is described by an XML <Dataset> element; each <Dataset> has a <Dataspace> and a <Datatype> element, corresponding to the HDF5 model.

HDF5 has a very elaborate model of types, including arbitrary "compound datatypes" (i.e., structured records with heterogeneous components) as well as a completely general model of number representation. Expressing this in XML was easy, if somewhat elaborate. It should be noted that we made some seemingly arbitrary decisions about how to express the attributes of a datatype: sometimes an XML element is used and sometimes an XML attribute is used.

One point to note is that the XML describes the structure and properties of the HDF5 objects, not of the XML elements. The <Datatype> and <Dataspace> elements describe the data in the HDF5 file, not the layout of the data in the XML file.

3.2 Description of the Structure (Groups)

An HDF5 file is a rooted directed graph, with at least one Group, "/". Some files are very simple, containing a few datasets, all in the root group. Other files have elaborate grouping structures, organizing the objects as a tree or graph. Objects can be shared, i.e., they can be members of more than one group. In this case, the graph is not a tree, because some objects have more than one parent. It is also possible for Groups to directly or indirectly contain an ancestor. In other words, the graph can have a loop in it.

XML descriptions are trees, with exactly one root, and objects nested in their parent. XML has no concept of elements which have more than one owner. This raised the issue of how to map the graph structure of the HDF5 file to a tree of XML elements.

First, there is an issue of what is the desired relationship between HDF5 objects and XML elements/objects. It is clear that XML is general enough to describe almost any structure. For example, the "Resource Description Framework" (RDF) can represent complex semantic networks.[10] So the issue is not a lack of expressive power in XML.

The issue here is that standard XML software, e.g., SAX parsers [14] and the DOM [15], naturally create objects (data structures) which correspond to the elements of the XML description. To the degree that the objects of HDF5 can be mapped to elements of XML, then general purpose XML-based software will be presented with an approximation of the semantics of the HDF5 objects, simply from the XML itself. In other words, the HDF5 objects are mapped naturally to XML elements, and general purpose XML tools will approximately understand the structure of the HDF5.

In this approach, the difficult problem is how to represent group membership. For a simple HDF5 file in which the objects are structured as a tree, the objects can be represented as elements, and members of a group can be nested in a <Group> element. The XML nesting directly expresses the HDF5 membership in a natural way. But what should be done to represent a more general graph, e.g., where a dataset is a member of two different groups?

One possibility is to represent the structure of the file in a general set notation, with a set of nodes (vertices) and a set of arcs (edges). Each dataset and group is a "node", and the membership is represented as "arcs". There are many variants of this basic approach, and it is easy to develop software that can read these two sets and construct the graph. This sort of representation is very natural for many algorithms that manipulate graphs, and can be easily transformed into different data structures. However, standard XML software would have no notion of the meaning of the edges and vertices, nor any clue to the structure of the file.
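
A minimal Python sketch of the node-and-arc representation follows. The object labels and function names are invented for illustration. It builds the graph from a list of arcs, finds shared objects (those with more than one parent), and reports whether a loop can be reached from a given node.

```python
from collections import defaultdict

# Hypothetical file: dataset "d1" is shared by two groups, and "g2" contains
# the root group, which forms a loop.
nodes = {"root", "g1", "g2", "d1"}
arcs = [("root", "g1"), ("root", "g2"), ("g1", "d1"), ("g2", "d1"), ("g2", "root")]

def build_children(arcs):
    """Map each parent to the list of its members."""
    children = defaultdict(list)
    for parent, child in arcs:
        children[parent].append(child)
    return children

def shared_objects(arcs):
    """Return the objects that are members of more than one group."""
    parents = defaultdict(set)
    for parent, child in arcs:
        parents[child].add(parent)
    return sorted(node for node, owners in parents.items() if len(owners) > 1)

def has_loop(children, start):
    """Depth-first search: report whether a loop is reachable from start."""
    visiting, done = set(), set()

    def visit(node):
        if node in visiting:
            return True          # reached a node on the current path again
        if node in done:
            return False         # already explored, no loop found through it
        visiting.add(node)
        found = any(visit(child) for child in children.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return visit(start)

print(shared_objects(arcs), has_loop(build_children(arcs), "root"))
```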

These approaches can be combined, nesting the objects as in XML, with a special "link" or cross reference to represent a second occurrence of the same object. This hybrid approach has the advantage that in simple cases the structure of the XML closely follows the structure of the HDF5 file, while capturing the complex cases when needed.

After considering each alternative in detail, a hybrid approach was chosen. For HDF5 objects that may be shared (Groups, Datasets, Named Datatypes) the XML element is defined to be either a description of the object or a "pointer" to an element that describes the object. A shared object should be described in exactly one element, and all other instances should point to that element.

It should be noted that the XML parser can verify that the "pointer" points to a valid XML element, but not that it points to the correct element, nor that there is only one description of a given HDF5 object. For instance, XML can confirm that a "pointer" has a reference to exactly one element (HDF5 object), but that object could be any valid XML element, including the link itself or any of the other elements of the DTD. Such a reference makes no sense according to the rules of HDF5, and those rules must be enforced by the applications that create and use the XML description.
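
The kind of application-level check this implies might look like the Python sketch below. The "id" and "ref" attribute names, and the convention that an element carrying "ref" is a pointer, are assumptions made for this example and are not taken from the DTD. Because ElementTree does not validate, the sketch also checks that each target exists.

```python
import xml.etree.ElementTree as ET

# Hypothetical description: "id" marks a description, "ref" marks a pointer.
DESCRIPTION = """
<Group id="root">
  <Group id="g1">
    <Dataset id="d1"/>
  </Group>
  <Group id="g2">
    <Dataset ref="d1"/>
    <Dataset ref="g1"/>
    <Dataset ref="missing"/>
  </Group>
</Group>
"""

def check_pointers(root):
    """Return a list of pointer problems that XML validation would not catch."""
    described, errors = {}, []
    # First pass: collect descriptions and flag repeated identifiers.
    for element in root.iter():
        ident = element.get("id")
        if ident is None:
            continue
        if ident in described:
            errors.append(f"{ident}: described more than once")
        described[ident] = element
    # Second pass: every pointer must name a description of the same kind.
    for element in root.iter():
        target_id = element.get("ref")
        if target_id is None:
            continue
        target = described.get(target_id)
        if target is None:
            errors.append(f"{target_id}: no such description")
        elif target.tag != element.tag:
            errors.append(f"{target_id}: {element.tag} pointer to a {target.tag}")
    return errors

errors = check_pointers(ET.fromstring(DESCRIPTION))
print(errors)
```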

3.3 The Data Values

While representing metadata with XML was fairly straightforward, it was less obvious what should be done with the data values. For different purposes, it may be better to:

include the data values as formatted text, e.g., "-7.5"
include the data values in some text encoded form, e.g., binhex
omit the data values and point to an external XML file
omit the data values and point to the HDF5 file
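
To make the first two options concrete, the Python sketch below writes the same values once as formatted text and once in a text encoded binary form. Base64 is used here in place of binhex only because it is readily available in the Python standard library; the choice of encoding, the little-endian 64-bit layout, and the variable names are all assumptions of this example.

```python
import base64
import struct

values = [-7.5, 0.25, 3.0]

# Option 1: formatted text, readable but larger and subject to parsing rules.
as_text = " ".join(repr(v) for v in values)

# Option 2: pack as little-endian 64-bit floats, then encode the bytes as text.
layout = f"<{len(values)}d"
packed = struct.pack(layout, *values)
as_base64 = base64.b64encode(packed).decode("ascii")

# Decoding recovers the values exactly.
decoded = list(struct.unpack(layout, base64.b64decode(as_base64)))

print(as_text)
print(as_base64)
```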

A second design choice is whether to mark up the data elements or include data as a single block of undelimited text. For example, the values of a two dimensional array could be included either as a single block of values, or tagged with XML elements for each row, or tagged individually for each row and column.
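
The three levels of markup can be sketched in Python as follows. The <Row> and <Value> element names and the content given to <Data> are hypothetical and chosen only for illustration; they are not defined by the DTD.

```python
import xml.etree.ElementTree as ET

array = [[1, 2, 3], [4, 5, 6]]

def as_block(rows):
    """All values in a single block of undelimited text."""
    data = ET.Element("Data")
    data.text = " ".join(str(v) for row in rows for v in row)
    return data

def by_row(rows):
    """One element per row."""
    data = ET.Element("Data")
    for row in rows:
        ET.SubElement(data, "Row").text = " ".join(str(v) for v in row)
    return data

def by_value(rows):
    """One element per row and one per value within the row."""
    data = ET.Element("Data")
    for row in rows:
        row_element = ET.SubElement(data, "Row")
        for v in row:
            ET.SubElement(row_element, "Value").text = str(v)
    return data

for build in (as_block, by_row, by_value):
    print(ET.tostring(build(array), encoding="unicode"))
```

The finer the markup, the more structure generic XML tools can see, at the cost of a much larger description.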

Examination of existing practice shows that there is no outstanding agreement on these issues. This is not surprising, since the choice depends on the requirements of the intended use. Interesting examples of related work include:

XSIL [6]
XDF [7]
netCDF DTD [8]
XML-Data [9]

We wanted to support as many variations as possible, so our design allows many representations (including omission) for data values.

One point to note about the HDF5 DTD: many of the other approaches (e.g., XDF) include substantial metadata about the shape and type of arrays. This information is provided in great detail by the HDF5 metadata, so our markup of the data values is less elaborate than some other DTDs. On the other hand, certain facts such as the order of the dimensions and elements in the XML description must still be included, because the XML is not required to be laid out in the order that the HDF5 file specified.

We were not able to create a satisfactory markup for data in the HDF5 file for the first release. The initial version of the DTD has a limited <Data> element, which does not support all the desired features. This will be revised in a future release.

3.4 File Format Details

The DTD must be able to support applications that need to fully describe the details of a specific HDF5 file. For example, in order to verify the correctness of a specific dataset in an archive, it may be necessary to confirm the storage layout and compression parameters are correct, as well as the structure, attributes, and data values.

For these applications, optional elements are included in the DTD, including:

<UserBlock> and <BootBlock> (sic), which are described in the HDF5 specification [3]
<StorageLayout>, which describes the organization of a dataset in the file
<Compression>, which describes the compression parameters for a dataset, if applicable.

These elements are only partly defined in the first release of the DTD.