Some Suggested Use Cases for XML with HDF-5

Robert E. McGrath
2-25-00

Introduction

This note discusses some "use cases" for XML with HDF-5. Each case describes how an XML description of an HDF-5 file might be used, along with some analysis of what that use implies for the requirements of the DTD. I made up all of these cases, and I expect others will have additional use cases to consider.

Note that we intend to support all of these uses (and others). This note does not argue for one class of use over another; it only explores the 'space' of uses that we want to support.

The cases are ordered approximately by the amount of 'nit-picky detail' they appear to require about the HDF-5 file.

Case 1: XML as a catalog record

One use of XML will be for catalog records, e.g., at a DAAC or similar archive. The contents of HDF-5 files will be described in XML records, which will be stored in a database or otherwise served through search services. The XML will be delivered to clients or proxies without the original HDF-5 file. Clients will use the records to locate and obtain the datasets they want.

In this use, the XML is likely to be used separately from the original HDF-5 file, and any 'pointers' to the file must have complete URLs and other information, in order to locate the actual dataset.

In general, the purpose of these records is not to deliver all the data, nor to reconstruct the contents of the file from the XML. However, the content of the attributes is likely to be vital. Also, there may be a desire for the records to be comparatively compact, as you might be searching thousands of candidate datasets, and hence might receive thousands of XML descriptions.

It is difficult to know what kinds of searches will be required. However, details like library version and storage strategy are unlikely to be of great interest compared to searches on the attributes.
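
As a minimal sketch of this kind of attribute search, the following Python fragment scans a small set of catalog records and returns the URLs of the matching files. The element names (catalog, record, attribute), the href attribute, the URLs, and the attribute values are all made up for illustration; none of them comes from a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog records. Element names, attribute names, URLs, and
# values are illustrative assumptions, not part of any HDF-5 DTD.
CATALOG_XML = """
<catalog>
  <record href="http://archive.example.org/data/file_a.h5">
    <attribute name="parameter">sea surface temperature</attribute>
    <attribute name="year">1999</attribute>
  </record>
  <record href="http://archive.example.org/data/file_b.h5">
    <attribute name="parameter">vegetation index</attribute>
    <attribute name="year">1999</attribute>
  </record>
</catalog>
"""


def find_records(catalog, name, value):
    """Return the URLs of records that have attribute `name` equal to `value`."""
    urls = []
    for record in catalog.iter("record"):
        for attr in record.iter("attribute"):
            if attr.get("name") == name and attr.text == value:
                urls.append(record.get("href"))
                break  # one match per record is enough
    return urls


catalog = ET.fromstring(CATALOG_XML)
matches = find_records(catalog, "parameter", "vegetation index")
```

Note that the search touches only the attributes and the complete URL; nothing about library version or storage strategy is needed to answer it.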

Case 2: XML as an intermediate form for programs

A second case is using XML as a machine-readable description of HDF-5 files while manipulating the file itself. An HDF-5 editor is an example of this kind of use. Here, the XML and the HDF-5 file are both accessible to the program.

It is difficult to generalize what applications might intend to do. For purposes of this use case, I will assume the following:

The XML description of the file is used because the application wants to use standard XML tools and interfaces to manipulate the objects. For instance, standard packages will read XML into DOM objects, which provide not only standard data structures for the XML objects, but also standard interfaces for manipulating the tree (insert, delete, etc.). There are already standard editors for XML trees.


The notion of this use case is that you would build tools for HDF-5 by extending such standard XML functions. The main trick will be to keep the XML and the HDF-5 correlated. When an XML object is created or changed, the tool will perform the equivalent operation on the file.

Thus, in this case, it is very important that the XML closely mirror the structure of the HDF-5 file, and that this correspondence be maintained as either one changes. (This contrasts with Case 1, where the XML is generated once, and possibly used many times without ever accessing the file.) Also, we definitely want the XML to point to the objects in the file, whether or not the data is in the XML.
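
One way to picture this is a thin wrapper around the standard DOM calls, where every change to the tree also queues the equivalent operation on the file. The sketch below uses Python's xml.dom.minidom. The element names, the helper functions (add_group, remove_node, path_of), and the operation names are hypothetical, and a plain list stands in for the calls a real tool would make to the HDF-5 library.

```python
from xml.dom.minidom import parseString

# Hypothetical XML description of a file; element names are illustrative.
doc = parseString('<file><group name="/"><dataset name="temp"/></group></file>')

# Stand-in for the HDF-5 side. A real tool would call the HDF-5 library
# here instead of appending to a list.
pending_file_ops = []


def path_of(node):
    """Build an HDF-5 style path from the `name` attributes of the ancestors."""
    parts = []
    while (node is not None and node.nodeType == node.ELEMENT_NODE
           and node.hasAttribute("name")):
        parts.append(node.getAttribute("name").strip("/"))
        node = node.parentNode
    return "/" + "/".join(p for p in reversed(parts) if p)


def add_group(parent, name):
    """Insert a <group> under `parent` and queue the matching file operation."""
    group = doc.createElement("group")
    group.setAttribute("name", name)
    parent.appendChild(group)
    pending_file_ops.append(("create_group", path_of(group)))
    return group


def remove_node(node):
    """Remove `node` from the tree and queue the matching file operation."""
    pending_file_ops.append(("delete", path_of(node)))
    node.parentNode.removeChild(node)


root = doc.documentElement.firstChild  # the root group "/"
add_group(root, "results")
remove_node(root.firstChild)           # the dataset "temp"
```

After these two edits the tree holds only the new group, and pending_file_ops holds the create and delete operations, in order, that would bring the file back into agreement with the XML.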

Depending on the application, many details of the file may be needed, certainly including things like storage strategy. However, since the file is available, these details can be obtained from the file rather than from the XML. This means that much of the detail could be optional in the XML.

Case 3: Generation, validation, and reconstruction of HDF-5

A third case for using XML is as a tool for validating, comparing, or generating HDF-5 files. We have proposed tools for checking, correcting, and diff-ing HDF-5 files, which might use XML as a canonical description of the file. Similarly, an 'h5gen' utility might well use XML as the template to create HDF-5 files.

These applications need to be able to represent essentially everything about the HDF-5 file. In the case of a validator or diff-er, even boot block information is important.

Also, it may well be the case that the data must be included in the XML, either because the HDF-5 file is not available, or because it must be in a canonical form for comparison.

While it is necessary for everything (or "everything important") to be in the XML, it is not necessary that the XML representation itself follow all of the rules of HDF-5. For instance, it is not required that the XML objects be in the same order as the HDF-5 objects (if such an order can even be determined), or (I think) that storage offsets be represented in the XML.
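
To illustrate why object order need not matter to a diff-er, here is a minimal sketch that reduces each XML description to a canonical form with the children sorted, and then compares the two forms. The element names and the sample data are hypothetical, and this is only one possible canonicalization, not a proposal for how an actual diff tool should work.

```python
import xml.etree.ElementTree as ET


def canonical(elem):
    """Return a nested tuple that ignores the order of child elements."""
    children = sorted(canonical(child) for child in elem)
    text = (elem.text or "").strip()
    return (elem.tag, tuple(sorted(elem.attrib.items())), text, tuple(children))


def same_description(xml_a, xml_b):
    """True if the two descriptions match apart from the order of objects."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))


# Hypothetical descriptions with the data included in the XML.
a = '<group name="/"><dataset name="x">1 2 3</dataset><dataset name="y">4 5</dataset></group>'
b = '<group name="/"><dataset name="y">4 5</dataset><dataset name="x">1 2 3</dataset></group>'
c = '<group name="/"><dataset name="x">1 2 3</dataset><dataset name="y">4 6</dataset></group>'

reordered_is_same = same_description(a, b)
changed_data_is_same = same_description(a, c)
```

Here a and b list the same objects in a different order and compare as equal, while c differs from a in one data value and does not.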