3-9-00
This note discusses some "use cases" for XML with HDF-5. These are descriptions of how XML descriptions of HDF-5 files might be used, along with some analysis of the implications for the requirements for the DTD. We definitely want and intend to support all these uses (and others).
An earlier version of some of these Use Cases is here.
XML descriptions of HDF5 files will be readable by standard Web browsers. Some standard Web browsers will be able to display the XML directly, and many servers will be able to generate HTML from XML on the fly. We will also be able to construct style sheets to control the rendering of information about HDF5 files.
This use of XML will make it easy to get at least a general view of the contents of an HDF5 file without any special software.
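As a rough illustration of generating HTML from an XML description on the fly, the sketch below renders a small description as a nested HTML list. The element and attribute names (File, Group, Dataset, Attribute, name, value) are invented for this example and are not taken from the DTD; a real deployment would more likely use a style sheet.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical XML description; the element and attribute names are
# illustrative only and do not come from the HDF5 DTD.
SAMPLE = """
<File name="example.h5">
  <Group name="/">
    <Dataset name="temperature">
      <Attribute name="units" value="K"/>
    </Dataset>
    <Group name="model">
      <Dataset name="pressure"/>
    </Group>
  </Group>
</File>
"""

def to_html(elem):
    """Render one element and its children as a nested HTML list item."""
    label = f"{elem.tag}: {escape(elem.get('name', ''))}"
    if elem.get("value") is not None:
        label += f" = {escape(elem.get('value'))}"
    children = "".join(to_html(child) for child in elem)
    if children:
        return f"<li>{label}<ul>{children}</ul></li>"
    return f"<li>{label}</li>"

def render(xml_text):
    """Parse the description and wrap the rendered tree in a top-level list."""
    root = ET.fromstring(xml_text.strip())
    return f"<ul>{to_html(root)}</ul>"

html_page = render(SAMPLE)
print(html_page)
```

The output is a plain HTML fragment that any browser can display, which is the point of this use case: no HDF5 software is needed on the viewing side.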
One use of XML will be for catalog records, e.g., at a DAAC or similar archive. The contents of HDF-5 files will be described in XML records which will be stored in a database or otherwise served through search services. The XML will be delivered to clients or proxies without the original HDF-5 file. The client will use the records to locate and obtain the datasets they want.
In this use, the XML is likely to be used separately from the original HDF-5 file, and any 'pointers' to the file must have complete URLs and other information, in order to locate the actual dataset.
In general, the purpose of these records is not to deliver all the data, nor to reconstruct the contents of the file from the XML. However, the content of the attributes is likely to be vital. Also, there may be a desire for the records to be comparatively compact, as you might be searching thousands of candidate datasets, and hence might receive thousands of XML descriptions.
It is difficult to know what kinds of searches will be required. However, details like library version and storage strategy are unlikely to be of great interest; most searches will probably be on the attributes.
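A minimal sketch of the catalog case follows: a client holds only the XML records, searches them by attribute, and gets back complete URLs for the files. The record layout, the href attribute, and the example.org URLs are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog records. The element names, the href attribute,
# and the example.org URLs are invented for this sketch.
RECORDS = [
    """<Record href="http://example.org/data/a.h5">
         <Dataset name="sst"><Attribute name="units" value="K"/></Dataset>
       </Record>""",
    """<Record href="http://example.org/data/b.h5">
         <Dataset name="wind"><Attribute name="units" value="m/s"/></Dataset>
       </Record>""",
]

def find_datasets(records, attr_name, attr_value):
    """Return (file URL, dataset name) pairs whose attribute matches."""
    hits = []
    for text in records:
        record = ET.fromstring(text)
        for dataset in record.iter("Dataset"):
            for attr in dataset.findall("Attribute"):
                if attr.get("name") == attr_name and attr.get("value") == attr_value:
                    hits.append((record.get("href"), dataset.get("name")))
    return hits

matches = find_datasets(RECORDS, "units", "K")
print(matches)
```

Note that the search never touches the HDF-5 files themselves; the complete URL in each record is what lets the client fetch the dataset afterwards.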
A second case is using XML as a machine readable description of HDF-5 files while manipulating the file itself. An HDF5 editor is an example of this kind of use. Here, the XML and HDF-5 file are both accessible to the program.
It is difficult to generalize what applications might intend to do. For purposes of this use case, I will assume the following:
The XML description of the file is used because the application wants to use standard XML tools and interfaces to manipulate the objects. For instance, standard packages will read XML into DOM objects, which provide not only standard data structures for the XML objects, but also standard interfaces for manipulating the tree (insert, delete, etc.). There are already standard editors for XML trees.
The notion of this use case is that you would build tools for HDF-5 by extending such standard XML functions. The main trick will be to keep the XML and the HDF-5 correlated. When an XML object is created or changed, the tool will perform the equivalent operation on the file.
Thus, in this case, it is very important that the XML closely mirror the structure of the HDF-5 file, and that this correspondence be maintained as the file changes. (This contrasts with Case 1, where the XML is generated once, and possibly used many times without ever accessing the file.) Also, we definitely want the XML to point to the objects in the file, whether the data is in the XML or not.
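The idea of keeping the XML and the file correlated can be sketched as a thin wrapper around a DOM tree, where every insert or delete is paired with the equivalent file operation. In this sketch the "file" side is only a stand-in that logs what would be done; the class names, the Group element, and the backend methods are hypothetical and do not correspond to any real HDF5 API.

```python
from xml.dom import minidom

class RecordingBackend:
    """Stand-in for real HDF5 library calls; it only logs what would be done."""
    def __init__(self):
        self.log = []

    def create_group(self, path):
        self.log.append(("create_group", path))

    def delete(self, path):
        self.log.append(("delete", path))

class MirroredTree:
    """Applies each DOM edit and the equivalent file operation together."""
    def __init__(self, xml_text, backend):
        self.doc = minidom.parseString(xml_text)
        self.backend = backend

    def _path(self, node):
        # Walk up through enclosing Group elements to build the object path.
        parts = []
        while (node is not None and node.nodeType == node.ELEMENT_NODE
               and node.tagName == "Group"):
            parts.append(node.getAttribute("name"))
            node = node.parentNode
        return "/" + "/".join(p for p in reversed(parts) if p != "/")

    def add_group(self, parent, name):
        child = self.doc.createElement("Group")
        child.setAttribute("name", name)
        parent.appendChild(child)
        self.backend.create_group(self._path(child))
        return child

    def remove(self, node):
        path = self._path(node)  # compute the path before detaching the node
        node.parentNode.removeChild(node)
        self.backend.delete(path)

backend = RecordingBackend()
tree = MirroredTree('<File><Group name="/"/></File>', backend)
root = tree.doc.getElementsByTagName("Group")[0]
model = tree.add_group(root, "model")
tree.remove(model)
print(backend.log)
```

A real tool would replace RecordingBackend with calls into the HDF-5 library, and would also need to decide what to do when the file operation fails after the DOM has already been changed.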
Depending on the application, many details of the file may be needed, certainly including things like storage strategy. However, since the file is available, these things can be obtained from the file rather than XML. This means that much of the detail could be optional for the XML.
A third case for using XML is as a tool for validating, comparing, or generating HDF-5 files. We have proposed tools for checking, correcting, and diff-ing HDF-5 files, which might use XML as a canonical description of the file. Similarly, an 'h5gen' utility might well use XML as the template to create HDF-5 files.
These applications need to be able to represent essentially everything about the HDF-5 file. In the case of a validator or diff-er, even boot block information is important.
Also, it may well be the case that the data must be included in the XML, either because the HDF-5 file is not available, or because it must be in a canonical form for comparison.
While it is necessary for everything (or "everything important") to be in the XML, it is not necessary that the XML representation itself follows all of the rules of HDF-5. For instance, it is not required that the XML objects are in the same order as the HDF-5 objects (if such can even be determined), or (I think) that storage offsets are represented in the XML.
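One way to use XML as a canonical form for comparison, given that object order need not match, is to reduce each description to a structure that ignores the order of child elements. The sketch below does this with nested tuples; the File and Dataset element names are invented for the example, and a real diff tool would also have to decide which details count as significant.

```python
import xml.etree.ElementTree as ET

def canonical(elem):
    """Reduce an element to a nested tuple that ignores child order."""
    children = sorted(canonical(child) for child in elem)
    text = (elem.text or "").strip()
    return (elem.tag, tuple(sorted(elem.attrib.items())), text, tuple(children))

def same_description(xml_a, xml_b):
    """True if the two descriptions match apart from the order of children."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))

# Hypothetical descriptions: A and B differ only in object order,
# while C differs from A in the data itself.
A = '<File><Dataset name="a">1 2 3</Dataset><Dataset name="b"/></File>'
B = '<File><Dataset name="b"/><Dataset name="a">1 2 3</Dataset></File>'
C = '<File><Dataset name="a">1 2 4</Dataset><Dataset name="b"/></File>'

print(same_description(A, B), same_description(A, C))
```

Because the data values are carried in the XML here, the comparison works even when the HDF-5 files are not available.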
XML is ideally suited for automatic transformation into various formal languages, either directly or via additional XML languages. For example, an XML description of an HDF5 file could be transformed into ODL. Similarly, XML can be transformed to other XML languages, such as XDF.
XML may also be a good intermediate language for translating between file formats. For example, the XML description of HDF5 could be transformed into the XML description for netCDF, and then the data could be written as netCDF.
It is likely that there will be "hub" languages, such as XDF, that are very general languages for data. Translating from HDF5-XML to XDF will lose information, but will then make the data translatable to any other format that can be mapped to XDF. Similarly, data could be imported to HDF5 from any format that can be translated to XDF, albeit with some loss of information.
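The following sketch illustrates translation between XML languages, including the kind of information loss mentioned above. It maps a hypothetical HDF5 description onto a flat target vocabulary that has no notion of groups, so the group hierarchy survives only as a name prefix. Both vocabularies (File, Group, Dataset, Attribute on one side; container, variable, attribute on the other) are invented for the example and are not XDF or any netCDF XML format.

```python
import xml.etree.ElementTree as ET

def translate(source_text):
    """Map a hypothetical HDF5 description to a flat, hypothetical target."""
    source = ET.fromstring(source_text)
    target = ET.Element("container")

    def walk(group, prefix):
        for child in group:
            if child.tag == "Group":
                # The target has no groups, so the hierarchy is flattened
                # into the variable name.
                walk(child, prefix + child.get("name") + "_")
            elif child.tag == "Dataset":
                var = ET.SubElement(target, "variable",
                                    name=prefix + child.get("name"))
                for attr in child.findall("Attribute"):
                    ET.SubElement(var, "attribute",
                                  name=attr.get("name"),
                                  value=attr.get("value"))

    walk(source, "")
    return ET.tostring(target, encoding="unicode")

SOURCE = (
    '<File><Group name="model"><Dataset name="pressure">'
    '<Attribute name="units" value="Pa"/></Dataset></Group>'
    '<Dataset name="time"/></File>'
)
translated = translate(SOURCE)
print(translated)
```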
It should also be noted that an XML description of HDF5 could be used to transform or translate individual objects from a file. For example, an HDF5 file might contain several datasets, one of which can be mapped to an OGIS gridded map. In this case, software could read the XML, locate the datasets that can be handled, and translate them to OGIS XML or other OGIS representations. In this way, similar kinds of data can be made to work together regardless of storage format, and without requiring that the entire file be limited to a particular kind or format of data. This would be a very powerful tool for sharing data.
The XML description of an HDF5 file is a promising candidate to be a machine readable format to be stored in archives. The XML would likely be interpretable in the future, and could be mapped to whatever technology is available.
In this scenario, the XML should contain sufficient information to access and translate the data if necessary.
One variation of this theme is to store the descriptions of the files in a repository, while the files themselves reside on some other storage medium. Alternatively, the XML might be stored in the file itself, as a machine readable table of contents.
XML can be used as a medium for creating templates or skeletons for HDF5 files. For example, the skeleton of a data product could be defined in XML, and read by software to produce the file and then fill in the specific values. This is a very useful tool for standardization. This is very similar to how the HCR tools for HDF-EOS worked.
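As a sketch of the template idea, the code below reads a small XML skeleton and expands it into an ordered list of creation steps, filling in the dimension sizes for one specific product. The template vocabulary (Product, Group, Dataset, a shape attribute with symbolic dimension names) is an assumption for this example, and the steps are only descriptions; a real tool would hand each one to the HDF5 library.

```python
import xml.etree.ElementTree as ET

# Hypothetical template; the element names and symbolic dimensions
# (ROWS, COLS) are invented for this sketch.
TEMPLATE = """
<Product>
  <Group name="grid">
    <Dataset name="data" shape="ROWS,COLS"/>
    <Dataset name="lat" shape="ROWS"/>
  </Group>
</Product>
"""

def plan(template_text, sizes):
    """Turn a template into a list of (action, path, shape) creation steps."""
    steps = []

    def walk(elem, path):
        for child in elem:
            child_path = path + "/" + child.get("name")
            if child.tag == "Group":
                steps.append(("create_group", child_path, None))
                walk(child, child_path)
            elif child.tag == "Dataset":
                dims = child.get("shape").split(",")
                shape = tuple(sizes[dim] for dim in dims)
                steps.append(("create_dataset", child_path, shape))

    walk(ET.fromstring(template_text.strip()), "")
    return steps

steps = plan(TEMPLATE, {"ROWS": 180, "COLS": 360})
for step in steps:
    print(step)
```

The same template can be reused with different sizes, which is what makes this approach useful for standardizing data products.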
It might also be possible to have XML templates for parts of HDF5 files, which can be composed to form datasets. For instance, there could be a library of XML templates for storing gridded data of various kinds, which would be coordinated with software to efficiently store and retrieve the data. A user could compose a data product by selecting appropriate templates to construct the dataset. This could also provide code modules to create and read the dataset.