Experimental DTD for HDF5

Robert E. McGrath
February 7, 2000

Introduction

I have been learning about XML and the Java tools that work with it. I need to know this for school work, and also for NCSA. In the last two weeks I've been learning how to create and use DTDs, in particular using some IBM Java tools for editing them. As a concrete exercise, I decided to try to write a DTD for HDF5 based on the DDL. This turned out to be pretty easy, and several iterations gave me a fairly complete DTD that is valid and that works.

To test this DTD I created the XML description for the sample file in the DDL documentation. This XML is complete except it doesn't attempt to include the data in the objects--just the description. Note that this XML was constructed by hand. We don't have any programs to generate this automatically, although that would be pretty simple to do.

The purpose of this was to gain experience and to give me a concrete DTD and test file to experiment with several free Java toolsets for XML. This is not intended to be the final, official DTD for HDF-5.

This note includes an explanation of the DTD I made, and the XML description.

The Grammar and Sample File

This DTD is hand constructed based on the DDL as described in the HDF-5 User's Guide,

http://hdf.ncsa.uiuc.edu/HDF5/doc/ddl.html
This appendix has the complete grammar and sample output for the file 'example.h5'. The DTD and XML file below should be compared to this web page to see the relationship.

The DTD

An XML "Document Type Definition" (DTD) is a structural template for a class of structured descriptions, not unlike a grammar. The HDF5 DTD is essentially one-to-one with the DDL grammar, expressed in a different language.

Here is the DTD, with line numbers:

HDF5 DTD

There are a few things that I would like to point out about the DTD.

First, the "root object" is HDF5-File which is defined to have elements (line 1 of DTD.txt):

(BootBlock?,RootGroup,(Group?,Dataset?,DataType?)*)
This says that there are 0 or 1 "BootBlock" elements, exactly 1 "RootGroup" element, and 0 or more other "Group", "Dataset", and "DataType" objects in any mix. Initially, I had simply an optional BootBlock and a mandatory RootGroup, but I changed it so that "named" objects are at the top level of the XML file, regardless of where they are in the HDF-5 structure. (This is a design issue to be discussed.)
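
One way to read this content model is as a regular expression over the sequence of child element names. The Python sketch below is only an illustration and is not one of the tools used for this note; the helper name matches_root_model and the tiny test documents are made up, and a real validating parser checks much more than this.

```python
import re
import xml.etree.ElementTree as ET

# The content model (BootBlock?,RootGroup,(Group?,Dataset?,DataType?)*)
# rewritten as a regular expression over space-terminated child names.
ROOT_MODEL = re.compile(
    r"(BootBlock )?RootGroup ((Group )?(Dataset )?(DataType )?)*"
)


def matches_root_model(xml_text):
    """Return True if the root element's children fit the content model.

    Hypothetical helper for illustration; it looks only at the names and
    order of the direct children of the root element.
    """
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return ROOT_MODEL.fullmatch(names) is not None
```

Under these assumptions, a file with a RootGroup followed by a Dataset and then a Group is accepted, while a file with no RootGroup, or with a BootBlock after the RootGroup, is rejected.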

The objects are expressed in a pretty obvious way, so a "Dataset" is:

 23 <!ELEMENT Dataset
 (Attribute*,Dataspace,DataType,DataObjectInFile,
 Compression?,StorageLayout?)>
 24 <!ATTLIST Dataset
 25 OBJ-XID ID #IMPLIED
 26 Parents IDREFS #REQUIRED
 27 >
This shows that it has zero or more "Attribute" elements, one "Dataspace", one "DataType", and a "DataObjectInFile" (which is TBD), plus optional "Compression" and "StorageLayout" information. These objects are all defined in the DTD. This is all read straight from the DDL grammar.

Linking

In this DTD, I expressed the links in the file explicitly with a "Link" object (lines 17-22 of DTD.txt). The "Link" has a "Name" attribute, and the connection is expressed using XML "ID" and "IDREF" attributes. Note that these ID values are internal to the XML description. They do not have to be the same as, or even related to, the HDF-5 object paths or ids. The XML validator can assure that a "Link" refers to the IDs of two objects that exist in the XML file, but it cannot assure that the link makes sense. (E.g., a buggy XML file might have a Dataset be the 'parent' of a 'Group'--this would be a valid XML document, but not a valid HDF-5 file.)
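
The difference between what the validator checks and what it cannot check can be sketched in a few lines of Python. This is only a sketch under loud assumptions: the attribute names ParentRef and TargetRef are placeholders invented here (the real names are in lines 17-22 of DTD.txt), it assumes the RootGroup carries an OBJ-XID like the Dataset does, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Placeholder attribute names; the real DTD may spell these differently.
PARENT_ATTR = "ParentRef"   # assumed IDREF to the object that owns the link
TARGET_ATTR = "TargetRef"   # assumed IDREF to the object the link points at
ID_ATTR = "OBJ-XID"         # the ID attribute shown in the Dataset ATTLIST

# Made-up sample, not the example.h5 description.
SAMPLE = """<HDF5-File>
  <RootGroup OBJ-XID="root">
    <Link Name="dset1" ParentRef="root" TargetRef="d1"/>
    <Link Name="bad" ParentRef="d1" TargetRef="root"/>
    <Link Name="lost" ParentRef="root" TargetRef="nowhere"/>
  </RootGroup>
  <Dataset OBJ-XID="d1" Parents="root"/>
</HDF5-File>"""


def check_links(xml_text):
    """Return (dangling, nonsensical) lists of Link names.

    dangling:    a reference matches no ID (a validator would catch this).
    nonsensical: both references resolve, but the parent is not a group
                 (a validator would accept this).
    """
    root = ET.fromstring(xml_text)
    by_id = {e.get(ID_ATTR): e for e in root.iter() if e.get(ID_ATTR)}
    dangling, nonsensical = [], []
    for link in root.iter("Link"):
        parent = by_id.get(link.get(PARENT_ATTR))
        target = by_id.get(link.get(TARGET_ATTR))
        if parent is None or target is None:
            dangling.append(link.get("Name"))
        elif parent.tag not in ("RootGroup", "Group"):
            nonsensical.append(link.get("Name"))
    return dangling, nonsensical
```

For the sample above, the link named "lost" is dangling and the link named "bad" is nonsensical. Only the first kind of problem is visible to DTD validation; the second needs a tool that knows the HDF-5 model.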

To partly ensure that every object is reachable through some link, I included a "Parents" attribute in the Dataset, Group, and DataType objects. XML validation will assure that these objects have some valid XML ID in that field, but not that the value is actually correct.
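
Since "Parents" is declared as IDREFS, its value is a whitespace-separated list of IDs. The Python sketch below mimics the part a validator does check, namely that every entry matches some ID in the document. It is an illustration only: the helper name unresolved_parents is made up, and it assumes every linkable object carries an OBJ-XID attribute as the Dataset does.

```python
import xml.etree.ElementTree as ET


def unresolved_parents(xml_text):
    """Map each object's OBJ-XID to the Parents entries that match no ID.

    Hypothetical helper; an empty result means every Parents entry
    resolves, not that the parent really links to the object.
    """
    root = ET.fromstring(xml_text)
    # Assumption: every linkable object carries OBJ-XID, as Dataset does.
    known_ids = {e.get("OBJ-XID") for e in root.iter() if e.get("OBJ-XID")}
    problems = {}
    for elem in root.iter():
        parents = elem.get("Parents")
        if parents is None:
            continue
        # IDREFS is a whitespace-separated list of IDs.
        missing = [ref for ref in parents.split() if ref not in known_ids]
        if missing:
            problems[elem.get("OBJ-XID")] = missing
    return problems
```

Whether a listed parent actually contains a link to the object is a separate question that this check, like DTD validation, leaves open.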

So in this DTD, the structure of the HDF-5 file is represented by a series of links, as will be clearer in the example. This makes it possible to represent any HDF-5 file correctly, as long as tools understand how to extract the link information.

The Example XML Description

To make the DTD real, I hand coded an XML description of the example from the DDL. This is here:

Example XML
Note that the XML file refers to the HDF-5 DTD (line 3 of XML.txt). This says: "Please validate this file against the DTD at this URL." When I open the example XML with any XML tool, it automatically fetches the DTD and checks the XML against the DTD. This actually works!

The example shows the root group (lines 6-29 of XML.txt), which has one attribute (a string), and 6 members, represented by 5 hard links and one soft link. The objects linked to are in the XML after the RootGroup.

For example, Dataset "dset1" is described in lines 30-47 of XML.txt. It has no "Attribute" elements, and gives the DataSpace, DataType, and DataObjectInFile. Note that the Dataset does not have a "name" attribute. Instead, it has a "Parents" pointer to the root group, which has one or more links to this Dataset. This back pointer doesn't exist in HDF-5 directly; otherwise, this reflects the HDF-5 naming scheme.

The DataTypes are a hierarchy of XML objects, which eventually boil down to objects with attributes such as "size" and "typeCode". E.g., lines 39-46 of XML.txt describe an H5T_STD_I32BE.

Similarly, the DataSpace is a hierarchy of objects. The simple DataSpace for "dset1" has 2 dimensions, so it has two "Dimension" objects, each with current and maximum size attributes (lines 31-37 of XML.txt).

What About the Data?

This DTD aims to describe the structure of the HDF-5 file. The "data" parts are pointed to with faked XML xlinks. (E.g., line 45 of XML.txt)

XLinks are a generalization of HTML 'href' links, and may contain extra information, such as "use this program to read this data" and "extract this subset from this dataset in this way".

I didn't attempt to actually define this object at this time.

Comments

As noted at the beginning, the purpose of this DTD and XML example is for experiments with tools. There are a lot of different ways to express the HDF-5 model in XML, so I don't think this will necessarily be the final way we do it.

Issues to consider: