Revised HDF5 DTD (Version 1)

2/21/00
REMcG

Summary

I revised the HDF5 DTD following the discussions last week.

The changes are:

  1. The hierarchy is represented as a tree, which is expressed directly as XML nesting. Several ramifications of this change are noted below.
  2. The data is now defined as an element that is either a 'pointer' or 'data from the file'. The pointer is sketched; the handling of data is TBD.

The revised DTD is here:
http://hdf.ncsa.uiuc.edu/HDF5/XML/DTD/HDF5-DTD-1.dtd

The example XML file was revised to use the new DTD, and is here:
http://hdf.ncsa.uiuc.edu/HDF5/XML/Version-1/HDF5-DTD-1-TEST.xml

For reference, the previous version is available at:
http://hdf.ncsa.uiuc.edu/HDF5/XML/Version-0/HDF5-DTD.html


Discussion of Revisions

1. Represent the File as a Tree, Mapped to the XML Tree

The HDF5-File now contains at most two objects: an optional BootBlock and exactly one RootGroup.

<!ELEMENT HDF5-File (BootBlock?,RootGroup)>
<!ELEMENT BootBlock EMPTY>
<!ATTLIST BootBlock
 LibraryVersion CDATA #IMPLIED
>
<!ELEMENT RootGroup (Attribute*,Group*,Dataset*,DataType*,Link*,SoftLink*)>
<!ATTLIST RootGroup
 Name CDATA #FIXED "/"
 OBJ-XID ID #IMPLIED
>

The RootGroup and Group may now contain zero or more other Groups, Datasets, and DataTypes, as well as Attributes, Links, and SoftLinks.

Each HDF5 object is nested within one Group. If there is more than one reference to the object, the subsequent references are Link objects.

Groups, Datasets, and DataTypes now have an XML attribute "Name", to store the name of the link when they are nested within a group.

The back links are still included: each object points to its enclosing group(s).
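
As a rough illustration of the nesting described above, the following Python sketch parses a small instance document and rebuilds each object's HDF5 path from the XML nesting alone. The sample document and the object names in it are made up for illustration. It also assumes that Group accepts the same kinds of children as RootGroup, which this note states only informally.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; the names g1, d1, d2 are invented.
SAMPLE = """\
<HDF5-File>
  <RootGroup Name="/">
    <Group Name="g1">
      <Dataset Name="d1"/>
    </Group>
    <Dataset Name="d2"/>
  </RootGroup>
</HDF5-File>
"""

# Element types that carry a "Name" attribute when nested in a group.
NAMED = {"Group", "Dataset", "DataType"}


def object_paths(xml_text):
    """Return (element type, HDF5 path) pairs in document order."""
    root_group = ET.fromstring(xml_text).find("RootGroup")
    paths = []

    def walk(elem, prefix):
        for child in elem:
            if child.tag in NAMED:
                path = prefix + child.get("Name")
                paths.append((child.tag, path))
                if child.tag == "Group":
                    # Only groups can contain further named objects.
                    walk(child, path + "/")

    walk(root_group, "/")
    return paths


for kind, path in object_paths(SAMPLE):
    print(kind, path)
```

Because the path is recovered from the nesting, no separate table of parent/child relationships is needed for objects that have a single reference.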

2. Represent the Data as either a Pointer or in-line Data

I created a new XML element to represent data from the file. This element contains either a 'pointer' to the data or formatted data.

<!ELEMENT Data (PointerToDataInHDF?,DataFromFile?)>

The definition shown here may not be quite correct: because both children are optional, it allows Data to be empty, and it also allows both children to appear together.
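
The concern about this content model can be made concrete. Read literally, (PointerToDataInHDF?,DataFromFile?) accepts a Data element with no children as well as one with both children. One possible tightening, offered only as a suggestion, would be a choice group such as (PointerToDataInHDF|DataFromFile). Until the DTD settles, an application could apply the stricter rule itself. The Python sketch below does that; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED = ("PointerToDataInHDF", "DataFromFile")


def has_exactly_one_source(data_elem):
    """True if <Data> holds exactly one child, and it is a pointer or in-line data."""
    children = [child.tag for child in data_elem]
    return len(children) == 1 and children[0] in ALLOWED


# All three documents are accepted by the current content model,
# but only the first satisfies the stricter "either/or" reading.
one = ET.fromstring("<Data><DataFromFile/></Data>")
empty = ET.fromstring("<Data/>")
both = ET.fromstring("<Data><PointerToDataInHDF/><DataFromFile/></Data>")

print(has_exactly_one_source(one))    # True
print(has_exactly_one_source(empty))  # False
print(has_exactly_one_source(both))   # False
```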

The pointer is sketched out, using the XML standard for linking, XLink. This standard is very flexible, so it should be able to accommodate whatever we need. The current version looks like:

<!ELEMENT PointerToDataInHDF (Selection)?>
<!ATTLIST PointerToDataInHDF
 xlink:type CDATA #FIXED "locator"
 xlink:href CDATA #REQUIRED
 H5Path CDATA #IMPLIED
 H5ObjectType (HDF5Attribute|HDF5Dataset) #REQUIRED
 OBJ-XID ID #IMPLIED
>

The element has a URL ('href'), information about the kind of object, a path to access it by, and the ID of the object within the XML. The principle here is to provide enough information so that an application program will be able to find the file and the object. Note also that I included a sub-element which is a 'Selection' object. This allows pointing to a selection, when applicable.
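
To show how an application might consume such a pointer, here is a minimal Python sketch that reads the attributes declared above into a small record. The sample element, the file URL, the path, and the names DataPointer and read_pointer are all invented for illustration. The sketch also assumes that the instance document binds the xlink prefix to the XLink namespace URI; namespace-aware parsers will reject the element otherwise.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical pointer; the URL, path, and ID values are invented.
SAMPLE = """\
<PointerToDataInHDF xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:type="locator"
    xlink:href="file:///tmp/example.h5"
    H5Path="/g1/d1"
    H5ObjectType="HDF5Dataset"
    OBJ-XID="d1"/>
"""


@dataclass
class DataPointer:
    href: str                # where the HDF5 file lives
    object_type: str         # "HDF5Attribute" or "HDF5Dataset"
    h5_path: Optional[str]   # path of the object inside the file
    xid: Optional[str]       # ID of the object within the XML


def read_pointer(elem):
    """Collect what an application needs to locate the file and the object."""
    href = elem.get(f"{{{XLINK}}}href")
    object_type = elem.get("H5ObjectType")
    # Both attributes are #REQUIRED in the DTD.
    if href is None or object_type is None:
        raise ValueError("xlink:href and H5ObjectType are required")
    if object_type not in ("HDF5Attribute", "HDF5Dataset"):
        raise ValueError("unexpected H5ObjectType: " + object_type)
    return DataPointer(href, object_type, elem.get("H5Path"), elem.get("OBJ-XID"))


pointer = read_pointer(ET.fromstring(SAMPLE))
print(pointer.href, pointer.h5_path, pointer.object_type)
```

Handling of the optional Selection sub-element is left out, since its content is not defined in this note.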

The other element is called 'DataFromFile', and is intended to contain a formatted and/or marked-up representation of the actual data. The <DataFromFile/> element is currently declared 'EMPTY'; its content remains to be defined. The actual details of this are definitely an open question.


Important Questions For Discussion

The most important question is clearly 'what to do about data?' We resolved to optionally include data in some kind of text encoding. How do we want to encode the data?

Also, except in the simplest cases, the data itself is structured (e.g., it is a multidimensional array, and may have structured records). Do we want to define markup elements for data structures? These would nest inside the <DataFromFile> element.

Also, in the case of simple, atomic data, we can potentially use different formats, which can be marked with attributes. E.g., numbers can be represented in FORTRAN fashion, or C printf format, or whatever. The XML parser will not be able to validate the format itself, but there isn't any reason we can't have tags or attributes that indicate the intention.
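
As one possible reading of the format-attribute idea, the Python sketch below writes numbers as whitespace-separated text and records the C printf format in an attribute. Everything here is an assumption for discussion: the Format attribute does not exist in the DTD, DataFromFile is currently declared EMPTY, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode_values(values, c_format):
    """Build a <DataFromFile> element whose text is formatted with a printf-style spec."""
    # "Format" is a hypothetical attribute that records the intended format.
    elem = ET.Element("DataFromFile", {"Format": c_format})
    elem.text = " ".join(c_format % value for value in values)
    return elem


def decode_values(elem):
    """Parse the whitespace-separated text back into floats."""
    return [float(token) for token in elem.text.split()]


elem = encode_values([1.5, 0.25, 100.0], "%.3e")
print(ET.tostring(elem, encoding="unicode"))
# <DataFromFile Format="%.3e">1.500e+00 2.500e-01 1.000e+02</DataFromFile>
print(decode_values(elem))
# [1.5, 0.25, 100.0]
```

As noted above, an XML parser would treat the attribute value as opaque text; only the application would interpret it.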