Some key points from the Data Modelling meeting, 2/16/00
(Discussion of XML DTD for HDF5)

REMcG


Purpose and use of DTD

Some global issues were suggested in passing, which should be noted.


What to do about data, especially binary data?

In general, XML does not have numeric types, nor does it understand the semantics of numbers.

XML can:


For handling the data in HDF5 files, there seem to be two basic strategies:

  1. point to the data, e.g., with a URL+path, with tags/attributes indicating that this should be read with HDF-5 software.
  2. include the data in a character encoding.

The consensus is to have the DTD define both treatments, so that either can be used element by element. E.g.,

<DATA_FROM_FILE>
<POINTER_TO_DATA>
URL, path etc.
</POINTER_TO_DATA>
<DATA>
... character/Unicode encoding of the data....
</DATA>
</DATA_FROM_FILE>
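
A reader for this element might be sketched as follows. This is only an illustration, not a proposed implementation: the element names come from the example above, but the helper name describe_data, the rule "use inline data if present, otherwise the pointer", and the pointer text "file.h5#/group/dataset" are all assumptions, since the pointer syntax has not been decided yet.

```python
import xml.etree.ElementTree as ET


def describe_data(elem):
    """Return ("inline", text) or ("pointer", location) for a DATA_FROM_FILE element.

    Assumption: inline DATA wins if both children are present.
    """
    data = elem.find("DATA")
    if data is not None:
        return ("inline", (data.text or "").strip())
    pointer = elem.find("POINTER_TO_DATA")
    if pointer is not None:
        return ("pointer", (pointer.text or "").strip())
    raise ValueError("DATA_FROM_FILE has neither DATA nor POINTER_TO_DATA")


# Two hypothetical documents, one per treatment.
inline_doc = ET.fromstring("<DATA_FROM_FILE><DATA>1 2 3</DATA></DATA_FROM_FILE>")
pointer_doc = ET.fromstring(
    "<DATA_FROM_FILE>"
    "<POINTER_TO_DATA>file.h5#/group/dataset</POINTER_TO_DATA>"
    "</DATA_FROM_FILE>"
)

inline_result = describe_data(inline_doc)
pointer_result = describe_data(pointer_doc)
```

In the pointer case the caller would still have to open the referenced file with HDF-5 software; the XML only says where the data lives.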

To Do:

Investigate and propose details of both pointer and character representation of data.

RE pointer: Need to investigate the XML standards for XPointer and XLink, and "do the right thing".

RE character representation: Data in an HDF-5 file is often structured, i.e., it can be an array of structures. This opens the question of whether we want to "mark up" the data elements themselves, e.g., marking the rows, cols, fields, etc. of the data. This could be done by defining additional tags to be used within the '<DATA>' element.

An alternative is to have a standard for "flattening" data into a one-dimensional array of Unicode.
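
As a rough sketch of what flattening could mean, the following writes a 2D array as one whitespace-separated string and rebuilds the rows from it. Row-major order, the space separator, and carrying the shape separately are all assumptions made for the illustration; no such standard has been chosen.

```python
def flatten(rows):
    """Flatten a 2D list into (shape, text) in row-major order."""
    shape = (len(rows), len(rows[0]) if rows else 0)
    text = " ".join(str(value) for row in rows for value in row)
    return shape, text


def unflatten(shape, text):
    """Rebuild the rows from the flat text; the values come back as strings."""
    nrows, ncols = shape
    values = text.split()
    return [values[r * ncols:(r + 1) * ncols] for r in range(nrows)]


shape, text = flatten([[1, 2, 3], [4, 5, 6]])
restored = unflatten(shape, text)
```

Note that the restored values are strings. Because XML has no numeric types, the reader needs the datatype from elsewhere in the description to convert them back.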

And, of course, we will probably follow a mixed approach. Strings and scalars can be represented in a straightforward way as Unicode strings with standard formats. Other data elements might be represented as several sub-elements, with further structure flattened. For example, a 2D array of compound data types might be represented as several "<ROW>" elements containing "<CELL>" elements, but perhaps each cell might be stored as a flat array of bytes with no further mark up.
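
The 2D example could look something like the sketch below, which marks up rows and cells but stores each compound value as an opaque run of bytes. The ROW and CELL names come from the paragraph above; the function name array_to_xml, the hex text encoding, and the record layout (a little-endian 32-bit integer followed by a 32-bit float) are assumptions for illustration only.

```python
import struct
import xml.etree.ElementTree as ET

RECORD_FORMAT = "<if"  # hypothetical compound type: int32 + float32, little-endian


def array_to_xml(rows, fmt=RECORD_FORMAT):
    """Mark up rows and cells; each cell holds one record as hex-encoded bytes."""
    data = ET.Element("DATA")
    for row in rows:
        row_elem = ET.SubElement(data, "ROW")
        for record in row:
            cell = ET.SubElement(row_elem, "CELL")
            cell.text = struct.pack(fmt, *record).hex()
    return data


grid = [[(1, 0.5), (2, 1.5)], [(3, 2.5), (4, 3.5)]]
data_elem = array_to_xml(grid)
xml_text = ET.tostring(data_elem, encoding="unicode")

# Reading a cell back requires knowing the record layout from elsewhere.
first_cell = data_elem.find("ROW/CELL")
first_record = struct.unpack(RECORD_FORMAT, bytes.fromhex(first_cell.text))
```

The trade-off is visible here: the row and column structure can be processed with ordinary XML tools, while the contents of each cell are meaningful only to software that knows the compound type.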


Attributes/Elements

We discussed the use of XML attributes and elements. There is some freedom here, and sometimes the decision may be a matter of taste. We can and should choose whichever makes sense in a given case.

Rules and tips for choosing

To Do:

Case by case, decide about using attributes or elements....


File Structure, Links

We discussed how best to represent the structure of an HDF-5 file with XML.

Two approaches were discussed:
  1. All the nameable objects are at the top level, all links are explicitly included. The structure of the file has to be reconstructed from all the links.
  2. The objects are all nested in the RootGroup, with extra link objects for cases where there are multiple references to the same object.

These approaches are theoretically equivalent: any file that can be expressed one way can be expressed the other, and there is a formal mapping between the two representations of the file.

The first approach is 'elegant', and represents the actual way that HDF-5 works. However, this isn't the way the documentation describes the file, and isn't how the API or dumper works. Also, this approach does not take advantage of the 'treeness' of XML, even when the file really is a tree. It is more complicated than needed for the common simple cases.

The second approach still has links, but they are needed only for objects with more than one link. In most cases, the object will be nested in a natural way, with the XML matching the HDF-5 (and the DDL).

The general consensus was to do the tree plus links, because this makes the common case easy.
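
One possible way to construct the tree plus links from a flat table of objects is sketched below. It is a guess at a workable heuristic, not an agreed rule: each object is nested at the first place a depth-first walk from the root reaches it, and every later reference becomes a link. RootGroup is the name used above; the GROUP, DATASET and LINK element names, the name/id/target attributes, and the input layout are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_tree(objects, root_id):
    """objects maps id -> {"kind": "GROUP" or "DATASET", "links": [(name, target_id), ...]}."""
    placed = {root_id}  # objects that already appear nested in the tree

    def place(parent, name, obj_id):
        if obj_id in placed:
            # Second or later reference (this also stops cycles): emit a link only.
            ET.SubElement(parent, "LINK", {"name": name, "target": obj_id})
            return
        placed.add(obj_id)
        obj = objects[obj_id]
        elem = ET.SubElement(parent, obj["kind"], {"name": name, "id": obj_id})
        for child_name, child_id in obj.get("links", []):
            place(elem, child_name, child_id)

    root = ET.Element("RootGroup", {"id": root_id})
    for child_name, child_id in objects[root_id].get("links", []):
        place(root, child_name, child_id)
    return root


# Hypothetical file: dataset d1 is reachable as /a/shared and as /b.
objects = {
    "g0": {"kind": "GROUP", "links": [("a", "g1"), ("b", "d1")]},
    "g1": {"kind": "GROUP", "links": [("shared", "d1")]},
    "d1": {"kind": "DATASET", "links": []},
}
tree = build_tree(objects, "g0")
```

With this ordering the dataset is nested under group "a" and the root's "b" entry becomes a link. A different walk order would nest it elsewhere, which is why the choice cannot be expressed in the DTD alone.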

To Do:

Revise DTD to do the tree with aux. links. Note: will need to define heuristics beyond the DTD for how to construct the tree.