DTD for Describing the Data in an HDF5 File: Design Notes

Robert E. McGrath
March 17, 2000

Goals


The DTD for HDF5 should support four alternatives for describing or including data from an HDF5 file in the XML description:

  1. A pointer to the data in the HDF5 file. This includes the URL of the file, at least one valid name (path) for the object, and a description of what to read (all, part, etc.).
  2. A pointer to an external file in some dialect of XML, e.g., XDF.
  3. A formatted block of characters representing the data values, e.g., "1, 2, 3, 4".
  4. A character encoded representation of the data values, e.g., uuencode or binhex encoded ASCII.
It should be possible to use any or all of these representations for a given data element, to mix them within a single XML file, and even in a single array, if desired.

When included in the XML, the data itself is represented as one or more blocks of characters. The HDF5 DTD provides a detailed description of the Dataspace and the Datatype of the elements in the space, so the markup for the data need not repeat this information (it can point to the relevant XML objects). The representation of the data has three primary functions:

  1. To describe the identity of the data in terms of the HDF5 file: i.e., what data object (dataset or attribute) the data come from and which elements are described.
  2. To describe the representation(s) of the data.
  3. (Optionally) to describe the data values themselves.


Since HDF5 data is likely to be multidimensional arrays, the DTD should provide flexible ways to express the structure of the array, the position of the elements described, and the data itself. It should be possible to describe individual elements of the arrays, if desired. It should also be possible to include a description of many elements in a single block of characters, up to the entire contents of an array. The XML should make clear what data is represented by each block, and how it is intended to be interpreted.

The data elements can be of three types:

  1. scalar -- a single element
  2. simple -- array of a single type of numbers or string
  3. compound -- array of records, each record is composed of other types
HDF5 may also include user defined types (which are fully described) and other types such as OPAQUE.

The XML description should be able to describe data of all types.

Basic Design

The design is a single XML element, tentatively called "<Data>". The <Data> element has one of three possible forms:
  <Scalar> -- A single number or string.
  <Array> -- A (multidimensional) array of numbers or strings.
  <Table> -- A (multidimensional) array of compound data types (i.e., records).

The <Scalar> Element:


The <Scalar> element represents a data element that is a single number or string. The <Scalar> may only be represented as a formatted string of characters, e.g., "10.8". The other representations (see below) are not allowed.

The <Array> Element:


An <Array> element has a dimension list followed by one or more blocks of data. The dimension list is partly redundant with the information in the Dataspace element, but the order of the dimensions indicates the layout of the data in the XML file, which need not be identical to the order in the HDF5 file.

Each block of data is represented as an <ArrayData> element. An <ArrayData> element consists of a single <DataFromFile> element or else one or more nested <ArrayData> elements. Multidimensional arrays can be represented by nested <ArrayData> elements, down to a single data element, or to a contiguous group of elements, which would be described by a single <DataFromFile> element. These two elements form a tree, with the <ArrayData> elements as internal nodes, and <DataFromFile> elements as the leaves.
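The tree structure described above can be walked recursively. The following is a minimal sketch in Python, not part of the design: the fragment is a hypothetical cut-down version of the examples later in this document (the <HowToRead> elements are omitted for brevity), and the function name `leaves` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of this document's examples.
XML = """
<Array NDIMS="2">
  <ArrayData Dim="Dim0" index="0">
    <ArrayData Dim="Dim1" index="0">
      <DataFromFile><BlockOfData>00</BlockOfData></DataFromFile>
    </ArrayData>
    <ArrayData Dim="Dim1" index="1">
      <DataFromFile><BlockOfData>01</BlockOfData></DataFromFile>
    </ArrayData>
  </ArrayData>
</Array>
"""

def leaves(node, path=()):
    """Yield (index path, block text) for every <DataFromFile> leaf."""
    for child in node:
        if child.tag == "ArrayData":
            # Internal node: record which dimension and index it selects.
            step = (child.get("Dim"), int(child.get("index")))
            yield from leaves(child, path + (step,))
        elif child.tag == "DataFromFile":
            # Leaf: the path so far identifies the elements described.
            yield path, child.findtext("BlockOfData")

found = list(leaves(ET.fromstring(XML)))
for path, text in found:
    print(path, text)
```

Each leaf is reported together with the chain of (dimension, index) pairs above it, which is the information a reader needs to place the block within the array.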

The <DataFromFile> element consists of sets of two elements: the <HowToRead> element and an optional <BlockOfData> element.

The members of a <DataFromFile> element:

  <HowToRead> -- How to get and interpret the data.
    NativeHDF5 -- Read data from the indicated object using HDF5. (No data is in the XML.)
    XML -- Read data, expressed in XML, from another file.
    FormattedText -- The <BlockOfData> contains characters in a designated format, e.g., C or FORTRAN.
    BinaryEncoded -- The <BlockOfData> contains the data encoded into a character string, e.g., with uuencode or binhex.
  <BlockOfData> -- Characters representing the data values.

A <DataFromFile> element can be repeated for the same <ArrayData> element. This supports multiple representations of the same data, e.g., both a pointer to the HDF5 file and formatted text.

The <NativeHDF5> element:

The <NativeHDF5> element describes how to obtain the data described by the enclosing <ArrayData> element from a specific HDF5 file. This description is not completely defined yet, but will include:

  URI (probably xlink:) -- The URL and other location information for the HDF5 file.
  dataobject -- The dataset or attribute to read. This will be one or more valid path names sufficient to open the object for reading data.
  start, nelems, etc. -- What elements to read from the object.

The <XML> element:

The <XML> element points to an external XML file which describes the data. This file might be in any legal language, such as XDF or netCDF, so long as it can be interpreted to provide appropriate data for the enclosing <ArrayData> element.

The <FormattedText> element:

The <FormattedText> element describes how to read the enclosed text. The attributes include the format type (e.g., "C" or "FORTRAN"), the delimiter, the number of elements, and a format specification. The following <BlockOfData> must be interpreted according to the rules of the designated format scheme.
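As an illustration of how a reader might apply these attributes, here is a minimal Python sketch. It is an assumption-laden example, not a definition of the format: the `CONVERTERS` table covers only three C conversion specifiers, the default delimiter of "," is assumed, and the function name `read_formatted` is hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<DataFromFile>
  <HowToRead>
    <FormattedText nelems="2" formatType="C" delimiter="," formatDescriptor="%d"/>
  </HowToRead>
  <BlockOfData>10,11</BlockOfData>
</DataFromFile>
"""

# Assumed mapping from a few C conversion specifiers to Python converters.
CONVERTERS = {"%d": int, "%f": float, "%s": str}

def read_formatted(data_from_file):
    """Decode one <BlockOfData> according to its <FormattedText> attributes."""
    fmt = data_from_file.find("HowToRead/FormattedText")
    if fmt.get("formatType") != "C":
        raise ValueError("this sketch only handles formatType='C'")
    convert = CONVERTERS[fmt.get("formatDescriptor")]
    delimiter = fmt.get("delimiter", ",")  # assumed default when absent
    tokens = data_from_file.findtext("BlockOfData").split(delimiter)
    values = [convert(tok.strip()) for tok in tokens]
    # nelems lets the reader check that the block is complete.
    if len(values) != int(fmt.get("nelems")):
        raise ValueError("nelems does not match the block contents")
    return values

values = read_formatted(ET.fromstring(FRAGMENT))
print(values)  # [10, 11]
```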

The <BinaryEncoded> element:

The <BinaryEncoded> element describes a text encoding of binary data, such as uuencode or binhex. The attributes denote the encoding, the number of elements encoded, and the size of the encoded text. The following <BlockOfData> must be interpreted with the appropriate algorithm to decode the binary values.
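A round trip through a text encoding might look like the following Python sketch. Several things here are assumptions rather than part of the design: uuencode is used instead of binhex, the integers are packed as 32-bit little-endian values (the notes do not fix a byte order), and "binsize" is taken to be the length of the encoded text.

```python
import binascii
import struct
import xml.etree.ElementTree as ET

values = [0, 1, 10, 11, 20, 21]

# Assumption: 32-bit little-endian integers.
raw = struct.pack(f"<{len(values)}i", *values)

# One uuencoded line holds at most 45 bytes, so these 24 bytes fit in one line.
# backtick=True encodes zeros as "`" instead of spaces.
block = binascii.b2a_uu(raw, backtick=True).decode("ascii").rstrip("\n")

# Attributes a <BinaryEncoded> element could carry for this block.
attrs = {"nelems": str(len(values)), "Encoding": "uuencode",
         "binsize": str(len(block))}

# Writing the text through an XML library escapes characters such as "<" and "&".
elem = ET.Element("BlockOfData")
elem.text = block
round_tripped = ET.fromstring(ET.tostring(elem)).text

decoded = list(struct.unpack(f"<{len(values)}i", binascii.a2b_uu(round_tripped)))
print(decoded)
```

The uuencode alphabet includes characters that are special in XML, so the encoded text should be written with an XML library (or escaped by hand) rather than pasted in directly.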

The <Table> Element (Compound Datatypes)


The <Table> element is used to describe arrays of compound data types. The <Table> element is similar to the <Array> element, with the addition of a <FieldList> element, which describes the order of the components of the compound type.

The <DimensionList> of a <Table> is the same as for an <Array> element.

The <Field> element describes one component, with a pointer to the HDF5 Datatype element that describes the compound data type, and the name of the field in the type. The order of the <Field> records defines the order of the data elements in the XML, which need not be identical to the order in the HDF5 type.

Attributes of the <Field> element.
  FieldID -- The XML ID for this field.
  TypeRef -- The XML IDREF for the HDF5 Compound Datatype.
  FieldName -- The "name" of the field in the HDF5 Compound Datatype.

The <DataFromFile> element of the <Table> is the same as for an <Array> element, except that each element must contain a description of one or more entire compound data items, rather than numbers or strings.
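To make the role of the field order concrete, the following Python sketch decodes one formatted record using a field list. It is only an illustration: the `FIELDS` table is a hypothetical stand-in for a <FieldList> with an integer, a string, and a float field, the converters are an assumed mapping from the HDF5 member types, and the record text follows the style of the compound-data example in this document.

```python
import csv

# Field order as a <FieldList> would give it; converters are assumptions.
FIELDS = [("I", int), ("S", str), ("F", float)]

def parse_record(block):
    """Split one record on commas, honoring double-quoted strings."""
    row = next(csv.reader([block], skipinitialspace=True))
    if len(row) != len(FIELDS):
        raise ValueError("record does not match the field list")
    # Pair each value with its field name, in the order the XML declares.
    return {name: conv(text) for (name, conv), text in zip(FIELDS, row)}

record = parse_record('0, " abc", 3.2')
print(record)
```

Because the values are matched to fields by position, reordering the <Field> entries changes how the same block of text is interpreted.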


Annotated Examples

In this section, we give annotated examples of how data can be represented. The example DTD is here. Note that this is a partial DTD, which is intended to be included as part of the HDF5 DTD. Some of the attributes have 'IDREFS' that point to the HDF5 DTD, which is not present in this example.


Scalar


TBD


Array

In this example, the HDF5 file contains a 3 by 2 array of 32-bit integers, with the values:

       0    1
  0   00   01
  1   10   11
  2   20   21

This data can be described in several ways in XML. There are two dimensions of the representation:

  1. What level of "tagging" is done, i.e., are there tagged elements for each individual data element, for a "row", for the whole data array, or for blocks of data?
  2. How the data is described (pointer, formatted text, etc.)
To give a flavor of these options, here are some examples of how the 3 by 2 array above could be represented.

Case 1: All elements tagged individually, values included as text

It is possible to tag every element of the array. The XML is voluminous, but XML-based tools can locate, index, and read each value. Here is a sample of how this would look. The entire example file is here.

Example 1. Each number tagged.

<?xml version="1.0" encoding="UTF-8"?>
<!-- 3 by 2 Array of Integers -->
<!-- each element tagged, values in the file as formatted text -->
<Data>
 <Array NDIMS="2">
 <DimensionList NDIMS="2">
<!-- The data in XML is laid out with Dim0 on the outside -->
 <dimension size="3" DIMID="Dim0"/>
 <dimension size="2" DIMID="Dim1"/>
 </DimensionList>
 <ArrayData Dim="Dim0" index="0">
<!-- Begin data for first index of outer dimension -->
 <ArrayData Dim="Dim1" index="0">
<!-- Begin data for first index of inner dimension -->
 <DataFromFile>
 <HowToRead>
 <FormattedText nelems="1" formatType="C" formatDescriptor="%d"/>
 </HowToRead>
<!-- value [0][0] is 00 -->
 <BlockOfData>00</BlockOfData>
 </DataFromFile>
 </ArrayData>
 <ArrayData Dim="Dim1" index="1">
<!-- Begin data for second index of inner dimension -->
 <DataFromFile>
 <HowToRead>
 <FormattedText nelems="1" formatType="C" formatDescriptor="%d"/>
 </HowToRead>
<!-- value [0][1] is 01 -->
 <BlockOfData>01</BlockOfData>
 </DataFromFile>
 </ArrayData>
 </ArrayData>
<!-- etc. ... -->


Case 2: Elements tagged by "row", values included as text

An alternative form of the same representation gives each data block a whole row of data. This reduces the size of the XML, but XML tools can then address only whole rows, not individual elements within a row.

The example looks like this. (The whole file is here.)

Example 2. Each row tagged.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Data SYSTEM "file://localhost/D:/data/mcgrath/XML/XDF/data-2.dtd" >
<!-- 3 by 2 Array of Integers -->
<!-- each row tagged, data is formatted text -->
<Data>
 <Array NDIMS="2">
 <DimensionList NDIMS="2">
 <dimension size="3" DIMID="Dim0"/>
 <dimension size="2" DIMID="Dim1"/>
 </DimensionList>
 <ArrayData Dim="Dim0" index="0">
<!-- Begin data for first index of outer dimension -->
 <DataFromFile>
 <HowToRead>
 <FormattedText nelems="2" formatType="C" delimiter="," formatDescriptor="%d"/>
 </HowToRead>
<!-- values of row [0] are 00, 01 -->
 <BlockOfData>00,01</BlockOfData>
 </DataFromFile>
 </ArrayData>
 <ArrayData Dim="Dim0" index="1">
<!-- Begin data for second index of outer dimension -->
 <DataFromFile>
 <HowToRead>
 <FormattedText nelems="2" formatType="C" delimiter="," formatDescriptor="%d"/>
 </HowToRead>
<!-- values of row [1] are 10, 11 -->
 <BlockOfData>10,11</BlockOfData>
 </DataFromFile>
 </ArrayData>
<!-- etc.... -->


Case 3: The whole array included as a single block of text

Another option is to put the whole array in a single data block. This reduces the size of the XML, but XML tools can then address only the whole array.

The example looks like this. (The whole file is here.)

Example 3. All the data in a single block (complete XML)

<?xml version="1.0" encoding="UTF-8"?>
<!-- 3 by 2 Array of Integers -->
<!-- the whole array in one list of values -->
<Data>
 <Array NDIMS="2">
 <DimensionList NDIMS="2">
<!-- The data in XML is laid out with Dim0 on the outside -->
<!-- These records define how to read the data values below.... -->
 <dimension size="3" DIMID="Dim0"/>
 <dimension size="2" DIMID="Dim1"/>
 </DimensionList>
 <ArrayData Dim="Dim0" index="0">
 <DataFromFile>
 <HowToRead>
 <FormattedText nelems="6" formatType="C" delimiter="," formatDescriptor="%d"/>
 </HowToRead>
<!-- all the data -->
 <BlockOfData>00,01,10,11,20,21</BlockOfData>
 </DataFromFile>
 </ArrayData>
 </Array>
</Data>
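
A reader can recover the two-dimensional layout of such a block from the <DimensionList>. The following Python sketch parses a copy of Example 3 and rebuilds the rows, assuming that the first dimension listed is the outermost one, as the comment in the example states. Variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

DOC = """<Data>
 <Array NDIMS="2">
  <DimensionList NDIMS="2">
   <dimension size="3" DIMID="Dim0"/>
   <dimension size="2" DIMID="Dim1"/>
  </DimensionList>
  <ArrayData Dim="Dim0" index="0">
   <DataFromFile>
    <HowToRead>
     <FormattedText nelems="6" formatType="C" delimiter="," formatDescriptor="%d"/>
    </HowToRead>
    <BlockOfData>00,01,10,11,20,21</BlockOfData>
   </DataFromFile>
  </ArrayData>
 </Array>
</Data>"""

array = ET.fromstring(DOC).find("Array")
sizes = [int(d.get("size")) for d in array.iter("dimension")]
fmt = array.find("ArrayData/DataFromFile/HowToRead/FormattedText")
block = array.findtext("ArrayData/DataFromFile/BlockOfData")

# Split the block into a flat list and check it against the declared sizes.
flat = [int(tok) for tok in block.split(fmt.get("delimiter"))]
if not (len(flat) == int(fmt.get("nelems")) == sizes[0] * sizes[1]):
    raise ValueError("block does not match the declared dimensions")

# Dim0 is outermost, so consecutive groups of sizes[1] values form one row.
ncols = sizes[1]
rows = [flat[i:i + ncols] for i in range(0, len(flat), ncols)]
print(rows)  # [[0, 1], [10, 11], [20, 21]]
```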


Case 4: The whole array in a single block, binary encoded

All of the above cases could have included the data in the form of a text encoding for binary, e.g., binhex. For large blocks of data, this is probably more efficient (both space and time) than formatted text. Of course, the data in the XML is not human readable.

Here is the example from Case 3, using binhex. The file is here. (The example does not use real binhex; random characters are used to suggest what the binhex would look like.)

Example 4. The whole array in a single block of (fake) binhex

<?xml version="1.0" encoding="UTF-8"?>
<!-- 3 by 2 Array of Integers -->
<!-- whole array in one binhex -->
<Data>
 <Array NDIMS="2">
 <DimensionList NDIMS="2">
<!-- The data in XML is laid out with Dim0 on the outside -->
<!-- This defines the order of the elements in the binhex block -->
 <dimension size="3" DIMID="Dim0"/>
 <dimension size="2" DIMID="Dim1"/>
 </DimensionList>
 <ArrayData Dim="Dim0" index="0">
 <DataFromFile>
 <HowToRead>
 <BinaryEncoded nelems="6" Encoding="binhex" binsize="24"/>
 </HowToRead>
<!-- all the values of the array, binhex-ed -->
 <BlockOfData>XYZQabchyasdfjsalxldsafs</BlockOfData>
 </DataFromFile>
 </ArrayData>
 </Array>
</Data>

Case 5: Point to the data in the HDF5 file

For some applications, it is better to have a pointer to the data in the HDF5 file, rather than the data itself. This is very compact, but the XML application must call the library to obtain any data values. However, the XML contains enough information to be able to read only the parts of the file needed, or to send a request to a remote server for just the data needed.

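To do this, the <HowToRead> element should contain a <NativeHDF5> element, which should include the location of the HDF5 file (URI), the data object to read (dataobject), and which elements to read (start, nelems, etc.), as described for the <NativeHDF5> element above.

The details of this element are not final. Here is an example to suggest how Case 3/4 above would look. The file is here.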

Example 5. Point to the data in the HDF5 file.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Data SYSTEM "file://localhost/D:/data/mcgrath/XML/XDF/data-2.dtd" >
<!-- 3 by 2 Array of Integers -->
<!-- read the whole thing from the file-->
<Data>
 <Array NDIMS="2">
 <DimensionList NDIMS="2">
 <!-- The dimensions point to DataSpace information -->
 <dimension size="3" DIMID="Dim0"/>
 <dimension size="2" DIMID="Dim1"/>
 </DimensionList>
 <ArrayData Dim="Dim0" index="0">
 <DataFromFile>
 <HowToRead>
 <!-- the URL, dataset, and which elements to read -->
 <NativeHDF5 URI="thefile.h5" dataobject="DatasetA"
start="0,0" nelems="3,2" />
 </HowToRead>
 </DataFromFile>
 </ArrayData>
 </Array>
</Data>
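
Since the details of <NativeHDF5> are not final, the following Python sketch shows only one possible reading of the "start" and "nelems" attributes: a per-dimension offset and a per-dimension count. That interpretation, the function name `selection`, and the in-memory list standing in for DatasetA are all assumptions made for illustration.

```python
def selection(start, nelems):
    """Turn 'start' and 'nelems' attribute strings into one slice per dimension."""
    starts = [int(s) for s in start.split(",")]
    counts = [int(n) for n in nelems.split(",")]
    if len(starts) != len(counts):
        raise ValueError("start and nelems must have the same rank")
    return tuple(slice(s, s + n) for s, n in zip(starts, counts))

# In-memory stand-in for the dataset; a real reader would open the HDF5 file.
dataset = [[0, 1], [10, 11], [20, 21]]

# The attributes from Example 5 select the whole 3 by 2 array.
rows, cols = selection("0,0", "3,2")
whole = [row[cols] for row in dataset[rows]]

# A smaller request: two rows starting at row 1, first column only.
rows, cols = selection("1,0", "2,1")
part = [row[cols] for row in dataset[rows]]

print(whole)
print(part)
```

Under this reading, the XML carries enough information to request only the needed part of the data, which is the property the text above relies on.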

Table (Compound Data)

Compound data is represented similarly to array data, except that the order and types of the components of the data must be defined. As in the Array cases, the actual values can be tagged individually, or a group of elements can be included in a single block. The data can be represented in any of the forms above, or pointed to.

In this example, the HDF5 file contains a table that is a one-dimensional array of compound data. Each element has three components:

Elements of the Compound data type
  Name   Type
  I      32-bit integer
  S      String (fixed length 10)
  F      32-bit float

The array has 6 elements.

Case 1: Each element included individually as formatted text

As in the examples of array data, it is possible for the XML to tag each element of the table. In the current design, there is no provision for tagging individual components of the compound data.

Here is an example of the table described in XML, with each element tagged. The XML file is here.

Example 6. Compound Data

<?xml version="1.0" encoding="UTF-8"?>
<!-- 1D table, 6 records, type (int 32, String[10], float 32) -->
<!-- each individual element (record) in the XML -->
<Data>
 <Table NDIMS="1" NFIELDS="3" Type="HDF5CompoundDataType">
 <DimensionList NDIMS="1">
 <dimension size="6" DIMID="RecNo"/>
 </DimensionList>
 <FieldList NFIELDS="3">
<!-- This defines the order of the elements in the XML -->
<!-- Each Field points to the HDF5 datatype in the XML -->
<!-- and gives the field name -->
<!-- This information should be sufficient to determine the type of -->
<!-- the element -->
 <Field FieldID="Int" TypeRef="HDF5CompoundDataType" FieldName="I"/>
 <Field FieldID="String10" TypeRef="HDF5CompoundDataType" FieldName="S"/>
 <Field FieldID="Float" TypeRef="HDF5CompoundDataType" FieldName="F"/>
 </FieldList>
 <TableData Dim="RecNo" index="0">
 <DataFromFile>
 <HowToRead>
<!-- Describe the format of the data -->
 <FormattedText nelems="1" formatType="C" formatDescriptor="%d, %10s, %f"/>
 </HowToRead>
<!-- The first record -->
 <BlockOfData>0, " abc", 3.2</BlockOfData>
 </DataFromFile>
 </TableData>
 <TableData Dim="RecNo" index="1">
 <DataFromFile>
 <HowToRead>
 <FormattedText nelems="1" formatType="C" formatDescriptor="%d, %10s, %f"/>
 </HowToRead>
 <BlockOfData>1, " xyzq", 7.00</BlockOfData>
 </DataFromFile>
 </TableData>
<!-- etc.... -->