Robert E. McGrath
March 17, 2000
The DTD for HDF5 should support four alternatives for describing
or including data from an HDF5 file in the XML description:
When included in the XML, the data itself is represented as one or more blocks of characters. The HDF5 DTD provides a detailed description of the Dataspace and the Datatype of the elements in the space, so the markup for the data need not repeat this information (it can point to the relevant XML objects). The representation of the data has three primary functions:
Since HDF5 data is likely to consist of multidimensional arrays, the DTD
should provide flexible ways to express the structure of the array, the position
of the elements described, and the data itself. It should be possible
to describe individual elements of the arrays, if desired. It should
also be possible to include descriptions of many elements in a single block
of characters, up to the entire contents of an array. The XML should
make clear what data is represented by each block, and how it is intended
to be interpreted.
The data elements can be of three types:
The XML description should be able to describe data of all types.
Basic Design
The design is a single XML element, tentatively called "<Data>".
The <Data> element has one of three possible forms:
| <Scalar> | A single number or string. |
| <Array> | A (multidimensional) array of numbers or strings. |
| <Table> | A (multidimensional) array of compound data types (i.e., records) |
The <Scalar> element represents a data element that
is a single number or string. The <Scalar> may only be
represented as a formatted string of characters, e.g., "10.8".
The other representations (see below) are not allowed.
An <Array> element has a dimension list followed by one or more blocks
of data. The dimension list is partly redundant with the information
in the Dataspace element, but the order of the dimensions indicates the
layout of the data in the XML file, which need not be identical
to the order in the HDF5 file.
Each block of data is represented as an <ArrayData> element. An <ArrayData> element consists of a single <DataFromFile> element or else one or more nested <ArrayData> elements. Multidimensional arrays can be represented by nested <ArrayData> elements, down to a single data element, or to a contiguous group of elements, which would be described by a single <DataFromFile> element. These two elements form a tree, with the <ArrayData> elements as internal nodes, and <DataFromFile> elements as the leaves.
The <DataFromFile> element consists of two elements:
the <HowToRead> element and an optional <BlockOfData>
element.
| <HowToRead> | How to get and interpret the data. Contains one of <NativeHDF5>, <XML>, <FormattedText>, or <BinaryEncoded> (described below). |
| <BlockOfData> | Characters representing the data values. |
A <DataFromFile> element can be repeated for the same <ArrayData> element. This supports multiple representations of the same data, e.g., both a pointer to the HDF5 file and formatted text.
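The tree structure described above can be walked with any XML parser. The following is a minimal sketch in Python, assuming the tentative element and attribute names used in this note (`ArrayData`, `DataFromFile`, `BlockOfData`, `Dim`, `index`); the helper name `leaves` is hypothetical and not part of the design.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on the nested layout described in this note.
# Element and attribute names are tentative, as is the DTD itself.
SAMPLE = """
<ArrayData Dim="Dim0" index="0">
  <ArrayData Dim="Dim1" index="0">
    <DataFromFile>
      <HowToRead>
        <FormattedText nelems="1" formatType="C" formatDescriptor="%d"/>
      </HowToRead>
      <BlockOfData>00</BlockOfData>
    </DataFromFile>
  </ArrayData>
  <ArrayData Dim="Dim1" index="1">
    <DataFromFile>
      <HowToRead>
        <FormattedText nelems="1" formatType="C" formatDescriptor="%d"/>
      </HowToRead>
      <BlockOfData>01</BlockOfData>
    </DataFromFile>
  </ArrayData>
</ArrayData>
"""


def leaves(node, path=()):
    """Yield (index_path, DataFromFile element) for every leaf below node.

    index_path is a tuple of (dimension id, index) pairs collected from the
    enclosing ArrayData elements, outermost first.
    """
    here = path + ((node.get("Dim"), int(node.get("index"))),)
    for child in node:
        if child.tag == "ArrayData":
            # Internal node: descend one dimension further.
            yield from leaves(child, here)
        elif child.tag == "DataFromFile":
            # Leaf: one representation of the data at this position.
            yield here, child


root = ET.fromstring(SAMPLE)
found = [(path, leaf.findtext("BlockOfData")) for path, leaf in leaves(root)]
for path, text in found:
    print(path, text)
```

Because a leaf may be repeated for the same position, a reader built this way would see every available representation and could pick the one it knows how to interpret.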
The <NativeHDF5> element:
The <NativeHDF5> element describes how to obtain the data described
by the enclosing <ArrayData> element from a specific HDF5 file.
This description is not completely defined yet, but will include:
| URI (probably xlink:) | The URL and other location information for the HDF5 file. |
| dataobject | The dataset or attribute to read. This will be one or more valid path names sufficient to open the object for reading data. |
| start, nelems, etc. | What elements to read from the object. |
The <XML> element:
The <XML> element points to an external XML file which describes the data. This file might be in any legal language, such as XDF or netCDF, so long as it can be interpreted to provide appropriate data for the enclosing <ArrayData> element.
The <FormattedText> element:
The <FormattedText> element describes how to read the enclosed text. The attributes include the format type (e.g., "C" or "FORTRAN"), the delimiter, the number of elements, and a format specification. The following <BlockOfData> must be interpreted according to the rules of the designated format scheme.
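As an illustration of how a reader might apply these attributes, here is a small Python sketch. It assumes the "C" format type and handles only the `%d`, `%f`, and `%s` conversions; the function name `parse_formatted_text` and its parameters are hypothetical, and a real reader would need a fuller format interpreter.

```python
def parse_formatted_text(block, nelems, format_descriptor, delimiter=None):
    """Convert the text of a <BlockOfData> into a list of Python values.

    Only a small subset of C conversions is recognized; this is an
    illustrative assumption, not a complete scanf implementation.
    """
    converters = {"%d": int, "%f": float, "%s": str}
    convert = converters[format_descriptor]
    # Without a delimiter attribute, fall back to splitting on whitespace.
    fields = block.split(delimiter) if delimiter else block.split()
    if len(fields) != nelems:
        raise ValueError(f"expected {nelems} values, found {len(fields)}")
    return [convert(field.strip()) for field in fields]


row = parse_formatted_text("10,11", nelems=2, format_descriptor="%d", delimiter=",")
print(row)  # [10, 11]
```

Checking the element count against `nelems` gives an early error when the block and its description disagree.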
The <BinaryEncoded> element:
The <BinaryEncoded> element describes a text encoding of
binary data, such as uuencode or binhex. The attributes denote the
encoding, the number of elements encoded, and the size of the encoded text.
The following <BlockOfData> must be interpreted with the appropriate
algorithm to decode the binary values.
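The round trip can be sketched in Python as follows. This sketch uses uuencode (via the standard `binascii` module) rather than binhex, and assumes the values are 32-bit little-endian integers; the element size, the byte order, and the helper name `decode_block` are assumptions for illustration, since this note does not fix them.

```python
import binascii
import struct

values = [0, 1, 10, 11, 20, 21]

# Writer side: pack six 32-bit little-endian integers (24 bytes) and
# uuencode them as one line of text. backtick=True encodes zero bytes as
# '`' instead of spaces, so the text has no significant whitespace.
raw_bytes = struct.pack("<6i", *values)
block_of_data = binascii.b2a_uu(raw_bytes, backtick=True).decode("ascii").rstrip("\n")
# Note: uuencoded text can contain '<' or '&', which must be escaped
# (or wrapped in CDATA) when written into XML.


def decode_block(text, nelems):
    """Decode one uuencoded line into nelems 32-bit integers (assumed layout)."""
    decoded = binascii.a2b_uu(text)
    return list(struct.unpack(f"<{nelems}i", decoded))


print(decode_block(block_of_data, 6))  # [0, 1, 10, 11, 20, 21]
```

A single uuencoded line holds at most 45 bytes, so larger blocks would have to be encoded and decoded line by line.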
The <Table> element is used to describe arrays of compound
data types. The <Table> element is similar to the <Array>
element, with the addition of a <FieldList> element,
which describes the order of the components of the compound type.
The <DimensionList> of a <Table> is the same as for an <Array> element.
The <Field> element describes one component, with a pointer
to the HDF5 Datatype element that describes the compound data type, and
the name of the field in the type. The order of the <Field>
records defines the order of the data elements in the XML, which need not
be identical to the order in the HDF5 type.
| FieldID | The XML ID for this field |
| TypeRef | The XML IDREF for the HDF5 Compound Datatype |
| FieldName | The "name" of the field in the HDF5 Compound Datatype. |
The <DataFromFile> element of the <Table> is the same as for an <Array> element, except that each element must contain a description of one or more entire compound data items, rather than numbers or strings.
TBD
|   | 0 | 1 |
| 0 | 00 | 01 |
| 1 | 10 | 11 |
| 2 | 20 | 21 |
This data can be described in several ways in XML. There are two dimensions of the representation:
Case 1: All elements tagged individually, values included as text
It is possible to tag every element of the array. The XML is voluminous, but XML-based tools can locate, index, and read each value individually. Here is a sample of how this would look.
Example 1. Each number tagged.
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- 3 by 2 Array of Integers -->
    <!-- each element tagged, values in the file as formatted text -->
    <Data>
      <Array NDIMS="2">
        <DimensionList NDIMS="2">
          <!-- The data in XML is laid out with Dim0 on the outside -->
          <dimension size="3" DIMID="Dim0"/>
          <dimension size="2" DIMID="Dim1"/>
        </DimensionList>
        <ArrayData Dim="Dim0" index="0">
          <!-- Begin data for first index of outer dimension -->
          <ArrayData Dim="Dim1" index="0">
            <!-- Begin data for first index of inner dimension -->
            <DataFromFile>
              <HowToRead>
                <FormattedText nelems="1" formatType="C" formatDescriptor="%d"/>
              </HowToRead>
              <!-- value [0][0] is 00 -->
              <BlockOfData>00</BlockOfData>
            </DataFromFile>
          </ArrayData>
          <ArrayData Dim="Dim1" index="1">
            <!-- Begin data for second index of inner dimension -->
            <DataFromFile>
              <HowToRead>
                <FormattedText nelems="1" formatType="C" formatDescriptor="%d"/>
              </HowToRead>
              <!-- value [0][1] is 01 -->
              <BlockOfData>01</BlockOfData>
            </DataFromFile>
          </ArrayData>
        </ArrayData>
        <!-- etc. ... -->
An alternative form of the same representation gives each data block a whole row of data. This reduces the size of the XML, but XML tools can then address only whole rows, not elements within a row.
The example looks like this.
Example 2. Each row tagged.
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE Data SYSTEM "file://localhost/D:/data/mcgrath/XML/XDF/data-2.dtd" >
    <!-- 3 by 2 Array of Integers -->
    <!-- each row tagged, data is formatted text -->
    <Data>
      <Array NDIMS="2">
        <DimensionList NDIMS="2">
          <dimension size="3" DIMID="Dim0"/>
          <dimension size="2" DIMID="Dim1"/>
        </DimensionList>
        <ArrayData Dim="Dim0" index="0">
          <!-- Begin data for first index of outer dimension -->
          <DataFromFile>
            <HowToRead>
              <FormattedText nelems="2" formatType="C" delimiter="," formatDescriptor="%d"/>
            </HowToRead>
            <!-- values of row [0] are 00, 01 -->
            <BlockOfData>00,01</BlockOfData>
          </DataFromFile>
        </ArrayData>
        <ArrayData Dim="Dim0" index="1">
          <!-- Begin data for second index of outer dimension -->
          <DataFromFile>
            <HowToRead>
              <FormattedText nelems="2" formatType="C" delimiter="," formatDescriptor="%d"/>
            </HowToRead>
            <!-- values of row [1] are 10, 11 -->
            <BlockOfData>10,11</BlockOfData>
          </DataFromFile>
        </ArrayData>
        <!-- etc.... -->
Another option is to put the whole array in a single data block. This reduces the size of the XML further, but XML tools can then address only the whole array.
The example looks like this.
Example 3. All the data in a single block (complete XML)
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- 3 by 2 Array of Integers -->
    <!-- the whole array in one list of values -->
    <Data>
      <Array NDIMS="2">
        <DimensionList NDIMS="2">
          <!-- The data in XML is laid out with Dim0 on the outside -->
          <!-- These records define how to read the data values below.... -->
          <dimension size="3" DIMID="Dim0"/>
          <dimension size="2" DIMID="Dim1"/>
        </DimensionList>
        <ArrayData Dim="Dim0" index="0">
          <DataFromFile>
            <HowToRead>
              <FormattedText nelems="6" formatType="C" delimiter="," formatDescriptor="%d"/>
            </HowToRead>
            <!-- all the data -->
            <BlockOfData>00,01,10,11,20,21</BlockOfData>
          </DataFromFile>
        </ArrayData>
      </Array>
    </Data>
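To show how the dimension list governs the interpretation of a single block, here is a minimal Python sketch that reads the XML of Example 3 and rebuilds the 3 by 2 array. It assumes integer data (`%d`) and that the first dimension listed is the outermost; the function name `read_array` is hypothetical.

```python
import xml.etree.ElementTree as ET

EXAMPLE_3 = """<Data>
  <Array NDIMS="2">
    <DimensionList NDIMS="2">
      <dimension size="3" DIMID="Dim0"/>
      <dimension size="2" DIMID="Dim1"/>
    </DimensionList>
    <ArrayData Dim="Dim0" index="0">
      <DataFromFile>
        <HowToRead>
          <FormattedText nelems="6" formatType="C" delimiter="," formatDescriptor="%d"/>
        </HowToRead>
        <BlockOfData>00,01,10,11,20,21</BlockOfData>
      </DataFromFile>
    </ArrayData>
  </Array>
</Data>"""


def read_array(xml_text):
    """Return the array as nested lists, outermost dimension first."""
    array = ET.fromstring(xml_text).find("Array")
    sizes = [int(dim.get("size")) for dim in array.find("DimensionList")]
    leaf = array.find("ArrayData/DataFromFile")
    fmt = leaf.find("HowToRead/FormattedText")
    # Assumption: every value is an integer written with %d.
    values = [int(v) for v in leaf.findtext("BlockOfData").split(fmt.get("delimiter"))]
    if len(values) != int(fmt.get("nelems")):
        raise ValueError("nelems does not match the number of values in the block")
    # Group the flat list by the innermost dimension first, then outward.
    for size in reversed(sizes[1:]):
        values = [values[i:i + size] for i in range(0, len(values), size)]
    return values


result = read_array(EXAMPLE_3)
print(result)  # [[0, 1], [10, 11], [20, 21]]
```

If the dimensions were listed in the opposite order, the same block of six values would be grouped as two rows of three, which is why the order in the dimension list matters.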
All of the above cases could have included the data as a text encoding of binary, e.g., binhex. For large blocks of data, this is probably more efficient (in both space and time) than formatted text. Of course, the data in the XML is then not human readable.
Here is the example from Case 3, using binhex. (The example does not use real binhex; random characters are used to suggest what the binhex would look like.)
Example 4. The whole array in single block of (fake) binhex
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- 3 by 2 Array of Integers -->
    <!-- whole array in one binhex -->
    <Data>
      <Array NDIMS="2">
        <DimensionList NDIMS="2">
          <!-- The data in XML is laid out with Dim0 on the outside -->
          <!-- This defines the order of the elements in the binhex block -->
          <dimension size="3" DIMID="Dim0"/>
          <dimension size="2" DIMID="Dim1"/>
        </DimensionList>
        <ArrayData Dim="Dim0" index="0">
          <DataFromFile>
            <HowToRead>
              <BinaryEncoded nelems="6" Encoding="binhex" binsize="24"/>
            </HowToRead>
            <!-- all the values of the array, binhex-ed -->
            <BlockOfData>XYZQabchyasdfjsalxldsafs</BlockOfData>
          </DataFromFile>
        </ArrayData>
      </Array>
    </Data>
Case 5. Point to the data in the HDF5 file
For some applications, it is better to have a pointer to the data in the HDF5 file, rather than the data itself. This is very compact, but the XML application must call the library to obtain any data values. However, the XML contains enough information to be able to read only the parts of the file needed, or to send a request to a remote server for just the data needed.
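To do this, the <HowToRead> element should be a <NativeHDF5> element, which should include the location of the HDF5 file, the data object to read, and the elements to read from that object, as described above.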
Example 5. Point to the data in the HDF5 file.
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE Data SYSTEM "file://localhost/D:/data/mcgrath/XML/XDF/data-2.dtd" >
    <!-- 3 by 2 Array of Integers -->
    <!-- read the whole thing from the file -->
    <Data>
      <Array NDIMS="2">
        <DimensionList NDIMS="2">
          <!-- The dimensions point to DataSpace information -->
          <dimension size="3" DIMID="Dim0"/>
          <dimension size="2" DIMID="Dim1"/>
        </DimensionList>
        <ArrayData Dim="Dim0" index="0">
          <DataFromFile>
            <HowToRead>
              <!-- the URL, dataset, and which elements to read -->
              <NativeHDF5 URI="thefile.h5" dataobject="DatasetA" start="0,0" nelems="3,2" />
            </HowToRead>
          </DataFromFile>
        </ArrayData>
      </Array>
    </Data>
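An application could turn the attributes of the <NativeHDF5> element into a read request along the following lines. This Python sketch assumes that `start` is a per-dimension offset and `nelems` a per-dimension count, which this note has not yet defined precisely; the function name `hdf5_selection` is hypothetical.

```python
import xml.etree.ElementTree as ET


def hdf5_selection(native):
    """Return (file location, object path, tuple of slices) for a NativeHDF5 element.

    Assumption: start and nelems are comma-separated lists with one entry
    per dimension (offset and count respectively).
    """
    start = [int(s) for s in native.get("start").split(",")]
    count = [int(n) for n in native.get("nelems").split(",")]
    if len(start) != len(count):
        raise ValueError("start and nelems must have the same rank")
    region = tuple(slice(s, s + n) for s, n in zip(start, count))
    return native.get("URI"), native.get("dataobject"), region


native = ET.fromstring(
    '<NativeHDF5 URI="thefile.h5" dataobject="DatasetA" start="0,0" nelems="3,2"/>'
)
uri, path, region = hdf5_selection(native)
print(uri, path, region)

# With the third-party h5py package installed, the read itself could then be:
#     with h5py.File(uri, "r") as f:
#         values = f[path][region]
```

Only the selected region is requested, which matches the goal of reading just the parts of the file that are needed.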
In this example, the HDF5 file contains a table that is a one-dimensional
array of compound data. Each element has three components:
| Name | Type |
| I | 32-bit integer |
| S | String (fixed length 10) |
| F | 32-bit float |
The array has 6 elements.
Case 1: Each element included individually as formatted text
As in the examples of array data, it is possible for the XML to tag each element of the table. In the current design, there is no provision for tagging individual components of the compound data.
Here is an example of the table described in XML, with each element tagged.
Example 6. Compound Data
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- 1D table, 6 records, type (int 32, String[10], float 32) -->
    <!-- each individual element (record) in the XML -->
    <Data>
      <Table NDIMS="1" NFIELDS="3" Type="HDF5CompoundDataType">
        <DimensionList NDIMS="1">
          <dimension size="6" DIMID="RecNo"/>
        </DimensionList>
        <FieldList NFIELDS="3">
          <!-- This defines the order of the elements in the XML -->
          <!-- Each Field points to the HDF5 datatype in the XML -->
          <!-- and gives the field name -->
          <!-- This information should be sufficient to determine the type of -->
          <!-- the element -->
          <Field FieldID="Int" TypeRef="HDF5CompoundDataType" FieldName="I"/>
          <Field FieldID="String10" TypeRef="HDF5CompoundDataType" FieldName="S"/>
          <Field FieldID="Float" TypeRef="HDF5CompoundDataType" FieldName="F"/>
        </FieldList>
        <TableData Dim="RecNo" index="0">
          <DataFromFile>
            <HowToRead>
              <!-- Describe the format of the data -->
              <FormattedText nelems="1" formatType="C" formatDescriptor="%d, %10s, %f"/>
            </HowToRead>
            <!-- The first record -->
            <BlockOfData>0, " abc", 3.2</BlockOfData>
          </DataFromFile>
        </TableData>
        <TableData Dim="RecNo" index="1">
          <DataFromFile>
            <HowToRead>
              <FormattedText nelems="1" formatType="C" formatDescriptor="%d, %10s, %f"/>
            </HowToRead>
            <BlockOfData>1, " xyzq", 7.00</BlockOfData>
          </DataFromFile>
        </TableData>
        <!-- etc.... -->
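A record block of this form could be split into its components as in the following Python sketch. It assumes the fields are comma-separated, that strings are enclosed in double quotes as in the example, and that only `%d` and `%f` need numeric conversion; the names `parse_record` and `FIELD_TYPES` are hypothetical.

```python
import csv

# Conversions recognized in the format descriptor; any other specifier
# (such as %10s) is kept as a string. This subset is an assumption.
FIELD_TYPES = {"%d": int, "%f": float}


def parse_record(block, format_descriptor):
    """Split one compound record into a tuple of typed components."""
    specs = [spec.strip() for spec in format_descriptor.split(",")]
    # csv handles the quoted string field; skipinitialspace drops the
    # blank that follows each comma.
    fields = next(csv.reader([block], skipinitialspace=True))
    if len(fields) != len(specs):
        raise ValueError("record does not match the format descriptor")
    return tuple(FIELD_TYPES.get(spec, str)(field) for spec, field in zip(specs, fields))


record = parse_record('0, " abc", 3.2', "%d, %10s, %f")
print(record)  # (0, ' abc', 3.2)
```

The order of the resulting tuple follows the <FieldList>, so each component can be matched to its field name by position.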