h5dump XML support
Design Notes

Robert E. McGrath
December 14, 2000

Abstract

This document describes the details of the changes to the h5dump utility to enable XML output. An earlier document, "Adding XML output to the h5dump Utility", described the goals and approach.

This document is intended to serve as a review before the code is committed to CVS.

Contents

1. Changes to the main Program
1.1. New command line options
1.2. Initialize reference path table (XML only)
1.3. Alternative head and tail
1.4. dump_bb
1.5. dump_all == 0
1.6. dump_all == 1
2. Dump functions
2.1. Implementation of the main dump functions
2.1.1. xml_dump_group
2.1.2. xml_dump_dataset
2.1.3. xml_dump_data
2.1.4. xml_dump_attr
2.1.5. xml_dump_datatype and xml_dump_named_datatype
2.1.6. xml_dump_dataspace
2.2. Replacement output functions
2.2.1. xml_print_refs
2.2.2. xml_print_strs
2.2.3. xml_print_enum
2.2.4. xml_print_datatype
3. Handling of object references
4. Miscellaneous issues
4.1. Options not supported for XML
4.2. Options not supported for standard output
4.3. The dataformat and dump_header tables
4.4. Forward References
5. Changes to the DTD
6. Metrics and Code

1. Changes to the main Program

1.1. New command line options

Two new command line options are processed, '-xml' and '-dtd'. Also, some options are disallowed if XML is selected. See below, sections 4.1 and 4.2.

If '-xml' is selected, the dump_function_table is set to the xml_function_table; otherwise the standard ddl_function_table is used. See below, section 2.

Usage: h5dump -xml file.h5
or
 h5dump -xml -dtd alternative.dtd file.h5

1.2. Initialize reference path table (XML only)

If XML is selected, a table of object reference/path names is constructed. This table is not built when standard output is requested. See below, sections 2.2.1 and 3.

1.3. Alternative head and tail

If XML is selected, the initial and final markers (file_begin, etc.) are replaced with appropriate XML.

1.4. dump_bb

Dump_bb is disabled if XML is selected.

1.5. dump_all == 0

If XML is selected, the whole file must be dumped. All the partial dump options are disabled when XML is selected.

1.6. dump_all == 1

The call to 'dump_group(gid, "/")' is directed through the dump_function_table->dump_group_function.

2. Dump functions

The XML output is implemented in an alternative set of dump functions. The main code is changed to call the dump functions via a table, which is set for either the standard functions or the XML versions. The standard functions are unchanged except for how they are invoked.

The functions are:

Dump functions re-implemented for XML

Standard DDL Output    XML Output                 Called as dump_function_table->
dump_group             xml_dump_group             dump_group_function()
dump_named_datatype    xml_dump_named_datatype    dump_named_datatype_function()
dump_dataset           xml_dump_dataset           dump_dataset_function()
dump_dataspace         xml_dump_dataspace         dump_dataspace_function()
dump_datatype          xml_dump_datatype          dump_datatype_function()
dump_attr              xml_dump_attr              dump_attr_function()
dump_data              xml_dump_data              dump_data_function()
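The dispatch through the table can be sketched as follows. This is a minimal illustration, not the h5dump source: the single `dump_fn` signature, the `dump_functions` type name, the helpers `select_dump_functions` and `dump_root`, and the placeholder output text are all assumptions, and only two of the seven slots are shown.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified, hypothetical signature: the real dump functions take HDF5
 * identifiers and differ per object kind. */
typedef void (*dump_fn)(FILE *out, const char *name);

/* One slot per dump function; only two of the seven are shown. */
typedef struct dump_functions_t {
    dump_fn dump_group_function;
    dump_fn dump_dataset_function;
} dump_functions;

/* Placeholder bodies: the output text is illustrative only. */
static void dump_group(FILE *out, const char *name)       { fprintf(out, "ddl group %s\n", name); }
static void dump_dataset(FILE *out, const char *name)     { fprintf(out, "ddl dataset %s\n", name); }
static void xml_dump_group(FILE *out, const char *name)   { fprintf(out, "xml group %s\n", name); }
static void xml_dump_dataset(FILE *out, const char *name) { fprintf(out, "xml dataset %s\n", name); }

static const dump_functions ddl_function_table = { dump_group, dump_dataset };
static const dump_functions xml_function_table = { xml_dump_group, xml_dump_dataset };

/* Choose the table once, after option parsing. */
static const dump_functions *select_dump_functions(int xml_selected)
{
    return xml_selected ? &xml_function_table : &ddl_function_table;
}

/* Callers go through the table and never name a concrete dump function. */
static void dump_root(const dump_functions *dump_function_table, FILE *out)
{
    assert(dump_function_table != NULL && out != NULL);
    dump_function_table->dump_group_function(out, "/");
}
```

With this arrangement, `dump_root(select_dump_functions(1), stdout)` would reach xml_dump_group, while passing 0 would reach the unchanged standard dump_group.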

2.1. Implementation of the main dump functions


The main dump functions (e.g., xml_dump_group) have identical interfaces and similar semantics to the standard functions. However, the order of output and other details are different.

2.1.1. xml_dump_group

Sort the members by type: XML is a linear file format, and XML parsers may have difficulty with forward references of any kind: hardlinks, references to shared datatypes, and object references. The h5dump standard output puts the objects in the order returned by the library, which is essentially alphabetical by their name. This order sometimes results in 'forward references' simply due to object names.

To reduce one case of this, the xml_dump_group sorts the members of the group by type, outputting potential targets of references first. The xml_dump_group does the following:

original: dump all the objects in library order

revised:

dump all the H5_TYPE
then
dump all the H5_DATASET
then
dump all the H5_LINK
then
dump all the H5_GROUP

Note that this applies only to the order of the objects in the XML output. Nothing is changed for the standard output, and the information is otherwise the same.
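The revised ordering can be sketched as one pass over the group members per object type. The `member` record, the `member_kind` enum (a stand-in for the library's type codes), and the function name `order_members_for_xml` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the object types named above. */
typedef enum { MEMBER_TYPE, MEMBER_DATASET, MEMBER_LINK, MEMBER_GROUP } member_kind;

typedef struct {
    const char *name;
    member_kind kind;
} member;

/* Output order: potential targets of references come first. */
static const member_kind xml_order[] = {
    MEMBER_TYPE, MEMBER_DATASET, MEMBER_LINK, MEMBER_GROUP
};

/* Fills out[] with pointers to the members in XML output order and returns
 * the number written. Within one type, library order is preserved.
 * out must have room for n pointers. */
static size_t order_members_for_xml(const member *in, size_t n, const member **out)
{
    size_t pass, i, count = 0;

    assert((in != NULL && out != NULL) || n == 0);
    for (pass = 0; pass < sizeof xml_order / sizeof xml_order[0]; pass++)
        for (i = 0; i < n; i++)
            if (in[i].kind == xml_order[pass])
                out[count++] = &in[i];
    return count;
}
```

A real implementation would call the matching dump function inside the inner loop instead of collecting pointers; collecting them here just makes the resulting order easy to inspect.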

2.1.2. xml_dump_dataset

2.1.3. xml_dump_data

2.1.4. xml_dump_attr

Similar to dump_dataset, but doesn't have chunking, attributes, etc.

2.1.5. xml_dump_datatype and xml_dump_named_datatype

Similar to the standard dump_datatype, but calls xml_print_datatype. (See below, section 2.2.4.)

2.1.6. xml_dump_dataspace

This routine is straightforward: it outputs the dataspace in XML format.

2.2. Replacement output functions

HDF5 names and strings may contain almost any ASCII character. XML has reserved characters, with standard escape sequences, termed 'predefined entities'. The algorithms are documented in the note: Technical Note: Escape Characters for XML/HDF5.
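As an illustration of the escaping step, here is a minimal sketch that substitutes the five predefined XML entities. The function name is borrowed from the pseudocode in this section, but the signature, the helper `entity_for`, and the exact set of characters handled are assumptions; the Technical Note remains the authority on the actual algorithm.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the entity for a reserved character, or NULL if c needs no escape. */
static const char *entity_for(char c)
{
    switch (c) {
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '"':  return "&quot;";
    case '\'': return "&apos;";
    default:   return NULL;
    }
}

/* Returns a newly allocated, escaped copy of s, or NULL if allocation
 * fails. The caller frees the result. */
static char *xml_escape_the_string(const char *s)
{
    size_t len = 0;
    const char *p;
    char *out, *q;

    assert(s != NULL);

    /* First pass: compute the escaped length. */
    for (p = s; *p != '\0'; p++) {
        const char *e = entity_for(*p);
        len += e ? strlen(e) : 1;
    }

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, substituting entities. */
    for (p = s, q = out; *p != '\0'; p++) {
        const char *e = entity_for(*p);
        if (e != NULL) {
            size_t n = strlen(e);
            memcpy(q, e, n);
            q += n;
        } else {
            *q++ = *p;
        }
    }
    *q = '\0';
    return out;
}
```

Under these assumptions, the name `a<b & "c"` would be written as `a&lt;b &amp; &quot;c&quot;`.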

2.2.1. xml_print_refs

Print out object references as a full path (with some characters escaped). In pseudocode:

 for each obj_ref
 do
 char * apath = xml_lookup_ref_path( obj_ref );
 printf("\"%s\"\n", xml_escape_the_string( apath ));
 done


2.2.2. xml_print_strs

Print out strings with some characters escaped, and suppressing NULL padding. In pseudocode:

 for each string str
 do
 /* printing stops at the first NULL, so padding is not output */
 printf("\"%s\"\n", xml_escape_the_string( str ));
 done
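Suppressing the NULL padding of a fixed-length string can be sketched as below. The helper names `unpadded_length` and `print_unpadded` are hypothetical, and escaping is left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Length of a fixed-size string field once trailing NUL padding is dropped. */
static size_t unpadded_length(const char *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    while (size > 0 && buf[size - 1] == '\0')
        size--;
    return size;
}

/* Prints the field in double quotes without its padding. The precision
 * limits how many bytes are read, so the field need not be terminated. */
static void print_unpadded(FILE *out, const char *buf, size_t size)
{
    fprintf(out, "\"%.*s\"\n", (int)unpadded_length(buf, size), buf);
}
```

For a six-byte field holding `abc` followed by three NULs, `unpadded_length` returns 3 and only the three characters are printed.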


2.2.3. xml_print_enum

This is similar to the standard print_enum, except that it emits the corresponding XML elements.

2.2.4. xml_print_datatype

This is similar to the standard print_datatype, except that it emits the corresponding XML.

3. Handling of object references


The standard h5dump prints out object references as numbers. For XML, we need to be able to reconstruct the reference, so the output must indicate what the reference refers to. The current design prints out an absolute path for the object that is the target of the object reference. E.g., a data value that is a reference to the dataset 'palette-1' in '/PALGROUP' will be output in the XML as:

"/PALGROUP/palette-1"

If the object has more than one path to it, one of the paths is used.

To implement this feature, it is necessary to be able to find a path for an object from its reference. This is done with a table of (object_reference, "a full path") records. The table is constructed by walking the tree to visit every object, creating an entry for anything that could be the target of an object reference. This is done once during initialization (but only if XML output is requested). Duplicate paths are not entered in the table: each object has only one entry.

When processing the output, each object reference is looked up, and the path is printed in the XML output. References to nonexistent objects are a fatal error.

This implementation does not support region references, because the XML DTD does not specify them yet.

The reference lookup table is managed by three functions. The table and the functions are used only by the XML output functions. The only change to the standard code is a call in the main program to initialize the table, which is made only if XML output is selected.


Data structures and functions of the ref_path table

Code:
    struct ref_path_entry_table_t {
        hsize_t obj;
        hobj_ref_t *obj_ref;
        char *apath;
        struct ref_path_entry_table_t *next;
    };

    struct ref_path_entry_table_t *ref_path_table;
Description: The table.

Code:
    static herr_t fill_ref_path_table(hid_t group, const char *name, void UNUSED *op_data)
Description: The iterator function, used to initialize the table. Called once at startup.

Code:
    char *lookup_ref_path(hobj_ref_t *ref)
Description: Look up a path for a given object reference.

Code:
    hobj_ref_t *ref_path_table_put(hid_t obj, char *path)
Description: Insert a record. Used only when building the table.
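The put and lookup operations on such a table can be sketched with a simple linked list. This is an illustration under simplifying assumptions, not the h5dump code: `objref_t` is a stand-in for hobj_ref_t, the entry type is reduced to three fields, and the signatures differ from those listed above so that the example compiles without the HDF5 headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long objref_t;   /* hypothetical stand-in for hobj_ref_t */

struct ref_path_entry {
    objref_t ref;
    char *apath;
    struct ref_path_entry *next;
};

static struct ref_path_entry *ref_path_table = NULL;

/* Returns the recorded path for ref, or NULL if ref is not in the table. */
static const char *lookup_ref_path(objref_t ref)
{
    const struct ref_path_entry *e;

    for (e = ref_path_table; e != NULL; e = e->next)
        if (e->ref == ref)
            return e->apath;
    return NULL;
}

/* Records path for ref unless ref already has an entry, so each object
 * keeps the first path found. Returns 0 on success, -1 if out of memory. */
static int ref_path_table_put(objref_t ref, const char *path)
{
    struct ref_path_entry *e;
    size_t n;

    assert(path != NULL);
    if (lookup_ref_path(ref) != NULL)
        return 0;

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    n = strlen(path) + 1;
    e->apath = malloc(n);
    if (e->apath == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->apath, path, n);
    e->ref = ref;
    e->next = ref_path_table;
    ref_path_table = e;
    return 0;
}
```

A linear list keeps the sketch short; lookup cost grows with the number of objects, so a file with many references might call for a hash table instead. A caller would treat a NULL result from `lookup_ref_path` as the fatal "reference to nonexistent object" case.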

4. Miscellaneous issues

4.1. Options not supported for XML

Certain options that may be used with the standard output cannot be used with XML. If '-xml' is selected, using any of these options is a fatal error.

Options that are not available when -xml is selected

Option                Note
-header               This could be supported, but is not implemented at this time.
-bb                   Not implemented for either standard or XML.
-v                    The DTD does not define how to report OIDs. Also, the meaning of an OID in an XML description is not clear.
-o                    Output to another file is not implemented.
-a, -g, -t, -d, -l    The XML DTD defines a description of the whole file. It is not clear how to report selected objects.

4.2. Options not supported for standard output


Options specific to XML do not apply when XML is not selected.

Options that are available only when -xml is selected

Option        Note
-dtd <URI>    The DTD is irrelevant to standard output, so a warning is issued.
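The rules in sections 4.1 and 4.2 can be sketched as a single check over the argument list. The option strings come from the tables above; the `opts_status` enum and the function names are hypothetical. For brevity the sketch does not skip option values, so an object name that happens to equal an option string would be misread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { OPTS_OK, OPTS_WARN_DTD_IGNORED, OPTS_FATAL } opts_status;

/* True if arg is one of the options listed as unavailable with -xml. */
static int is_not_allowed_with_xml(const char *arg)
{
    static const char *const banned[] = {
        "-header", "-bb", "-v", "-o", "-a", "-g", "-t", "-d", "-l"
    };
    size_t i;

    for (i = 0; i < sizeof banned / sizeof banned[0]; i++)
        if (strcmp(arg, banned[i]) == 0)
            return 1;
    return 0;
}

/* Scans all arguments first, so the result does not depend on option order. */
static opts_status check_options(int argc, const char *const argv[])
{
    int xml = 0, dtd = 0, banned = 0, i;

    assert(argv != NULL);
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-xml") == 0)
            xml = 1;
        else if (strcmp(argv[i], "-dtd") == 0)
            dtd = 1;
        else if (is_not_allowed_with_xml(argv[i]))
            banned = 1;
    }
    if (xml && banned)
        return OPTS_FATAL;             /* section 4.1 */
    if (!xml && dtd)
        return OPTS_WARN_DTD_IGNORED;  /* section 4.2 */
    return OPTS_OK;
}
```

The main program would then exit on `OPTS_FATAL` and print a warning but continue on `OPTS_WARN_DTD_IGNORED`.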

4.3. The dataformat and dump_header tables

The original h5dump uses two tables to control the output format: the dataformat table of format strings and the dump_header table of headers. XML versions of both tables are added. When XML is selected, the format strings are set to xml_dataformat and the headers are set to xml_format; otherwise the standard versions are used.

The formats in the xml_dataformat table are mostly set to null or blank, which controls the output from the h5tools routines. For example, data separators are set to " " for XML. This table controls the appearance of the data, and is critical.
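The effect of a format table on the data output can be sketched with a one-field table. Only the name xml_dataformat and its " " separator come from this note; the `data_format` type, its field, the name `std_dataformat`, its ", " separator, and the helper functions are assumptions made for illustration. The real tables have many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical one-field format table. */
typedef struct {
    const char *elmt_separator;   /* written between data elements */
} data_format;

static const data_format std_dataformat = { ", " };  /* assumed value */
static const data_format xml_dataformat = { " " };   /* from the text */

static const data_format *select_dataformat(int xml_selected)
{
    return xml_selected ? &xml_dataformat : &std_dataformat;
}

/* Formats n integers into buf using the separator from fmt. Returns the
 * number of characters written, or -1 if buf is too small. */
static int format_ints(char *buf, size_t size, const int *v, size_t n,
                       const data_format *fmt)
{
    size_t used = 0, i;

    assert(buf != NULL && size > 0 && fmt != NULL);
    buf[0] = '\0';
    for (i = 0; i < n; i++) {
        int w = snprintf(buf + used, size - used, "%s%d",
                         i > 0 ? fmt->elmt_separator : "", v[i]);
        if (w < 0 || (size_t)w >= size - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

The same formatting routine produces either style; only the table passed in changes, which is why the contents of the table matter so much for the XML output.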

The headers in the xml_format table are largely not used. Most XML elements have one or more attributes in them, and are not really compatible with the way this table is used. This table could be eliminated and replaced with hand coded strings.

4.4. Forward References

HDF5 is a random access format. XML is a sequential format. There are many cases where HDF5 objects must 'refer' to other objects that are defined after them in the XML. These occur for hard links, references to shared datatypes, and object references (see section 2.1.1).

There are very few cases where the dumper cannot construct 'legal' XML. However, many XML parsers may have difficulty handling files with 'forward' references. In particular, when trying to construct an HDF5 file from an XML description (e.g., h5gen), it can be difficult to deal with a case where an object cannot be created because it refers to some other object farther down in the XML.

5. Changes to the DTD

The DTD has been revised to fix bugs, to implement some omitted features, and to add a new data type for HDF5 1.4. Except for the compound data, the DTD is compatible with the earlier revision, although some XML parsers may raise errors for older files.

The changes are:


The DTD for HDF5 1.2.2, the DTD for HDF5 1.4, and the diff of these two are available.

6. Metrics and Code


The XML support adds 18 functions and more than doubles the number of lines of code in the dumper.

Version            Lines (from 'wc')    Functions
h5dump.c (1.80)    1860
h5dump with XML    4308                 +18

The revised code is here: h5dump.c
The diffs are here: diffs