February 21, 2001
Robert E. McGrath
Contents
1. Overview
2. Prerequisites
2.1 HDF5 DTD and Tools
2.2. NetCDF DTD and Tools
3. Mapping netCDF to HDF5
3.1. Background
3.2. The mapping used in this experiment
3.3. netCDF variables and attributes
3.4. netCDF data types and values
3.5. netCDF Dimensions
4. Design and Implementation with XSL
4.1. Software used
4.2. Architecture
4.3. Design of stylesheet
4.4. Considerable in-line code is needed
5. Results and Conclusions
5.1. It works!
5.2. Why XSL?
5.3. Performance?
5.5. Suggested Revisions for the netCDF and HDF5 DTDs
5.6 Implications for Other Translations
6. Resources
6.1. Software used:
6.2. References
1. Overview
Conversion from netCDF to HDF5 is conceptually straightforward because the HDF5 data model is a superset of netCDF's model, so all concepts in netCDF map to an appropriate concept in HDF5. Converting from HDF5 to netCDF would be more difficult, because HDF5 has quite a few concepts not represented in netCDF. These include groups and a much more elaborate set of data types, including compound datatypes.
This experiment is made possible by the existence of standard (or pseudo-standard) Document Type Definitions (DTDs) for each binary format, and tools for translating between the binary format and XML and vice versa.
Given these standard tools, XSL style sheets can be defined to translate between an XML description of a dataset and another XML description of the data. The translation can be a filter (selecting some parts), and can perform substantial rearrangement and reformatting.
Significantly, the XSL stylesheet is a template, and therefore comparatively easy to distribute and modify.
While limited to one example (netCDF to HDF5), the techniques here can be applied to many data formats.
This report summarizes a recent experiment. Section 2 describes the important technical prerequisites that enable this experiment. Section 3 explains the conceptual mapping from netCDF to HDF5 used in this experiment. (There are many possible such mappings.) Section 4 describes the design and implementation of the software used in this experiment. Section 5 reports some results and conclusions.
2. Prerequisites
A second prerequisite is a standard XML Document Type Definition (DTD) (or Schema) for both formats. The conceptual mapping is applied to the two DTDs, transforming all the applicable elements.
The third prerequisite is the availability of tools to reliably convert between the binary format and XML, and vice versa. Given a DTD, these tools are straightforward, and provide a critical foundation for other tools.
This experiment was possible because HDF5 has a standard DTD [4] and the h5gen [5] tool for converting XML descriptions to HDF5, and netCDF has a proposed DTD [1] (actually two), and tools for converting between XML and netCDF. [1]
It should be noted that these tools are completely general, and were not specifically constructed for this experiment.
A second DTD and tool have been proposed for netCDF (ncML, [2]), but were not used for this experiment. The ncML DTD is similar to the ncxdump DTD, but does not include the values of data in the variables (i.e., the numbers in the dataset). Instead, it points to the data in the binary netCDF file. The authors explicitly state their assumptions to include:
"we assume that a ncML document is created from an existing netCDF file, basically to provide the users with a more convenient means of exchanging metadata information." ([2], p.30)
In other words, they do not attempt to support the use of XML for transforming the whole dataset.
Neither of these DTDs has been officially designated or supported by Unidata to date. Results of this experiment may suggest improvements in the DTD that is ultimately standardized.
3. Mapping netCDF to HDF5
3.1. Background
The HDF5 data model was designed to be very general, and to subsume the conceptual models used in many scientific data formats. In particular, the concepts of netCDF can be mapped to HDF5 in a very simple fashion, with little loss of 'information'.
It should be emphasized that this conceptual compatibility does not mean that the HDF5 and netCDF software are compatible, nor even that the files are compatible. It means that the files are translatable, in that every object in a netCDF file can be converted to a conceptually identical object in HDF5. It would be perfectly possible to create software that operated on these conceptual objects using both netCDF and HDF5 software and files. (For example, see the project reported in [3].) This experiment does not demonstrate such software.
3.2. The mapping used in this experiment
For this experiment, the objects of the netCDF DTD from [1] are mapped to conceptually corresponding HDF5 objects. It is important to note that there are many possible ways to do such a mapping. The mapping used here is only a demonstration; a real implementation would need to carefully define and document this mapping.
The netCDF DTD captures the essence of the netCDF data model, defining ten XML tags:

    <!ELEMENT netcdf (name,(dim|var|att)*)>
    <!ELEMENT dim (name,size)>
    <!ATTLIST dim id ID #REQUIRED>
    <!ELEMENT var (type,name,att*,data?)>
    <!ATTLIST var dims IDREFS #IMPLIED>
    <!ELEMENT att (type,name,value)>
    <!ELEMENT data (record+|value)>
    <!ELEMENT record (value)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT size (#PCDATA)>
    <!ELEMENT type (#PCDATA)>
    <!ELEMENT value (#PCDATA)>

The translation to HDF5 consists of rules for processing these tags, converting each into relevant marked up XML following the HDF5 DTD. Table 1 shows the basic mapping.
netCDF element | HDF5 object |
/netcdf/att | Attribute of root group |
/netcdf/dim | HDF5 Dataset in reserved Group |
/netcdf/var | HDF5 Dataset with same dimensionality and data type |
/netcdf/var/att | Attribute on the appropriate Dataset |
3.3. netCDF variables and attributes
In the case of attributes and variables, the netCDF object is converted
to an HDF5 object with 'equivalent' dimensions and data type, and the data
is copied. Table 2 summarizes the conceptual mapping.
netCDF element | HDF5 component | Notes |
var/type | Datatype | Types are translated according to a standard rule. See below. |
var dims="list" | Dataspace | The sizes of the dimensions are used to create an HDF5 dataspace with the corresponding dimensionality. |
var/data/* | Data | Data values are copied, with minor reformatting. |
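As an illustration of the Dataspace row of Table 2, the following JavaScript sketch builds a dataspace description from a list of dimension sizes. The function name simpleDataspaceXml is hypothetical and is not taken from the stylesheet. The element and attribute names follow the HDF5 XML fragments shown in this report, only fixed-size dimensions are handled, and the code is written in present-day JavaScript rather than the dialect available to the original experiment.

```javascript
// Sketch only: build an HDF5 XML <Dataspace> from fixed dimension sizes.
// simpleDataspaceXml is a hypothetical helper, not part of nctoh5.xsl.
function simpleDataspaceXml(sizes) {
  // One <Dimension> per netCDF dimension; current and maximum size are equal
  // because unlimited dimensions are deliberately not handled here.
  const dims = sizes
    .map((n) => '    <Dimension DimSize="' + n + '" MaxDimSize="' + n + '" />')
    .join("\n");
  return (
    "<Dataspace>\n" +
    '  <SimpleDataspace Ndims="' + sizes.length + '">\n' +
    dims + "\n" +
    "  </SimpleDataspace>\n" +
    "</Dataspace>"
  );
}
```

Calling simpleDataspaceXml with the sizes of the dimensions named in a variable's 'dims' list, in the same order, gives a dataspace of the same rank as the netCDF variable.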
3.4. netCDF data types and values
All netCDF data types can be mapped to appropriate HDF5 datatypes.
Table 3 shows this mapping.
netCDF type | HDF5 datatype |
<type>char</type> | <StringType Cset="H5T_CSET_ASCII" StrSize="<max size of strings>" StrPad="H5T_STR_NULLPAD" /> |
<type>byte</type> | <IntegerType ByteOrder="LE" Sign="true" Size="1"/> |
<type>short</type> | <IntegerType ByteOrder="LE" Sign="true" Size="2"/> |
<type>int</type> | <IntegerType ByteOrder="LE" Sign="true" Size="4"/> |
<type>long</type> | <IntegerType ByteOrder="LE" Sign="true" Size="8"/> |
<type>float</type> | <FloatType ByteOrder="LE" Size="4" SignBitLocation="31" ExponentBits="8" ExponentLocation="23" MantissaBits="23" MantissaLocation="0" /> |
<type>double</type> | <FloatType ByteOrder="LE" Size="8" SignBitLocation="63" ExponentBits="11" ExponentLocation="52" MantissaBits="52" MantissaLocation="0" /> |
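The mapping in Table 3 is regular enough to be expressed as a small lookup. The sketch below is one possible way to do this in JavaScript; hdf5TypeFor and the two tables are hypothetical names, not code from the stylesheet, and the strings it produces are simply the entries of Table 3.

```javascript
// Sketch only: table-driven version of the Table 3 type mapping.
// All names here are hypothetical; the output strings mirror Table 3.
const INTEGER_SIZES = { byte: 1, short: 2, int: 4, long: 8 };
const FLOAT_LAYOUTS = {
  float: { size: 4, sign: 31, expBits: 8, expLoc: 23, manBits: 23 },
  double: { size: 8, sign: 63, expBits: 11, expLoc: 52, manBits: 52 },
};

function hasKey(table, key) {
  return Object.prototype.hasOwnProperty.call(table, key);
}

// ncType is the text of a netCDF <type> element; maxStringSize is only
// used for "char" and must be supplied by the caller.
function hdf5TypeFor(ncType, maxStringSize) {
  if (ncType === "char") {
    return '<StringType Cset="H5T_CSET_ASCII" StrSize="' + maxStringSize +
      '" StrPad="H5T_STR_NULLPAD" />';
  }
  if (hasKey(INTEGER_SIZES, ncType)) {
    return '<IntegerType ByteOrder="LE" Sign="true" Size="' +
      INTEGER_SIZES[ncType] + '"/>';
  }
  if (hasKey(FLOAT_LAYOUTS, ncType)) {
    const f = FLOAT_LAYOUTS[ncType];
    return '<FloatType ByteOrder="LE" Size="' + f.size +
      '" SignBitLocation="' + f.sign +
      '" ExponentBits="' + f.expBits +
      '" ExponentLocation="' + f.expLoc +
      '" MantissaBits="' + f.manBits +
      '" MantissaLocation="0" />';
  }
  throw new Error("unknown netCDF type: " + ncType);
}
```

A lookup of this kind keeps the type rule in one place, so a change to the mapping does not have to be repeated wherever a datatype is emitted.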
3.5. netCDF Dimensions
The dimensions of the netCDF variable are used to determine the dimensionality
of the HDF5 dataspace. Note that the netCDF XML gives the dimensions by
listing the names of the <dim> elements that define the dimensions.
E.g.,

    <var dims="d0 d1 d3">
      <name>foo</name>
      ...

This describes a variable (dataset) named "foo" which has three dimensions. The dataset is d0 by d1 by d3 elements, where each ID, such as 'd0', must identify a <dim> element in the XML. E.g.:

    <dim id="d0">
      <name>i</name>
      <size>2</size>
    </dim>

Note that the netCDF DTD uses an 'ID' that is different from the dimension's 'name'.
The dimensions in the netCDF file are mapped to one dimensional HDF5
datasets, with a datatype that is 32 bit integers. All the dimensions
are stored in a special HDF5 group, called "/NC_DIMS". (This
is similar in spirit to the design in [cite].) The dimension above would
be stored as the following HDF5 dataset:

    <Group Name="/NC_DIMS" OBJ-XID="/NC_DIMS" Parents="root" >
      <Dataset Name="i" OBJ-XID="/NC_DIMS/i" Parents="/NC_DIMS" >
        <Dataspace>
          <ScalarDataspace />
        </Dataspace>
        <DataType>
          <AtomicType>
            <IntegerType ByteOrder="BE" Sign="false" Size="4" />
          </AtomicType>
        </DataType>
        <Data>
          <DataFromFile>
            2
          </DataFromFile>
        </Data>
      </Dataset>
    </Group>

In addition to creating the HDF5 dataspace to correspond to the
netCDF dimensions specified in the 'dims', the HDF5 Dataset is given a
special HDF5 Attribute called "NC_DIMVARS", which is a one dimensional
array of object references, each of which points to the dataset that corresponds
to the dimension. Thus, the dataset "foo" above would have an attribute
something like:

    <Attribute Name="/NC_DIMVARS">
      <Dataspace>
        <SimpleDataspace Ndims="1" >
          <Dimension DimSize="3" MaxDimSize="3" />
        </SimpleDataspace>
      </Dataspace>
      <DataType>
        <AtomicType>
          <ReferenceType>
            <ObjectReferenceType/>
          </ReferenceType>
        </AtomicType>
      </DataType>
      <Data>
        <DataFromFile>
          "/NC_DIMS/i" "/NC_DIMS/j" "/NC_DIMS/l"
        </DataFromFile>
      </Data>
    </Attribute>

One other translation was necessary to handle unlimited dimensions, and to make any HDF5 dataset that uses unlimited dimensions 'chunked'. It should be noted that the netCDF DTD used here did not have a tag for the 'current size' of an unlimited dimension. Instead, this is designated in a comment! For example:

    <!-- rec = UNLIMITED; -->
    <dim id="d6">
      <name>rec</name>
      <size>unlimited</size>
      <!-- currently 3 -->
    </dim>

One other anomaly in the netCDF DTD should be mentioned: the dimensionality of netCDF attributes is not included in the DTD. It is necessary to 'peek ahead' at the data values to discover if there is one or possibly more than one value.
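The two workarounds just described, reading the current size of an unlimited dimension from a comment and peeking at attribute values to count them, might look roughly like the following in JavaScript. Both function names are hypothetical. The sketch assumes that the comment text has the form "currently N", that numeric attribute values are separated by commas, and that a char attribute holds a single unquoted string; none of these details are specified by the DTD.

```javascript
// Sketch only: hypothetical helpers for the two DTD workarounds.

// Return the current size recorded in a comment such as " currently 3 ",
// or null if the comment does not carry one.
function currentSizeFromComment(commentText) {
  const match = /currently\s+(\d+)/.exec(commentText);
  return match ? parseInt(match[1], 10) : null;
}

// "Peek ahead" at the text of an attribute's <value> element and report
// how many values it holds. Assumes comma-separated numbers, and a single
// unquoted string for type "char".
function attributeLength(type, valueText) {
  if (type === "char") {
    return valueText.length;
  }
  return valueText
    .split(",")
    .filter((token) => token.trim().length > 0)
    .length;
}
```

The count returned by attributeLength is what would be needed to choose between a scalar and a one dimensional dataspace for the HDF5 attribute.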
4. Design and Implementation with XSL
4.1. Software used
A suite of free Java software was used. The only Java code that was written was a simple main program, and it was actually copied from an example. All of the problem specific logic is in the style sheet.
The specific software for the translation was:
Library | Purpose | Source |
xerces.jar | XML parsing | Apache.org (http://www.apache.org/) |
xalan.jar | XSL transformation | Apache.org (http://www.apache.org/) |
bsf.jar, bsfengines.jar | "Bean Scripting Framework" | IBM Alphaworks (http://alphaworks.ibm.com/) |
js.jar | JavaScript | http://www.mozilla.org/rhino |
This amounts to some 2.6MB of Java code. Note that this is all general purpose code.
4.2. Architecture
The overall translation has three steps: from binary netCDF to XML (ncxgen), from XML to XML (nctoh5.xsl), and then from XML to HDF5 (h5gen). Figure 1 shows the overall data flow.
Figure 1. Data flow in netCDF to HDF5 Conversion
The first step uses the ncxgen program ([1]), which is a C program that calls the netCDF library to read the netCDF input file. The output is written as XML. Figure 2 shows the main software and data flow for this step.
Figure 2. Step 1: convert netCDF binary file to XML
(conformant to netCDF DTD)
The second step converts the XML description of the netCDF into an XML description of the target HDF5 file. This step uses the XSL style sheet and JavaScript (embedded in the style sheet). The netCDF DTD is used to validate the input XML file. The main program (nctoh5.class) is trivial: it simply calls the xalan.jar library, which does all the work. The xeena.jar library parses the XML input, and the bsf libraries support the loading of the js.jar JavaScript interpreter. Figure 3 shows the architecture of this step. This software is 100% Java. Figure 4 shows the nctoh5 main program.
Figure 3. Step 2: convert netCDF XML to HDF5 XML, using XSL.

    import org.xml.sax.SAXException;
    import org.apache.xalan.xslt.XSLTProcessorFactory;
    import org.apache.xalan.xslt.XSLTInputSource;
    import org.apache.xalan.xslt.XSLTResultTarget;
    import org.apache.xalan.xslt.XSLTProcessor;

    public class nctoh5 {
        public static void main(String[] args)
            throws java.io.IOException,
                   java.net.MalformedURLException,
                   org.xml.sax.SAXException
        {
            // The processor applies the stylesheet to the input document.
            XSLTProcessor processor = XSLTProcessorFactory.getProcessor();
            processor.process(new XSLTInputSource(args[0]),    // input: nc.xml
                              new XSLTInputSource(args[1]),    // xsl: nctoh5.xsl
                              new XSLTResultTarget(args[2]));  // out: h5.xml
        }
    }

Figure 4. Listing of nctoh5.java
The third step converts the XML description of HDF5 into a binary HDF5 file. This step uses the h5gen program [5], which uses the HDF5 DTD to validate the input file. The h5gen program also uses the xeena.jar library to parse XML, and the jhi5.jar JNI interface to the HDF5 library (C). Figure 5 shows this step.
Figure 5. Step 3: convert XML to HDF5 binary.
4.3. Design of stylesheet
Several revisions of the stylesheet have been tried. The essential design is one template for each of the elements of the netCDF DTD. However, some of the elements are related, e.g., the <data> cannot be correctly copied without checking the <type> element.
The stylesheet is approximately 800 lines long, of which about 350 lines are XSL templates (the rest is JavaScript, as discussed below). It should be noted that this is not expertly written XSL; well written templates might be much smaller.
For example, a considerable amount of the template code is in <xsl:choose> statements, which are switching on the data type (i.e., if type is byte, getbytes else if type is short then getshort, etc.). At each place where data is copied, there is one of these 36-line alternatives.
4.4. Considerable in-line code is needed
Unfortunately, the translation could not be accomplished solely with XSL templates. It was necessary to use JavaScript to process some of the data.
XSL style sheets can transform elements and strings from input to output in complex ways, but cannot have any 'memory' or saved state. JavaScript was used where it was necessary to 'remember' information for later look up. The primary example of this was the need to be able to look up the netCDF dimensions when processing the <var> elements.
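The kind of 'memory' meant here can be sketched as a small registry that is filled while the <dim> elements are processed and consulted while the <var> elements are processed. The code below is an illustration in present-day JavaScript, not the code embedded in the stylesheet; rememberDim, lookupDims, and dimVarPaths are hypothetical names, and the "/NC_DIMS/" prefix follows the group name used in Section 3.5.

```javascript
// Sketch only: saved state for dimensions, keyed by the XML ID of <dim>.
// All function names are hypothetical.
const dimsById = new Map();

// Called once per <dim id="..."> element.
function rememberDim(id, name, size) {
  dimsById.set(id, { name: name, size: size });
}

// Resolve the whitespace-separated IDs in <var dims="..."> to the
// remembered dimensions, preserving their order.
function lookupDims(dimsAttr) {
  const ids = dimsAttr.trim().split(/\s+/).filter((id) => id.length > 0);
  return ids.map((id) => {
    const dim = dimsById.get(id);
    if (!dim) {
      throw new Error("undefined dimension ID: " + id);
    }
    return dim;
  });
}

// Paths of the dimension datasets, as they would be listed in NC_DIMVARS.
function dimVarPaths(dimsAttr) {
  return lookupDims(dimsAttr).map((dim) => "/NC_DIMS/" + dim.name);
}
```

With the dimension from Section 3.5 registered as rememberDim("d0", "i", 2), the call dimVarPaths("d0") returns the single path "/NC_DIMS/i".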
In addition, it was necessary to process the values within elements. There were two important cases: the list of dimensions of a <var>, and the values of data to be copied. The former case required pulling out each dimension ID from the list, and locating the appropriate information from the corresponding <dim> element. The latter case involved the trivial matter of removing commas between elements and the nontrivial matter of checking for number format conventions such as the type suffix in '1b' (a byte constant).
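The reformatting of data values can be sketched as follows. This is an illustration, not the stylesheet's code: cleanValues is a hypothetical name, and the treatment of a trailing b, s, or L on an integer as a marker to be dropped is an assumption about the input rather than something this report specifies. Floating point suffixes and other notations are deliberately left untouched.

```javascript
// Sketch only: turn comma-separated netCDF XML values into the
// whitespace-separated form used in the HDF5 XML.
// cleanValues is a hypothetical helper.
function cleanValues(text) {
  return text
    .split(",")
    .map((token) => token.trim())
    .filter((token) => token.length > 0)
    // Assumption: an integer followed by b, s, or L carries a type
    // suffix (as in "1b"), and only the digits should be copied.
    .map((token) => token.replace(/^([-+]?\d+)[bBsSlL]$/, "$1"))
    .join(" ");
}
```

For example, cleanValues("1b, -2b,3") yields "1 -2 3".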
The data processing required was very simple, and all of it was accomplished using JavaScript defined in the style sheet. There was no need to invoke any outside code, nor to use any scratch disk or other resources.
While more than half the lines of the XSL style sheet are JavaScript, the procedural code is mostly trivial. Furthermore, some of it may not be needed, and some can be eliminated by simple changes to the netCDF DTD.
5. Results and Conclusions
5.1. It works!
The most important finding is that this translation works. Several test files have been completely converted from netCDF to netCDF XML, to HDF5 XML, to HDF5. One test was a file with a 5D dataset with 22,000 elements. All the objects of the netCDF file are represented in the HDF5 file, and the numbers are correctly transferred, at least according to hand checks of output dumps.
5.2. Why XSL?
Of course, it is perfectly possible to write a 'netCDF to HDF5' program in C. In fact, the HDF4 library has completely transparent interoperation with netCDF. What is different about using a stylesheet?
In either case, a conceptual mapping must be made, and the mapping can be implemented in many technologies.
In using a style sheet to translate, the logic of the translation is stated as templates, i.e., rules, rather than procedural code. These rules are relatively short, and closely related to the conceptual mapping. A template is more portable (it is interpreted) and may be easier to maintain or modify.
The total number of lines of code is probably comparable, if you count all the general purpose XML and XSL libraries used by the style sheet. On the other hand, everything but the template itself is completely general and usable for other purposes. The code specific to translation is very much shorter than the C code would be.
5.3. Performance?
The use of XML and/or Java raises questions of performance, especially for realistic scientific data. Indeed, the XML to XML conversion is somewhat slow. For a dataset with an array of 22,000 numbers, the conversion from netCDF XML to HDF5 XML took 42 seconds on a SPARC running Solaris, and 10 minutes on a Pentium II laptop running Windows 98.
This experiment did not obtain sufficient information to determine what parts of the process are slow, or whether the results could be improved with a better style sheet, optimized Java, or by using C++ instead.
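To put the two timings quoted above on a common scale, they can be restated as values converted per second. The short JavaScript calculation below uses only the figures given in this section; valuesPerSecond is a hypothetical helper introduced for the arithmetic.

```javascript
// Sketch only: restate the reported timings as throughput.
const values = 22000;          // numbers in the test array
const sparcSeconds = 42;       // SPARC running Solaris
const laptopSeconds = 10 * 60; // Pentium II laptop running Windows 98

function valuesPerSecond(count, seconds) {
  return count / seconds;
}

const sparcRate = valuesPerSecond(values, sparcSeconds);   // about 524 values/s
const laptopRate = valuesPerSecond(values, laptopSeconds); // about 37 values/s
const slowdown = laptopSeconds / sparcSeconds;             // about 14 times slower
```

Either rate is low for realistic scientific datasets, which is consistent with the caution expressed above.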
5.5. Suggested Revisions for the netCDF and HDF5 DTDs
As discussed above, the ncML DTD could not be used for this experiment. The HDF5 DTD has an option to include either the numbers or point to the data in the binary file. Perhaps ncML should have a similar alternative.
This experiment found the netCDF DTD to be complete, but it would be easier to use with a few changes.
First, the <dim> elements should definitely have a current and max size. The use of a comment to indicate the current size is not really a good design.
Second, the handling of dimension lists should be reviewed. The ncML DTD uses name swizzling to assure unique XML IDs that are related to the actual dimension names. This could be used. On the other hand, it is questionable whether this is even necessary, since the reference to the dimensions in the XML is minimally useful: XML tools can't really interpret what the relation is, only 'netCDF-aware' software can. To the degree that this is so, it is more helpful to have the actual names of the dimensions, rather than an XML alias.
Third, it would be helpful to include dataspace information for attributes in the netCDF XML.
Fourth, netCDF may wish to consider adding an explicit <NoData> tag, as HDF's DTD has. This is easier to parse than a series of '_'.
Fifth, the netCDF DTD has a markup for <record> within <data>. This is a step towards marking up the data values, which HDF5 may want to follow. However, the <value> element is completely free-form (just like HDF5's <Data>), and <record> is only used for unlimited dimensions.
Marking up the data values is a significant open issue for both the netCDF and HDF5 DTDs. XML Schema is widely believed to be a good means to deal with this problem, but there is no widely accepted standard yet. It would be advantageous to develop a public standard for marking up large arrays of numbers, to be used by HDF5, netCDF, and others.
5.6 Implications for Other Translations
While the exact stylesheet and JavaScript are specific to the netCDF to HDF5 translation, they clearly illustrate the possibility of translating between complex data formats.
The most important prerequisite is conceptual compatibility, and a conceptual mapping of the data models. Then, given standard XML DTDs or other schemas, and general purpose tools for reading and writing XML, stylesheets can be used to translate.
It is important to note that there might well be different translations
desired, and it might also be valuable to filter, i.e., to select and/or
reformat objects in the conversion. XSL stylesheets are ideal for
this purpose, as all the other software, including the input XML, can be
common.