An Experimental Comparison of HDF4, HDF5, and XML Representations of the Same Dataset

May 1, 2001
Robert E. McGrath

Introduction

With the HDF5-1.4.1 release and the Java HDF5 tools, it is now possible to convert HDF4 files to HDF5, and to convert HDF5 to XML and back.  This experiment applied these tools to three HDF4 files containing realistic NASA EOS data.
 

Method

The experiment was run on a Sparc workstation.  Table 1 summarizes the hardware environment.
 
Table 1. Experimental Hardware
Platform: Sparc 5, Solaris 2.7, NFS-mounted disk

All of the software was based on the most recent release of HDF5 and the HDF5 Java tools.  Minor changes were made to h4toh5; these do not affect the data reported here.  Table 2 summarizes the software environment.

The input files were created on other systems with several versions of HDF4.
 
Table 2. Software environment
Utilities: h4toh5, h5dump: from HDF5-1.4.1
Java:  h5gen 1.0 (April 15, 2001)
gzip 1.2.4

Data


The input data consisted of three HDF4 files.  These files were selected as being somewhat representative of NASA EOS data.  None of the files is a strict HDFEOS file.  The files ranged from 16 to 138 MB, with both data and metadata elements.  Table 3 describes the files.  The main data elements are the large arrays and tables identified with the JHV tool. These are the "problem size data", although not all the data values are necessarily written on disk, and some may be compressed on disk.  The total size of the file includes all data and metadata on the disk.
 
 
Table 3.  Summary of Input Files (HDF4)

File: AVHRR Ocean Pathfinder Equal Angle, 1/1/96
File name: 1996001h09da-gdm
Main data elements:
    2 SDS, 2048 x 4096, uint8 (no data written)
    2 RI8, 2048 * 4096
Size of main data elements (no compression): 16 MB
File size: 16,862,109 bytes

File: CERES_ES8 (has HDFEOS metadata; organized as a swath)
File name: ceres.hdf
Main data elements:
    Geolocation:
        Vdata: 2148 records, float32
        2 * SDS 2148 * 660, float32
    Data:
        19 Vdata, 2148 records, float32
        18 * SDS 2148 * 660, float32
Size of main data elements (no compression): 113 MB
File size: 81,229,295 bytes

File: MODIS Airborne Simulator, SCAR-B, 23 August 1995, Flight line 15
File name: scarb_mas_950823_015.hdf
Main data elements:
    Data: SDS 1525 * 50 * 716, uint16
    Cal: 2 * 1525 * 50, int8
    Lat lon: 2 * 1525 * 716, float32
    Other: 35 * SDS 1 * 1525, float32
Size of main data elements (no compression): 118 MB
File size: 138,444,964 bytes

Procedure


Each sample HDF4 file was converted to HDF5 using the h4toh5 utility.  The resulting HDF5 file was converted to XML using h5dump --xml.  The resulting XML file was compressed with gzip.  The XML was converted back to HDF5 with h5gen, and the gzipped XML was converted to HDF5 with:
zcat tile.xml.gzip | h5gen
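
The same streaming idea can be sketched in Python: the gzipped XML is decompressed in chunks and fed to the standard input of a consumer process, so the uncompressed XML is never written to disk.  This is only an illustrative sketch; the function name pipe_gzip_to_command is hypothetical and is not part of the HDF5 tools.

```python
import gzip
import shutil
import subprocess


def pipe_gzip_to_command(gz_path, command):
    """Decompress gz_path and stream the bytes to the stdin of command.

    Hypothetical helper for illustration. Returns the exit status of
    the command.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    try:
        with gzip.open(gz_path, "rb") as src:
            # Copy in chunks so the XML is never held in memory at once.
            shutil.copyfileobj(src, proc.stdin)
    finally:
        # Closing stdin signals end-of-input to the consumer.
        proc.stdin.close()
    return proc.wait()
```

For the pipeline above, command would be ["h5gen"], assuming h5gen reads the XML from standard input as the shell command suggests.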

File sizes were reported with 'ls -l'.

In each case, the wall clock time was collected with 'time'.  Each sequence of conversions was repeated 7 times, and the median elapsed time is reported.
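
The measurement can be approximated in a few lines of Python.  The sketch below is not the script used in the experiment (which relied on the shell 'time' command); it only illustrates repeating a command and taking the median wall clock time.  The function name median_wall_time and the stand-in command are hypothetical.

```python
import statistics
import subprocess
import sys
import time


def median_wall_time(command, repeats=7):
    """Run command `repeats` times and return the median elapsed seconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True)  # raises if the command fails
        elapsed.append(time.perf_counter() - start)
    return statistics.median(elapsed)


if __name__ == "__main__":
    # Stand-in command that does nothing; substitute a conversion step.
    print(median_wall_time([sys.executable, "-c", "pass"]))
```

The median is used rather than the mean so that a few slow runs, such as those caused by network congestion, do not dominate the reported time.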
 

Results


The results are summarized in Table 4.  The MODIS file contained an array that was too large for h5gen to convert from XML to HDF5.  All other conversions completed correctly, although some warnings were ignored.

The file sizes were constant for each repetition.

The times did not vary greatly, although the variation was larger for the larger files.  This effect was likely due to random network congestion and other effects due to time sharing.  In all cases, there were no more than a few outliers in the time measurement.
 
 
Table 4.  Results (n = 7 for each file)

Measurement                            AVHRR    CERES    MODIS
HDF4 size (MB)                         16.86    81.2     138
Convert H4 to H5 (median sec)          42       300      663
Converted H5 file (MB)                 16.78    80.7     206
Convert H5 to XML (median sec)         305      418      1384
Size of .XML file (MB)                 40       245      313
gzip the XML file (median sec)         22       311      842.5
Size of .XML.gz (MB)                   1.78     33       82
Convert XML to H5 (median sec)         84       661      FAIL: too big
Convert XML.gz to H5 (median sec)      80       447      FAIL: too big

Table 5 summarizes three size ratios observed for the files produced by the conversions.  For the AVHRR and CERES files, the HDF5 file is nearly the same size as the HDF4 file, and in fact slightly smaller.  The MODIS file was significantly larger when converted to HDF5.  This may be due to the naive translation in this release of h4toh5:  compression is not applied in HDF5, and data that is entirely fill-values is written into the HDF5 file.

As expected, the XML for an HDF5 file is much larger than the corresponding HDF5 file.  Table 5 shows that the XML is at least 1.5 times as large as HDF5.  Note that the XML will be the same for compressed HDF5, so the ratio will be much larger for any HDF5 that is effectively compressed.

The XML compressed quite well with gzip, producing much smaller files.  Table 5 shows the ratio of the original HDF4 to the gzip compressed XML.  The minimum compression ratio is 1.68, and the best is 9.47.
 
 
Table 5.  Conversion Size Ratios

File     H4 / H5    H5 / XML    H4 / XML.gz
AVHRR    1.005      .42         9.47
CERES    1.006      .33         2.46
MODIS    .67        .66         1.68
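
The ratios in Table 5 follow directly from the sizes in Table 4.  As a check, the short Python sketch below recomputes them; the dictionary keys and the function name size_ratios are illustrative choices, and the sizes are the rounded MB values from Table 4, so the results agree with Table 5 only to within rounding.

```python
# Sizes in MB, copied from Table 4.
sizes = {
    "AVHRR": {"h4": 16.86, "h5": 16.78, "xml": 40.0, "xml_gz": 1.78},
    "CERES": {"h4": 81.2, "h5": 80.7, "xml": 245.0, "xml_gz": 33.0},
    "MODIS": {"h4": 138.0, "h5": 206.0, "xml": 313.0, "xml_gz": 82.0},
}


def size_ratios(s):
    """Return the three ratios reported in Table 5 for one file."""
    return {
        "h4/h5": s["h4"] / s["h5"],
        "h5/xml": s["h5"] / s["xml"],
        "h4/xml.gz": s["h4"] / s["xml_gz"],
    }


for name, s in sizes.items():
    r = size_ratios(s)
    print(f"{name}: H4/H5 = {r['h4/h5']:.3f}, "
          f"H5/XML = {r['h5/xml']:.2f}, "
          f"H4/XML.gz = {r['h4/xml.gz']:.2f}")
```

A ratio below 1 in the H5 / XML column means the XML is larger than the HDF5 file; its reciprocal gives the expansion factor.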

Table 6 shows the conversion rate, adjusted for the size of the input data.  The HDF4 to HDF5 column gives the rate in MB/s using the size of the HDF4 file; the HDF5 to XML column uses the size of the HDF5 file.

The XML to H5 and XML.gz to H5 columns both report the rate relative to the size of the uncompressed XML file.  In this case, the difference between the two columns is attributable to the decompression time.

It should be noted that the conversion is affected by the number of objects to be converted as well as the total size of the file.  This is not reflected in these numbers.
 
Table 6.  Conversion speed (MB of input per second)

File     HDF4 to HDF5    HDF5 to XML    XML to H5    XML.gz to H5
AVHRR    .8              .05            .41          .38
CERES    .29             .18            .41          .52
MODIS    .06             .12            n/a          n/a

 

Conclusions


This experiment shows that the HDF4 to HDF5 converter and XML dump can be used with realistic datasets.  The processing times are not trivial, and the default conversions can result in files that are very much larger than the original HDF4.

The data is insufficient to determine the processing bottlenecks.  However, informal observations indicate that these conversions are memory and IO intensive.

As expected, the XML file is much larger than the HDF5 file.  However, the size was never more than three times larger, which is not as bad as might be feared.  Of course, this isn't a large or representative enough sample to draw general conclusions.

As expected, the XML compressed very well with gzip.  This can probably be attributed to the large amount of redundancy in the XML: white space, repeated tags, and so on.

In every case, the XML compressed to smaller than the original HDF4.  Note, too, that parsing the compressed XML was not greatly slower than parsing the raw XML.  If these findings prove to be general, this implies that XML need not be eschewed simply on the grounds that it takes too much storage space.