May 1, 2001
Robert E. McGrath
| Platform |
| Sparc 5 |
| Solaris 2.7 |
| NFS mounted disk |
The software was all based on the most recent release of HDF5 and the HDF5 Java tools. Minor changes were made to h4toh5, which do not affect the data reported here. Table 2 summarizes the software environment.
The input files were created on other systems with several versions
of HDF4.
| Utilities: h4toh5, h5dump: from HDF5-1.4.1 |
| Java: h5gen 1.0 (April 15, 2001) |
| gzip 1.2.4 |
The input data consisted of three HDF4 files. These files were selected
as being somewhat representative of NASA EOS data. None of the files
is a strict HDFEOS file. The files ranged from 16 to 138 MB, with
both data and metadata elements. Table 3 describes the files.
The main data elements are the large arrays and tables identified with
the JHV tool. These are the "problem size data", although not all the data
values are necessarily written on disk, and some may be compressed on disk.
The total size of the file includes all data and metadata on the disk.
| File | Main Data Elements | Size of Main Data Elements (no compression) MB | File Size (bytes) |
| AVHRR Ocean Pathfinder Equal Angle 1/1/96 1996001h09da-gdm | 2 SDS, 2048 x 4096, uint8 (no data written); 2 RI8, 2048 * 4096 | 16 | 16,862,109 |
| CERES_ES8 has HDFEOS metadata Organized as a swath ceres.hdf | Geolocation Vdata: 2148 records, float32 2* SDS 2148 * 660, float 32 Data | 113 | 81,229,295 |
| MODIS Airborne Simulator SCAR-B 23 August 1995 Flight line 15 scarb_mas_950823_015.hdf | Data: SDS 1525 * 50 * 716, uint16 Cal: Other: | 118 | 138,444,964 |
Each sample HDF4 file was converted to HDF5 using the h4toh5 utility.
The resulting HDF5 file was converted to XML using h5dump --xml.
The resulting XML file was compressed with gzip. The XML was converted
back to HDF5 with h5gen, and the gzipped XML was converted to HDF5 with:
zcat tile.xml.gzip | h5gen
File sizes were reported with 'ls -l'.
In each case, the wall clock time was collected with 'time'. Each
sequence of conversions was repeated 7 times, and the median elapsed
time is reported.
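The measurement procedure (repeat a step several times, record the wall clock time of each run, report the median) can be sketched as follows. This is an illustration only, not the script used for this report: the helper name `median_wall_time` is hypothetical, and the command shown is a do-nothing placeholder to be replaced with an actual conversion step.

```python
import statistics
import subprocess
import sys
import time


def median_wall_time(cmd, repeats=7):
    """Run cmd `repeats` times and return the median wall-clock time in seconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(cmd, check=True)
        elapsed.append(time.perf_counter() - start)
    return statistics.median(elapsed)


if __name__ == "__main__":
    # Placeholder command (assumption): substitute a real conversion step.
    cmd = [sys.executable, "-c", "pass"]
    print(f"median elapsed: {median_wall_time(cmd):.3f} s")
```

The median is less sensitive than the mean to the occasional slow run caused by a shared machine or network disk.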
The results are summarized in Table 4. The MODIS file contained
an array that was too large for h5gen to convert from XML to HDF5.
All other conversions completed correctly, although some warnings were
ignored.
The file sizes were constant for each repetition.
The times did not vary greatly, although the variation was larger for
the larger files. This variation was likely due to random network
congestion (the disk was NFS mounted) and to time sharing. In all cases,
there were no more than a few outliers in the time measurement.
| File | HDF4 Size (MB) | Convert H4 to H5 (median sec) | Converted H5 file (MB) | Convert H5 to XML (median sec) | Size of .XML file (MB) | gzip the XML file (median sec) | Size of .XML.gz (MB) | Convert XML to H5 (median sec) | Convert XML.gz to H5 (median sec) |
| AVHRR (n = 7) | 16.86 | 42 | 16.78 | 305 | 40 | 22 | 1.78 | 84 | 80 |
| CERES (n = 7) | 81.2 | 300 | 80.7 | 418 | 245 | 311 | 33 | 661 | 447 |
| MODIS (n = 7) | 138 | 663 | 206 | 1384 | 313 | 842.5 | 82 | FAIL: too big | FAIL: too big |
Table 5 summarizes the observed sizes of the files produced by the conversions. The HDF5 file is usually extremely similar in size to the HDF4 original, and often slightly smaller. The MODIS file, however, was significantly larger when converted to HDF5. This may be due to the naive translation in this release of h4toh5: compression is not applied in HDF5, and data that is entirely fill-values is written into the HDF5 file.
As expected, the XML for an HDF5 file is much larger than the corresponding HDF5 file. Table 5 shows that the XML is at least 1.5 times as large as HDF5. Note that the XML will be the same for compressed HDF5, so the ratio will be much larger for any HDF5 that is effectively compressed.
The XML compressed quite well with gzip, producing much smaller files.
Table 5 shows the ratio of the original HDF4 to the gzip compressed XML.
The minimum compression ratio is 1.68, and the best is 9.47.
| File | H4 / H5 | H5 / XML | H4 / XML.gz |
| AVHRR | 1.005 | .42 | 9.47 |
| CERES | 1.006 | .33 | 2.46 |
| MODIS | .67 | .66 | 1.68 |
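The ratios in Table 5 follow directly from the file sizes in Table 4. The short sketch below recomputes them; the sizes are copied from Table 4 (in MB), while the variable and function names are made up for this illustration.

```python
# File sizes in MB, copied from Table 4: (HDF4, HDF5, XML, XML.gz).
SIZES_MB = {
    "AVHRR": (16.86, 16.78, 40.0, 1.78),
    "CERES": (81.2, 80.7, 245.0, 33.0),
    "MODIS": (138.0, 206.0, 313.0, 82.0),
}


def size_ratios(h4, h5, xml, xml_gz):
    """Return the three Table 5 ratios: H4/H5, H5/XML, H4/XML.gz."""
    return (h4 / h5, h5 / xml, h4 / xml_gz)


for name, sizes in SIZES_MB.items():
    h4_h5, h5_xml, h4_gz = size_ratios(*sizes)
    print(f"{name}: H4/H5={h4_h5:.3f}  H5/XML={h5_xml:.2f}  H4/XML.gz={h4_gz:.2f}")
```

A ratio above 1 in the last column means the gzipped XML is smaller than the original HDF4 file.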
Table 6 shows the conversion rate, adjusted for the size of the input data. The HDF4 to HDF5 column gives the rate in MB/s using the size of the HDF4 file, and the HDF5 to XML column uses the size of the HDF5 file.
The XML to H5 and XML.gz to H5 columns both report the rate relative to the size of the uncompressed XML file. In this case, the time difference is attributable to the decompression time.
It should be noted that the conversion is affected by the number of
objects to be converted as well as the total size of the file. This
is not reflected in these numbers.
| File | HDF4 to HDF5 | HDF5 to XML | XML to H5 | XML.gz to H5 |
| AVHRR | .8 | .05 | .41 | .38 |
| CERES | .29 | .18 | .41 | .52 |
| MODIS | .06 | .12 | (failed) | (failed) |
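The rate described here is simply a size in MB divided by an elapsed time in seconds. The sketch below shows that calculation for one entry, using the Table 4 values for the AVHRR HDF5 to XML step. It illustrates the definition only: the function name is hypothetical, and because the report does not state which timings or rounding were used for Table 6, values recomputed from the Table 4 medians need not match Table 6 digit for digit.

```python
def rate_mb_per_s(size_mb, seconds):
    """Conversion rate: size of the input (MB) per second of wall clock time."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return size_mb / seconds


# AVHRR, HDF5 to XML: 16.78 MB input, 305 s median (values from Table 4).
rate = rate_mb_per_s(16.78, 305)
print(f"AVHRR HDF5 to XML: {rate:.3f} MB/s")
```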
This experiment shows that the HDF4 to HDF5 converter and XML dump
can be used with realistic datasets. The processing times are not
trivial, and the default conversions can result in files that are very
much larger than the original HDF4.
The data is insufficient to determine the processing bottlenecks. However, informal observations indicate that these conversions are memory and IO intensive.
As expected, the XML file is much larger than the HDF5 file. However, it was never more than about three times as large, which is not as bad as might be feared. Of course, this sample is neither large nor representative enough to draw general conclusions.
As expected, the XML compressed very well with gzip. This can probably be attributed to the large amount of redundancy in the XML, such as white space and repeated tags.
In every case, the XML compressed to smaller than the original HDF4. Note, too, that parsing the compressed XML was not greatly slower than parsing the raw XML. If these findings prove to be general, they imply that XML need not be eschewed simply on the grounds that it takes too much storage space.