May 1, 2001
(Revised May 8, 2001)
Robert E. McGrath
| Platform |
| Sparc 5 |
| Solaris 2.7 |
| NFS mounted disk |
All software was based on the most recent releases of HDF5 and the HDF5 Java tools. Minor changes were made to h4toh5, which do not affect the data reported here. Table 2 summarizes the software environment.
The input files were created on other systems with several versions
of HDF4.
| Utilities: h4toh5, h5dump: from HDF5-1.4.1 |
| Java: h5gen 1.0 (April 15, 2001) |
| gzip 1.2.4 |
The input data consisted of three HDF4 files. These files were selected
as being somewhat representative of NASA EOS data. None of the files
is a strict HDFEOS file. The files ranged from 16 to 138 MB, with
both data and metadata elements. Table 3 describes the files.
The main data elements are the large arrays and tables identified with
the JHV tool. These are the "problem size data", although not all the data
values are necessarily written on disk, and some may be compressed on disk.
The total size of the file includes all data and metadata on the disk.
| File | Main Data Elements | Size of Main Data Elements (no compression, MB) | File Size (bytes) |
| AVHRR Ocean Pathfinder Equal Angle, 1/1/96 (1996001h09da-gdm.hdf) | 2 SDS, 2048 x 4096, uint8 (no data written); 2 RI8, 2048 x 4096 | 16 | 16,862,109 |
| CERES_ES8, has HDFEOS metadata, organized as a swath (ceres.hdf) | Geolocation: Vdata: 2148 records, float32; 2 * SDS 2148 * 660, float32. Data: | 113 | 81,229,295 |
| MODIS Airborne Simulator, SCAR-B, 23 August 1995, Flight line 15 (scarb_mas_950823_015.hdf) | Data: SDS 1525 * 50 * 716, uint16. Cal: Other: | 118 | 138,444,964 |
Each sample HDF4 file was converted to HDF5 using the h4toh5 utility.
The resulting HDF5 file was converted to XML using h5dump --xml.
The resulting XML file was compressed with gzip. The XML was
converted back to HDF5 with h5gen, and the gzipped XML was converted
to HDF5 with:
zcat file.xml.gz | h5gen
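The pipe above hands the decompressed XML to the parser as a stream, so no uncompressed copy is written to disk. The same idea can be sketched in Python. This is only an analogy and not how h5gen works: the function names and the tiny XML sample are hypothetical, and the sample does not follow the real HDF5 XML format.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def count_elements_gz(path):
    """Count XML elements in a gzip-compressed file without unpacking it to disk."""
    count = 0
    # gzip.open decompresses lazily as the parser reads, much like a shell pipe.
    with gzip.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            count += 1
            elem.clear()  # drop element contents that are no longer needed
    return count


def demo():
    # Hypothetical stand-in for an XML dump; not the real h5dump --xml output.
    xml_text = "<File><Dataset><Data>1 2 3</Data></Dataset></File>"
    fd, path = tempfile.mkstemp(suffix=".xml.gz")
    os.close(fd)
    try:
        with gzip.open(path, "wb") as out:
            out.write(xml_text.encode("utf-8"))
        return count_elements_gz(path)
    finally:
        os.remove(path)
```

Calling `demo()` writes a small compressed file, parses it as a stream, and returns the number of elements seen.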
File sizes were reported with 'ls -l'.
In each case, the wall clock time was collected with 'time'. Each sequence of conversions was repeated 7 times, and the median elapsed time is reported.
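The measurement procedure (repeat each step 7 times, report the median wall clock time) can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name is hypothetical, and the command being timed is a harmless placeholder rather than one of the conversion utilities.

```python
import statistics
import subprocess
import sys
import time


def median_wall_time(cmd, repeats=7):
    """Run cmd `repeats` times; return (median, all samples) in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the command fails
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


# Placeholder command (assumption): a real run would time a conversion step.
median, samples = median_wall_time([sys.executable, "-c", "pass"], repeats=7)
print(f"median of {len(samples)} runs: {median:.3f} s")
```

The median is used rather than the mean so that a few outliers, such as those caused by a busy network, do not distort the reported time.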
Table 4 shows the wall clock time for the conversion steps. The times did not vary greatly, although the variation was larger for the larger files. This variation was likely due to random network congestion (the disk was NFS mounted) and other effects of time sharing. In all cases, there were no more than a few outliers in the time measurements.
The MODIS file had one dataset that was too large for the h5gen
tool to process.
| File | Convert H4 to H5 | Convert H5 to XML | gzip the XML file | Convert XML to H5 | Convert XML.gz to H5 |
| AVHRR | 42 | 305 | 22 | 84 | 80 |
| CERES | 300 | 418 | 311 | 661 | 447 |
| MODIS | 663 | 1384 | 842.5 | FAIL: too big | FAIL: too big |
Table 5 shows the file sizes for the original HDF4 and converted files. The sizes were the same for each repetition.
The HDF4 and HDF5 file sizes are very similar, with the HDF5 file often slightly smaller than the HDF4 file. An exception is the MODIS file, which was significantly larger when converted to HDF5 (206 MB versus 138 MB). This may be due to the naive translation in this release of h4toh5: compression is not applied in HDF5, and data that is entirely fill-values is written into the HDF5 file.
As expected, the XML for an HDF5 file is much larger than the corresponding HDF5 file. However, the XML was not outrageously larger. Note that the XML will be the same whether the HDF5 is compressed or not, because the XML contains all the data values.
The XML compressed quite well with gzip, producing much smaller files.
Table 6 shows the ratio of the original HDF4 to the gzip compressed XML.
| File | HDF4 (Original), MB | Converted H5, MB | H5 converted to XML, MB | GZIP compressed XML, MB |
| AVHRR | 16.86 | 16.78 | 40 | 1.78 |
| CERES | 81.2 | 80.7 | 245 | 33 |
| MODIS | 138 | 206 | 313 | 82 |
Table 6 compares the size of the original HDF4 (which is similar to the HDF5) to XML and to compressed XML. The XML is at least 2.2 times as large as the HDF4, and at least 1.5 times as large as the HDF5. The XML compressed quite well with gzip, shrinking by a factor of at least 3.8. A comparison of the original HDF4 to the compressed XML shows that the compressed XML is smaller than the original HDF4 in every case.
Clearly, these ratios depend on the size of the original HDF4.
If the HDF4 file is effectively compressed (and the HDF5 is equally compressed)
the XML will be the same. If so, the compressed XML might not be
smaller than the compressed HDF4 or HDF5. It is impossible to predict
from this study.
| File | H4 / XML | XML / XML.gz | H4 / XML.gz |
| AVHRR | .42 | 22.5 | 9.47 |
| CERES | .33 | 7.4 | 2.46 |
| MODIS | .44 | 3.8 | 1.68 |
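The ratios in Table 6 follow directly from the sizes in Table 5. A short sketch of the arithmetic, with the sizes copied from Table 5 (the variable and function names are only illustrative):

```python
# Sizes in MB copied from Table 5: (HDF4, XML, gzip-compressed XML).
sizes_mb = {
    "AVHRR": (16.86, 40, 1.78),
    "CERES": (81.2, 245, 33),
    "MODIS": (138, 313, 82),
}


def size_ratios(h4, xml, xml_gz):
    """Return (H4 / XML, XML / XML.gz, H4 / XML.gz)."""
    return h4 / xml, xml / xml_gz, h4 / xml_gz


ratios = {name: size_ratios(*s) for name, s in sizes_mb.items()}
for name, (h4_xml, xml_gz, h4_gz) in ratios.items():
    print(f"{name}: H4/XML={h4_xml:.2f}  XML/XML.gz={xml_gz:.1f}  H4/XML.gz={h4_gz:.2f}")
```

A value of H4 / XML.gz greater than 1 means the compressed XML is smaller than the original HDF4 file.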
Table 7 shows the conversion rate, that is, the conversion time adjusted for the overall size of the input data. The HDF4 to HDF5 column gives the rate in MB/s using the size of the HDF4 file, and the HDF5 to XML column uses the size of the HDF5 file. The XML to H5 and XML.gz to H5 columns are computed from the size of the uncompressed XML file. The time difference attributable to the decompression is negligible.
It should be noted that the overall size of the input file is not a
completely valid indication of the amount of work done by the conversion.
The conversion process is affected by the number of objects in the files
and the size of individual objects, and possibly by other factors such
as the types of HDF4 objects, compression and chunking, and the size of the
largest object in the file.
| File | HDF4 to HDF5 (MB/s) | HDF5 to XML (MB/s) | XML to H5 (MB/s) | XML.gz to H5 (MB/s) |
| AVHRR | .8 | .05 | .41 | .38 |
| CERES | .29 | .18 | .41 | .52 |
| MODIS | .06 | .12 | n/a | n/a |
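The rate calculation itself is simply input size divided by elapsed time. The sketch below works one cell as an example, assuming the times in Table 4 are in seconds and the sizes in Table 5 are in MB; the function name is hypothetical.

```python
def rate_mb_per_s(size_mb, seconds):
    """Conversion rate: input size divided by elapsed wall clock time."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return size_mb / seconds


# AVHRR, HDF5 to XML: 16.78 MB of HDF5 (Table 5), 305 in Table 4 (assumed seconds).
avhrr_h5_to_xml = rate_mb_per_s(16.78, 305)
print(f"{avhrr_h5_to_xml:.3f} MB/s")  # prints "0.055 MB/s"
```

This is consistent with the .05 entry for AVHRR in Table 7. Cells recomputed from the rounded values in Tables 4 and 5 will not necessarily match the table exactly.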
The data is insufficient to determine the processing bottlenecks. However, informal observations indicate that these conversions are memory and IO intensive.
As expected, the XML file is much larger than the HDF5 file. However, the XML was never more than about three times the size of the HDF5 file, which is not as bad as might be feared. Of course, this isn't a large or representative enough sample to draw general conclusions.
As expected, the XML compressed very well with gzip. This can probably be attributed to the large amount of redundancy in the XML; it has lots of white space, repeated tags, etc.
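The effect of this redundancy is easy to demonstrate. The sketch below compresses a synthetic, highly repetitive XML-like text with Python's gzip module. The text is invented for illustration and is not output from h5dump; because it repeats identical data values, it compresses far better than the real files in Table 6.

```python
import gzip


def gzip_ratio(data: bytes) -> float:
    """Uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


# Synthetic text (assumption): indentation, repeated tags, identical values.
row = "        <Data>\n            0 0 0 0 0 0 0 0\n        </Data>\n"
xml_like = ("<Dataset>\n" + row * 2000 + "</Dataset>\n").encode("ascii")

ratio = gzip_ratio(xml_like)
print(f"{len(xml_like)} bytes compress by a factor of {ratio:.0f}")
```

Real data values vary far more than this sample, which is one reason the measured ratios ranged only from 3.8 to 22.5.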
In every case, the XML compressed to smaller than the original HDF4. Note, too, that parsing the compressed XML was not greatly slower than parsing the raw XML. If these findings prove to be general, they imply that XML need not be eschewed simply on the grounds that it takes too much storage space.
This study did not compare compressed XML to compressed HDF4 or HDF5. It would be expected that different types of binary data in HDF files will compress better or worse with a given compression method. Different XML files will also compress better or worse, but this might well depend on factors other than the original binary data type.