Request for Comment
Checksum for HDF5
Quincey Koziol & Mike Folk
October 26, 2002
A checksum is a computed value that depends on the contents of a block of data and that is transmitted or stored along with the data in order to detect corruption of the data. The receiving system re-computes the checksum based upon the received data and compares this value with the one sent with the data. If the two values are the same, the receiver has some confidence that the data was received correctly or was not corrupted during storage.
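The sketch below illustrates this store-and-verify cycle in C, using a deliberately simple additive checksum. It is meant only to make the mechanism concrete; it is not a proposal for the algorithm HDF5 would actually use.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Deliberately simple additive checksum, for illustration only. */
    static uint32_t simple_checksum(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    int main(void)
    {
        unsigned char block[64];
        memset(block, 0, sizeof(block));
        strcpy((char *)block, "example data block");

        /* Computed when the data is written or transmitted, and stored with it. */
        uint32_t stored = simple_checksum(block, sizeof(block));

        /* ... the data is stored or transmitted, then read back ... */

        /* Recomputed by the receiver and compared with the stored value. */
        uint32_t recomputed = simple_checksum(block, sizeof(block));
        if (recomputed == stored)
            printf("checksum OK: no corruption detected\n");
        else
            printf("checksum mismatch: data corrupted\n");
        return 0;
    }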
HDF5 users have expressed a desire for some method to verify when metadata and/or raw data stored in the file is corrupted. Using a checksum would allow these information corruption problems to be detected, but not necessarily corrected.
Corruption can occur in either the raw data or metadata components of an HDF5 file. The term “raw data” here refers to the stored elements in the array associated with a dataset. Metadata occurs in dataset headers, in the data structures used to store groups, in file headers, and in a user block.
Since raw data and metadata are organized and treated differently by the library, their checksums will also need to be treated differently. Also, since raw data typically constitutes many more bytes than metadata, different algorithms may be advisable for each.
Checksums in HDF5 will also be handled differently for chunked datasets vs. contiguous datasets. It is relatively easy to apply checksums to individual chunks, which means that partial reads and writes can be handled efficiently – only the chunks involved in a read or write need invoke any checksum calculations.
In contrast, when a part of a contiguous (i.e. non-chunked) dataset is read, its checksum can only be checked by reading the entire dataset. Likewise, when rewriting a part of a contiguous dataset, the use of checksums requires at a minimum that the data to be replaced must be read in order to re-compute a checksum.
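To make the difference concrete, the sketch below (with illustrative names and a placeholder checksum function, not part of any proposed HDF5 interface) contrasts the bookkeeping for the two layouts: a chunked dataset keeps one checksum per chunk and recomputes only the chunk touched by a partial write, while a contiguous dataset has a single checksum that must be recomputed over the entire array.

    #include <stdint.h>
    #include <stddef.h>

    /* Placeholder checksum; the algorithm eventually chosen would slot in here. */
    static uint32_t checksum(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        while (len--)
            sum += *buf++;
        return sum;
    }

    /* Hypothetical bookkeeping for a chunked dataset: one stored checksum per chunk. */
    typedef struct {
        size_t    chunk_size;       /* bytes per chunk                 */
        size_t    nchunks;          /* number of chunks in the dataset */
        uint32_t *chunk_checksums;  /* one stored value per chunk      */
    } chunked_checksums_t;

    /* After a partial write that touched only chunk `idx`, just that one
     * checksum is recomputed; the rest of the dataset is never read. */
    static void update_chunk_checksum(chunked_checksums_t *cs, size_t idx,
                                      const unsigned char *chunk_buf)
    {
        cs->chunk_checksums[idx] = checksum(chunk_buf, cs->chunk_size);
    }

    /* A contiguous dataset carries a single checksum, so even a one-byte
     * change forces the whole array to be read and re-checksummed. */
    static void update_contiguous_checksum(uint32_t *stored,
                                           const unsigned char *whole_dataset,
                                           size_t dataset_size)
    {
        *stored = checksum(whole_dataset, dataset_size);
    }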
Our goal with this implementation is just to provide a simple, raw capability designed to support large, one-time writes of data. It must work in parallel, but can require the use of chunking. It does not have to work efficiently for applications that make partial changes to problem-sized data. The following usage examples describe the cases that we anticipate supporting.
Here are some usage examples that we propose to support in version 1:
Uses that we do not plan to support in version 1:
We propose that checksums always be used for metadata, but be optional for raw data. We recommend checksums for all metadata because it would make the coding (and code maintenance) simpler, and because metadata objects are small enough that the extra CPU cost is quite small compared to the actual I/O operations that are performed. (This assertion needs to be verified.)
Raw data, in contrast, can be very large, and it seems likely that checksum calculations could appreciably increase the cost of I/O operations, particularly partial I/O operations on contiguous datasets. As in the metadata situation, some performance testing would need to be done before we can be sure of when the cost of checksum use is significant.
Because of the different nature of metadata vs. dataset array data structures, it seems advisable to consider two checksum algorithms, one for each. For the sake of simplicity, it would be best to choose only one algorithm for metadata and one for array data. We currently have no particular checksum algorithm in mind, and welcome suggestions.
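As one example of the kind of suggestion we are soliciting for array data, the sketch below shows Fletcher-32, a checksum that is cheaper to compute than a full CRC while still catching common corruption patterns. Treating the buffer as 16-bit words, and how to handle odd-length buffers, are details that would still need to be specified; this is only a candidate, not a decision.

    #include <stdint.h>
    #include <stddef.h>

    /* Fletcher-32 over a buffer of 16-bit words. */
    uint32_t fletcher32(const uint16_t *data, size_t words)
    {
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (words) {
            /* Process in blocks small enough that the 32-bit sums cannot
             * overflow before the modular reduction below. */
            size_t tlen = words > 359 ? 359 : words;
            words -= tlen;
            do {
                sum1 += *data++;
                sum2 += sum1;
            } while (--tlen);
            sum1 = (sum1 & 0xffff) + (sum1 >> 16);
            sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction so both sums fit in 16 bits. */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return (sum2 << 16) | sum1;
    }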
Characteristics of the HDF5 library and format make it more difficult to support checksumming for metadata than for raw data. Raw data checksumming involves less substantial changes to the file format, and could probably be implemented more quickly. Therefore, we propose a two-phase process: in Phase 1 we implement checksumming for dataset arrays, and in Phase 2 we implement checksumming for HDF5 metadata.
Here is a tentative plan of action to checksum raw data.
· Research appropriate algorithm(s) for checksumming large amounts of information. The choice of algorithm should be based on its speed and its ability to handle very large data arrays.
· Propose format of new "Checksum" object header message for storing in HDF5 object headers of datasets whose raw data is to be checksummed.
· Propose and agree on API extensions for enabling checksumming on a per-object basis in the file. This will most likely be done with a new dataset creation property (a usage sketch follows this list).
· Enhance internal library functions to create/update checksum when raw data is written to an object.
· Propose and agree on additional API functions for verifying a checksum on the raw data for an object, etc.
· Implement new API functions.
· Write regression tests for new features.
· Write documentation, tutorial material, etc.
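As noted in the API item above, enabling checksums might look something like the following from the application's point of view. H5Pset_checksum() is hypothetical; the actual property name and arguments would be settled when the API extensions are agreed on. The remaining calls are existing HDF5 functions.

    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2]  = {1000, 1000};
        hsize_t chunk[2] = {100, 100};

        hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);   /* checksums would be kept per chunk     */
        H5Pset_checksum(dcpl);          /* hypothetical new creation property    */

        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, space, dcpl);

        /* ... H5Dwrite() calls would create/update the stored checksums ... */

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }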
Here is a tentative plan of action to checksum metadata.
· Research appropriate algorithm(s) for checksumming small objects. Choice of an algorithm will be influenced by the algorithm chosen for raw data checksumming, but that algorithm may not be appropriate for the smaller amounts of information contained in the file's metadata.
· Propose format change to incorporate a checksum field into the different types of metadata stored in a file (an illustrative layout follows this list).
· Enhance internal library functions to create/update checksum when metadata is written to a file.
· Write regression tests for new features.
· Write documentation, etc.
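As noted in the format-change item above, one illustrative way to carry the checksum is as a trailing field on each encoded metadata block, sealed at write time and verified at read time. The layout and names below are placeholders; the actual format change would be specified separately for each metadata type (object headers, B-tree nodes, heaps, and so on).

    #include <stdint.h>
    #include <stddef.h>

    /* Placeholder checksum; the algorithm chosen for metadata would slot in here. */
    static uint32_t checksum(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        while (len--)
            sum += *buf++;
        return sum;
    }

    /* Illustrative layout only: a metadata block carrying its checksum
     * in a trailing field. */
    typedef struct {
        unsigned char body[256];   /* the encoded metadata itself */
        uint32_t      stored_sum;  /* checksum over `body`        */
    } metadata_block_t;

    /* Called when the block is written to the file. */
    static void metadata_block_seal(metadata_block_t *blk)
    {
        blk->stored_sum = checksum(blk->body, sizeof(blk->body));
    }

    /* Called when the block is read back; returns 0 if intact, -1 if the
     * recomputed value disagrees with the stored one. */
    static int metadata_block_verify(const metadata_block_t *blk)
    {
        return checksum(blk->body, sizeof(blk->body)) == blk->stored_sum ? 0 : -1;
    }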