Request for Comment
Checksum for HDF5

Quincey Koziol & Mike Folk

October 26, 2002

Background

A checksum is a computed value that depends on the contents of a block of data and that is transmitted or stored along with the data in order to detect corruption of the data.  The receiving system re-computes the checksum based upon the received data and compares this value with the one sent with the data. If the two values are the same, the receiver has some confidence that the data was received correctly or was not corrupted during storage.
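For illustration only, here is a minimal sketch of this compute/store/re-compute/compare cycle in C, using a trivial additive checksum; it is not a proposal for the algorithm HDF5 would actually use.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustration only: a trivial additive checksum over a byte buffer. */
    static uint32_t simple_checksum(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    int main(void)
    {
        unsigned char data[] = "example block of data";

        /* Writer: compute the checksum and store (or transmit) it with the data. */
        uint32_t stored = simple_checksum(data, sizeof(data));

        /* ... the data could be corrupted here, in transit or on disk ... */

        /* Reader: re-compute the checksum and compare it with the stored value. */
        uint32_t recomputed = simple_checksum(data, sizeof(data));
        printf("data %s\n", (recomputed == stored) ? "appears intact" : "is corrupted");
        return 0;
    }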

HDF5 users have expressed a desire for some method to detect when metadata and/or raw data stored in a file has been corrupted.  Using a checksum would allow such corruption to be detected, though not necessarily corrected.

Corruption can occur in either the raw data or metadata components of an HDF5 file.  The term “raw data” here refers to the stored elements in the array associated with a dataset.  Metadata occurs in dataset headers, in the data structures used to store groups, in file headers, and in a user block.

Since raw data and metadata are organized and treated differently by the library, their checksums will also need to be treated differently.  Also, since raw data typically constitutes many more bytes than metadata, different algorithms may be advisable for each.

Checksums in HDF5 will also be handled differently for chunked datasets vs. contiguous datasets.  It is relatively easy to apply checksums to individual chunks, which means that partial reads and writes can be handled efficiently – only the chunks involved in a read or write require any checksum calculations.

In contrast, when a part of a contiguous (i.e. non-chunked) dataset is read, its checksum can only be verified by reading the entire dataset.  Likewise, when rewriting a part of a contiguous dataset, the use of checksums requires, at a minimum, that the data being replaced be read in order to re-compute the checksum.
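To illustrate the “at a minimum” point: even with a checksum simple enough to be updated incrementally (a plain byte sum, used here purely for illustration), the bytes being replaced still have to be read back so that their contribution can be removed before the new bytes are added in.  A sketch, not a proposed implementation:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustration only: the same trivial additive checksum used in the
     * example above. */
    static uint32_t byte_sum(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    /* Update a stored checksum for a partial overwrite of a contiguous
     * dataset.  'old_bytes' are the bytes currently on disk in the region
     * being overwritten; they must be read before their contribution can
     * be removed from the running sum. */
    static uint32_t update_checksum(uint32_t stored_sum,
                                    const unsigned char *old_bytes,
                                    const unsigned char *new_bytes,
                                    size_t len)
    {
        return stored_sum - byte_sum(old_bytes, len) + byte_sum(new_bytes, len);
    }

For most realistic checksum algorithms even this incremental update is not possible, which is why version 1 does not attempt to support partial rewrites of contiguous data (see the list of unsupported uses below).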

Usage examples

Our goal with this implementation is just to provide a simple, raw capability designed to support large, one-time writes of data.  It must work in parallel, but can require the use of chunking.  It does not have to work efficiently for applications that make partial changes to problem-sized data.  The following usage examples describe the cases that we anticipate having to support.

Here are some usage examples that we propose to support in version 1:

  1. An application writes HDF5 datasets to a file, and this application or a later application must determine if corruption occurred when writing.  Corruption could occur in either the raw data or the metadata of an object.  A checksum is computed on each object before it is written to the HDF5 file.  When any object is read later, the checksum is used to determine the validity of the data.
  2. A parallel application using MPI-IO creates a file with a dataset.  Checksums are created for all data and metadata.  (This is the same as use case #1, but in parallel.)
  3. An application writes a dataset that is stored in chunks, and writes a chunk at a time.  In this case, each chunk has its own checksum, so that its validity can be checked whenever the chunk is read back in.
  4. An application does a partial write to a dataset that is chunked, and the partial write changes the data in several chunks.  In this case, a new checksum is computed for every chunk that is altered during the partial write.
  5. A parallel application writes to a dataset that is stored in chunks, with each chunk being written by a single processor.  No two processors write to the same chunk.  In this case a checksum is computed for each chunk that is accessed, similar to cases 2 & 3.
  6. An application writes to part of a dataset that is stored contiguously.  Since checksums are not computed for subsets of non-chunked arrays, a new checksum is computed by reading back the entire dataset, re-computing the checksum, then re-writing the dataset. 
  7. An application writes a compressed dataset.  A checksum is computed on the uncompressed data, before compression occurs.  The data must be uncompressed before it can be validated using the checksum.
  8. An application computes, or re-computes, the checksum of an existing dataset.  The entire dataset may have to be read into memory in order to do this.  An HDF5 library routine is available to do this.
  9. An application writes data, but does not want to have a checksum computed.  An HDF5 library routine is available to turn off the checksum computation for a dataset.  If it is invoked, the dataset (or chunk) has no checksum.  (Should this also be extended to other containers, such as a group or a file?)
  10. An application queries whether a checksum exists and, if it does, queries the value of that checksum.  A routine is available to ascertain whether or not a checksum exists, and another HDF5 library routine is available to query the value of a checksum.  (A rough API sketch for examples 9 and 10 follows this list.)
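Here is a rough sketch of how usage examples 9 and 10 might look at the API level.  The functions H5Pset_checksum and H5Dget_checksum are placeholders invented for this sketch; the actual names and signatures are part of what this RFC is soliciting feedback on.

    #include <stdint.h>
    #include "hdf5.h"

    /* Hypothetical API extensions -- these routines do not exist in the
     * library; the names and signatures are placeholders only. */
    herr_t H5Pset_checksum(hid_t dcpl_id, unsigned enable);
    herr_t H5Dget_checksum(hid_t dset_id, unsigned *exists, uint32_t *value);

    void checksum_usage_sketch(hid_t file_id)
    {
        hsize_t dims[1] = {1000};
        hid_t   space_id = H5Screate_simple(1, dims, NULL);

        /* Usage example 9: turn checksumming off for one dataset via a
         * dataset creation property (hypothetical call). */
        hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_checksum(dcpl_id, 0 /* disabled */);

        hid_t dset_id = H5Dcreate(file_id, "no_checksum_data", H5T_NATIVE_INT,
                                  space_id, dcpl_id);

        /* Usage example 10: query whether a checksum exists and, if so,
         * retrieve its value (hypothetical call). */
        unsigned exists = 0;
        uint32_t value  = 0;
        H5Dget_checksum(dset_id, &exists, &value);

        H5Dclose(dset_id);
        H5Pclose(dcpl_id);
        H5Sclose(space_id);
    }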

Uses that we do not plan to support in version 1:

  1. A part of a contiguously stored array is read.  Provide a checksum to tell whether the data is valid.  (We do not propose to support the use of checksum when reading only part of a contiguously stored dataset, since this would complicate the algorithm, and it does not seem to be sufficiently important to our users.)
  2. A part of a contiguously stored array is changed.  Re-compute the checksum without reading in the entire array. (We do not propose to support this operation, since this would complicate the algorithm, and it does not seem to be sufficiently important to our users.)
  3. Make it possible to use different checksum algorithms, including algorithms supplied by applications.  One good reason to support this would be to allow evolution to newer, better checksum algorithms that might emerge in the future.  (We propose not to support this operation because it complicates the implementation and does not seem sufficiently important to users at this time.  However, we should strive to make it possible to add this feature in the future.)
  4. Support the use of checksum for all VFL drivers.

When should checksums be required?

We propose that checksums always be used for metadata, but be optional for raw data.  We recommend checksums for all metadata because it would make the coding (and code maintenance) simpler, and because metadata objects are small enough that the extra CPU cost is quite small compared to the actual I/O operations that are performed.  (This assertion needs to be verified.)

Raw data, in contrast, can be very large, and it seems likely that checksum calculations could appreciably increase the cost of I/O operations, particularly partial  I/O operations on contiguous datasets.  As in the metadata situation, some performance testing would need to be done before we can be sure of when the cost of checksum use is significant.

Choosing checksum algorithms

Because of the different nature of metadata vs. dataset array data structures, it seems advisable to consider two checksum algorithms, one for each.  For the sake of simplicity, it would be best to choose only one algorithm for metadata and one for array data.  We currently have no particular checksum algorithm in mind, and welcome suggestions.
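As one concrete candidate for discussion (not a decision), the Fletcher family of checksums may be worth benchmarking for large arrays, since it can be computed with only integer additions in a single pass over the data.  A sketch of Fletcher-32 over a buffer of 16-bit words:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustration only: Fletcher-32 computed over 16-bit words. */
    uint32_t fletcher32(const uint16_t *data, size_t words)
    {
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (words) {
            /* Process in blocks small enough that the 32-bit accumulators
             * cannot overflow before the reductions below. */
            size_t block = (words > 359) ? 359 : words;
            words -= block;
            do {
                sum1 += *data++;
                sum2 += sum1;
            } while (--block);
            sum1 = (sum1 & 0xffff) + (sum1 >> 16);
            sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Final reductions so that each sum fits in 16 bits. */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return (sum2 << 16) | sum1;
    }

Whether something in this family is fast enough and strong enough for very large arrays, and whether a different algorithm is more appropriate for small metadata objects, is exactly the kind of feedback we are asking for.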

Proposed plan of action

Characteristics of the HDF5 library and format make it more difficult to support checksumming for metadata than for raw data.  Raw data checksumming involves less substantial changes to the file format, and could probably be implemented more quickly.  Therefore, we propose a two-phase process: in Phase 1, we implement checksumming for dataset arrays, and in Phase 2 we implement checksumming for HDF5 metadata.

Implementing checksum for raw data

Here is a tentative plan of action to checksum raw data.

·       Research appropriate algorithm(s) for checksumming large amounts of information.  The choice of an algorithm should be based on the speed of the algorithm and its ability to handle very large data arrays.

·       Propose format of new "Checksum" object header message for storing in HDF5 object headers of datasets whose raw data is to be checksummed.  (A rough sketch of one possible message layout follows this list.)

·       Propose and agree on API extensions for enabling checksumming on a per-object basis in the file.  This will most likely be done with a new dataset creation property.

·       Enhance internal library functions to create/update checksum when raw data is written to an object.

·       Propose and agree on additional API functions for verifying a checksum on the raw data for an object, etc.

·       Implement new API functions.

·       Write regression tests for new features.

·       Write documentation, tutorial material, etc.
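As a starting point for discussion, here is a rough sketch of what the in-memory form of such a "Checksum" object header message might look like.  Every name, field, and size here is an assumption made for illustration, not a format decision.

    #include <stdint.h>

    /* Hypothetical in-memory form of a "Checksum" object header message.
     * All names, fields, and sizes are placeholders for discussion. */
    typedef struct H5O_checksum_t {
        uint8_t  version;    /* message version, to allow future changes     */
        uint8_t  algorithm;  /* which checksum algorithm was used            */
        uint8_t  flags;      /* e.g. whether checksums are stored per chunk  */
        uint32_t value;      /* checksum of the raw data for the contiguous
                              * case; per-chunk values would more likely be
                              * stored with the chunk index than in this
                              * message                                      */
    } H5O_checksum_t;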

Implementing checksum for metadata

Here is a tentative plan of action to checksum metadata.

·       Research appropriate algorithm(s) for checksumming small objects.  The choice of an algorithm will be influenced by the algorithm chosen for raw data checksumming, but that algorithm may not be appropriate for the smaller amounts of information contained in the file's metadata.

·       Propose format change to incorporate a checksum field into the different types of metadata stored in a file.

·       Enhance internal library functions to create/update checksum when metadata is written to a file.  (A minimal sketch of the corresponding write and read paths follows this list.)

·       Write regression tests for new features.

·       Write documentation, etc.
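To make the format-change and library-enhancement bullets above concrete, here is a minimal sketch of the write and read paths they imply, using placeholder helper names and a placeholder checksum; the real code would live in the library's metadata layer and use whatever algorithm and encoding are finally chosen.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Placeholder: stands in for whatever metadata checksum algorithm is
     * chosen (a trivial byte sum here, just so the sketch is complete). */
    static uint32_t metadata_checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    /* Write path sketch: copy the encoded metadata into the disk buffer and
     * append its checksum so the two are stored together.  Byte order and
     * exact layout are placeholders. */
    static size_t write_metadata_block(uint8_t *disk_buf,
                                       const uint8_t *encoded, size_t encoded_len)
    {
        uint32_t sum = metadata_checksum(encoded, encoded_len);

        memcpy(disk_buf, encoded, encoded_len);
        memcpy(disk_buf + encoded_len, &sum, sizeof(sum));
        return encoded_len + sizeof(sum);
    }

    /* Read path sketch: re-compute the checksum over the metadata portion
     * and compare it with the stored value; returns non-zero if the block
     * appears intact. */
    static int verify_metadata_block(const uint8_t *disk_buf, size_t total_len)
    {
        size_t   encoded_len = total_len - sizeof(uint32_t);
        uint32_t stored;

        memcpy(&stored, disk_buf + encoded_len, sizeof(stored));
        return metadata_checksum(disk_buf, encoded_len) == stored;
    }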