Scale and Offset Data Compression in HDF5

Quincey Koziol
koziol@ncsa.uiuc.edu
October 25, 2004

  1. Document's Audience:

  2. Introduction:

    What is this document about?
    This document describes a new method of compressing integer and floating-point data in HDF5.

    What is "scale and offset" compression?
    Scale and offset compression is loosely defined as a method for performing a scale and/or offset operation on each data value, followed by truncating the resulting value to a smaller number of bits before storing it.

    Will this compression method work on all HDF5 datatypes?
    No, this compression method will only work properly on integer or floating-point data values. I suppose a compound or array datatype consisting of just integers and/or floats would work, but making that work might be beyond what we should attempt initially.

  3. Ideas & Problems:

    I think that integer values are fairly easy to handle with a scale and offset followed by a truncation, and I've outlined some scenarios below.

    It's straightforward to determine the appropriate number of bits necessary to encode integer values, but floating-point values are quite a bit more difficult. After tossing around various ideas, I think it's more practical to adopt the GRiB data packing mechanism, outlined here: GRiB data packing method.

  4. Discussion:

    Automatic integer packing seems straightforward and shouldn't be terribly difficult to implement, even with the quirks necessary for handling fill-values properly. Allowing users to determine the number of bits to retain shouldn't add any significant complexity. I don't recommend having the application specify the scale, offset and number of bits, because those values would be global to the entire dataset and would not be able to adjust to local variations in the values for each chunk.

    The GRiB data packing method looks complex, but many of our user communities are familiar with it and would benefit from our implementing it properly.

    After we've got the basic algorithms working for integer and floating-point values, we might consider applying this form of compression to compound or array datatypes as well. This may be fairly complex, given that non-compressible datatypes (like strings, etc.) may be fields in a compound datatype as well; the compression algorithm should not operate on those fields, but should pass them along unmodified.