I think that integer values are fairly easy to handle with a scale and offset followed by a truncation, and I've outlined some scenarios below.
Integer values can be scanned by the library to determine the range of values to compress, and the minimum number of bits needed to encode them can be computed algorithmically. Consider the following array of values to compress:
4250 | 4261 | 4929 |
1021 | 4656 | 2712 |
3113 | 3118 | 2508 |
The minimum data value is 1021 and the maximum data value is 4929. Therefore the "span" of the values, (max - min) + 1, is 3909, and the minimum number of bits (MinBits) needed to store the values between the minimum and maximum values is ceiling(log2(span)) = 12.
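As a quick illustration, here's a minimal sketch in C of that scan and the MinBits calculation (the function and variable names are mine, not the library's):

    #include <stdio.h>

    int main(void)
    {
        unsigned values[9] = {4250, 4261, 4929, 1021, 4656, 2712, 3113, 3118, 2508};
        unsigned min = values[0], max = values[0];
        unsigned span, min_bits = 0;
        size_t   u;

        /* Scan the values to find the minimum and maximum */
        for (u = 1; u < 9; u++) {
            if (values[u] < min)
                min = values[u];
            if (values[u] > max)
                max = values[u];
        }

        /* span = (max - min) + 1 */
        span = (max - min) + 1;

        /* MinBits = ceiling(log2(span)), found by shifting */
        while ((1u << min_bits) < span)
            min_bits++;

        printf("min = %u, max = %u, span = %u, MinBits = %u\n", min, max, span, min_bits);
        /* Prints: min = 1021, max = 4929, span = 3909, MinBits = 12 */
        return 0;
    }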
Setting the offset for the compression method to the minimum value, the values to pack are generated by this equation:

    packed_n = (unpacked_n - offset) & ((1 << MinBits) - 1)

where n indicates the nth value to operate on. Each packed value (of MinBits size) is then stored contiguously in a buffer, without intervening bits between adjacent values.
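Here's a rough sketch in C of that packing step (the names and the least-significant-bit-first bit order within the buffer are my own choices for illustration, not necessarily what the library would actually do):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Append the low 'nbits' bits of 'val' to 'buf' starting at bit position
     * 'bit_pos', filling bits from the least-significant end of each byte.
     * Returns the new bit position.  A sketch only -- no overflow checking. */
    size_t
    pack_bits(uint8_t *buf, size_t bit_pos, uint32_t val, unsigned nbits)
    {
        unsigned u;

        for (u = 0; u < nbits; u++)
            if (val & (1u << u))
                buf[(bit_pos + u) / 8] |= (uint8_t)(1u << ((bit_pos + u) % 8));
        return bit_pos + nbits;
    }

    /* Pack 'nvals' values into 'buf' using the offset and MinBits computed above:
     * packed_n = (unpacked_n - offset) & ((1 << MinBits) - 1) */
    void
    pack_values(uint8_t *buf, const uint32_t *unpacked, size_t nvals,
                uint32_t offset, unsigned min_bits)
    {
        uint32_t mask = (1u << min_bits) - 1;
        size_t   bit_pos = 0;
        size_t   n;

        memset(buf, 0, (nvals * min_bits + 7) / 8);
        for (n = 0; n < nvals; n++)
            bit_pos = pack_bits(buf, bit_pos, (unpacked[n] - offset) & mask, min_bits);
    }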
To unpack the values, each packed value of MinBits size is zero extended to the original values' size in bits and then the offset is added to it.
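The corresponding unpacking step could be sketched like this (matching the bit order used in the packing sketch above):

    #include <stdint.h>
    #include <stddef.h>

    /* Read 'nbits' bits from 'buf' starting at bit position 'bit_pos', using the
     * same least-significant-bit-first order as the packing sketch. */
    uint32_t
    unpack_bits(const uint8_t *buf, size_t bit_pos, unsigned nbits)
    {
        uint32_t val = 0;
        unsigned u;

        for (u = 0; u < nbits; u++)
            if (buf[(bit_pos + u) / 8] & (1u << ((bit_pos + u) % 8)))
                val |= (1u << u);
        return val;     /* Already zero extended, since 'val' starts at 0 */
    }

    /* Unpack 'nvals' values: zero extend each MinBits-wide value and add the offset:
     * unpacked_n = packed_n + offset */
    void
    unpack_values(uint32_t *unpacked, const uint8_t *buf, size_t nvals,
                  uint32_t offset, unsigned min_bits)
    {
        size_t n;

        for (n = 0; n < nvals; n++)
            unpacked[n] = unpack_bits(buf, n * min_bits, min_bits) + offset;
    }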
The previous example was simplified by not taking fill-values into account. In practice that would not work well, because fill-values are frequently the minimum or maximum value encodable for a given number of bits, which would throw the MinBits calculation off, possibly giving no compression savings at all. If a fill-value is defined for a dataset, the fill-value should be ignored for the purposes of calculating the span of values, but the MinBits equation must be modified to: ceiling(log2(span + 1)), so that one extra packed code is available to represent the fill-value.
Additionally, the equations for computing the packed and unpacked values must be updated:

    packed_n   = (unpacked_n == fill-value) ? ((1 << MinBits) - 1) : (unpacked_n - offset)
    unpacked_n = (packed_n == ((1 << MinBits) - 1)) ? fill-value : (packed_n + offset)
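In code form, this amounts to reserving the all-ones packed code as a sentinel for the fill-value (a sketch; the function names are mine):

    #include <stdint.h>

    /* packed_n = (unpacked_n == fill_value) ? ((1 << MinBits) - 1)
     *                                       : (unpacked_n - offset)
     * MinBits is ceiling(log2(span + 1)), so the all-ones sentinel cannot
     * collide with a real (value - offset). */
    uint32_t
    pack_one(uint32_t unpacked, uint32_t fill_value, uint32_t offset, unsigned min_bits)
    {
        uint32_t sentinel = (1u << min_bits) - 1;

        return (unpacked == fill_value) ? sentinel : (unpacked - offset);
    }

    /* unpacked_n = (packed_n == ((1 << MinBits) - 1)) ? fill_value
     *                                                 : (packed_n + offset) */
    uint32_t
    unpack_one(uint32_t packed, uint32_t fill_value, uint32_t offset, unsigned min_bits)
    {
        uint32_t sentinel = (1u << min_bits) - 1;

        return (packed == sentinel) ? fill_value : (packed + offset);
    }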
In addition to having the library automatically determine the number of bits used to store each integer value, it would also make sense to allow applications to explicitly control the number of bits used to store the data values. In that case, instead of determining MinBits algorithmically as above, the application supplies a MinBits value which is used to truncate the packed values after the buffer's offset is subtracted. Obviously, if an application supplies a MinBits value smaller than the one the automatic encoding method would have calculated, the compression will be lossy.
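For example (a sketch using values from the example array above), supplying MinBits = 8 instead of the computed 12 truncates the high bits of any difference that doesn't fit in 8 bits:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t offset   = 1021;   /* Minimum value from the example array */
        unsigned min_bits = 8;      /* Application-supplied, smaller than the computed 12 */
        uint32_t mask     = (1u << min_bits) - 1;
        uint32_t value    = 4929;   /* Maximum value from the example array */

        uint32_t packed   = (value - offset) & mask;  /* High bits truncated */
        uint32_t restored = packed + offset;

        printf("original = %u, restored = %u\n", value, restored);
        /* original = 4929, restored = 1089 -- information has been lost */
        return 0;
    }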
It's been suggested that different regions of a dataset could retain different numbers of bits, but I think this would be very complex for users to specify in the general case and I wouldn't recommend it, at least initially.
Alternatively, we could give the application complete control over the scaling factor, the offset and the number of bits to retain, but a single, global set of those values may be too inflexible to suit all the data values in the dataset.
It's straightforward to determine the number of bits necessary to encode integer values, but floating-point values are quite a bit more difficult. After tossing around various ideas, I think it's more practical to adopt the GRiB data packing mechanism, outlined here: GRiB data packing method.
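For reference, my recollection of the core of GRiB's simple packing (a sketch only; see the linked document for the actual method) is that a value Y is stored as an integer X satisfying Y * 10^D = R + X * 2^E, where R is a reference value, D a decimal scale factor and E a binary scale factor:

    #include <math.h>
    #include <stdint.h>

    /* Sketch of GRiB-style simple packing for one floating-point value.
     * R = reference value (typically the scaled minimum of the field),
     * E = binary scale factor, D = decimal scale factor, X = packed integer.
     * Encode:  X = round((Y * 10^D - R) / 2^E)
     * Decode:  Y = (R + X * 2^E) / 10^D */
    uint32_t
    grib_pack(double y, double ref, int bin_scale, int dec_scale)
    {
        return (uint32_t)llround((y * pow(10.0, dec_scale) - ref) / pow(2.0, bin_scale));
    }

    double
    grib_unpack(uint32_t x, double ref, int bin_scale, int dec_scale)
    {
        return (ref + (double)x * pow(2.0, bin_scale)) / pow(10.0, dec_scale);
    }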
Automatic integer packing seems straightforward and shouldn't be terribly difficult to implement, even with the quirks necessary for handling fill-values properly. Allowing users to determine the number of bits to retain shouldn't add any significant complexity. I don't recommend having the application specify the scale, offset and number of bits, because those values would be global to the entire dataset and would not be able to adjust to local variations in the values for each chunk.
The GRiB data packing method looks complex, but many of our user communities are familiar with it and would benefit from our implementing it properly.
After we've got the basic algorithms working for integer and floating-point values, we might consider applying this form of compression to compound or array datatypes as well. This may be fairly complex, given that non-compressible datatypes (like strings, etc.) may be fields in a compound datatype, and the compression algorithm should not operate on them but just pass them along unmodified.