I think that integer values are fairly easy to handle with a scale and offset followed by a truncation, and I've outlined some scenarios below.
Integer values can be scanned by the library to determine the range of values to compress, and the minimum number of bits needed to encode them can be computed algorithmically. Consider the following array of values to compress:
4250 | 4261 | 4929 |
1021 | 4656 | 2712 |
3113 | 3118 | 2508 |
The minimum data value is 1021 and the maximum data value is 4929. Therefore the "span" of the values, (max - min) + 1, is 3909, and the minimum number of bits (MinBits) needed to store the values between the minimum and maximum values is ceiling(log2(span)) = 12.
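As a quick illustration, here's a minimal sketch in C of that scan and the MinBits calculation (the function and variable names are mine, not the library's):

    #include <stdio.h>

    int main(void)
    {
        unsigned values[9] = {4250, 4261, 4929, 1021, 4656, 2712, 3113, 3118, 2508};
        unsigned min = values[0], max = values[0];
        unsigned span, min_bits = 0;
        size_t   u;

        /* Scan the values to find the minimum and maximum */
        for (u = 1; u < 9; u++) {
            if (values[u] < min)
                min = values[u];
            if (values[u] > max)
                max = values[u];
        }

        /* span = (max - min) + 1 */
        span = (max - min) + 1;

        /* MinBits = ceiling(log2(span)), found by shifting */
        while ((1u << min_bits) < span)
            min_bits++;

        printf("min = %u, max = %u, span = %u, MinBits = %u\n", min, max, span, min_bits);
        /* Prints: min = 1021, max = 4929, span = 3909, MinBits = 12 */
        return 0;
    }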
Setting the offset for the compression method to the minimum value, the values to pack are generated by this equation:

    packed_n = (unpacked_n - offset) & ((1 << MinBits) - 1)

where n indicates the nth value to operate on. Each packed value (of MinBits size) is then stored contiguously in a buffer, without intervening bits between adjacent values.
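Here's a rough sketch in C of that packing step (the names and the least-significant-bit-first bit order within the buffer are my own choices for illustration, not necessarily what the library would actually do):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Append the low 'nbits' bits of 'val' to 'buf' starting at bit position
     * 'bit_pos', filling bits from the least-significant end of each byte.
     * Returns the new bit position.  A sketch only -- no overflow checking. */
    size_t
    pack_bits(uint8_t *buf, size_t bit_pos, uint32_t val, unsigned nbits)
    {
        unsigned u;

        for (u = 0; u < nbits; u++)
            if (val & (1u << u))
                buf[(bit_pos + u) / 8] |= (uint8_t)(1u << ((bit_pos + u) % 8));
        return bit_pos + nbits;
    }

    /* Pack 'nvals' values into 'buf' using the offset and MinBits computed above:
     * packed_n = (unpacked_n - offset) & ((1 << MinBits) - 1) */
    void
    pack_values(uint8_t *buf, const uint32_t *unpacked, size_t nvals,
                uint32_t offset, unsigned min_bits)
    {
        uint32_t mask = (1u << min_bits) - 1;
        size_t   bit_pos = 0;
        size_t   n;

        memset(buf, 0, (nvals * min_bits + 7) / 8);
        for (n = 0; n < nvals; n++)
            bit_pos = pack_bits(buf, bit_pos, (unpacked[n] - offset) & mask, min_bits);
    }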
To unpack the values, each packed value of MinBits size is zero extended to the original values' size in bits and then the offset is added to it.
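The corresponding unpacking step could be sketched like this (matching the bit order used in the packing sketch above):

    #include <stdint.h>
    #include <stddef.h>

    /* Read 'nbits' bits from 'buf' starting at bit position 'bit_pos', using the
     * same least-significant-bit-first order as the packing sketch. */
    uint32_t
    unpack_bits(const uint8_t *buf, size_t bit_pos, unsigned nbits)
    {
        uint32_t val = 0;
        unsigned u;

        for (u = 0; u < nbits; u++)
            if (buf[(bit_pos + u) / 8] & (1u << ((bit_pos + u) % 8)))
                val |= (1u << u);
        return val;     /* Already zero extended, since 'val' starts at 0 */
    }

    /* Unpack 'nvals' values: zero extend each MinBits-wide value and add the offset:
     * unpacked_n = packed_n + offset */
    void
    unpack_values(uint32_t *unpacked, const uint8_t *buf, size_t nvals,
                  uint32_t offset, unsigned min_bits)
    {
        size_t n;

        for (n = 0; n < nvals; n++)
            unpacked[n] = unpack_bits(buf, n * min_bits, min_bits) + offset;
    }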
The previous example was simplified by not taking fill-values into account. In practice that would not work well, because fill-values are frequently the minimum or maximum value encodable for a given number of bits, which would throw the MinBits calculation off, possibly giving no compression savings at all. If a fill-value is defined for a dataset, the fill-value should be ignored for the purposes of calculating the span of values, but the MinBits equation must be modified to: ceiling(log2(span + 1)), so that one extra packed code is available to represent the fill-value.
Additionally, the equations for computing the packed and unpacked values must be updated:

    packed_n   = (unpacked_n == fill-value) ? ((1 << MinBits) - 1) : (unpacked_n - offset)
    unpacked_n = (packed_n == ((1 << MinBits) - 1)) ? fill-value : (packed_n + offset)
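In code form, this amounts to reserving the all-ones packed code as a sentinel for the fill-value (a sketch; the function names are mine):

    #include <stdint.h>

    /* packed_n = (unpacked_n == fill_value) ? ((1 << MinBits) - 1)
     *                                       : (unpacked_n - offset)
     * MinBits is ceiling(log2(span + 1)), so the all-ones sentinel cannot
     * collide with a real (value - offset). */
    uint32_t
    pack_one(uint32_t unpacked, uint32_t fill_value, uint32_t offset, unsigned min_bits)
    {
        uint32_t sentinel = (1u << min_bits) - 1;

        return (unpacked == fill_value) ? sentinel : (unpacked - offset);
    }

    /* unpacked_n = (packed_n == ((1 << MinBits) - 1)) ? fill_value
     *                                                 : (packed_n + offset) */
    uint32_t
    unpack_one(uint32_t packed, uint32_t fill_value, uint32_t offset, unsigned min_bits)
    {
        uint32_t sentinel = (1u << min_bits) - 1;

        return (packed == sentinel) ? fill_value : (packed + offset);
    }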
In addition to having the library automatically determine the number of bits used to store each integer value, it would also make sense to allow applications to explicitly control the number of bits used to store the data values. In that case, instead of determining MinBits algorithmically as above, the application supplies a MinBits value which is used to truncate the packed values after the buffer's offset is subtracted. Obviously, if an application supplies a MinBits value smaller than the one the automatic encoding method would have calculated, the compression will be lossy.
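For example (a sketch using values from the example array above), supplying MinBits = 8 instead of the computed 12 truncates the high bits of any difference that doesn't fit in 8 bits:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t offset   = 1021;   /* Minimum value from the example array */
        unsigned min_bits = 8;      /* Application-supplied, smaller than the computed 12 */
        uint32_t mask     = (1u << min_bits) - 1;
        uint32_t value    = 4929;   /* Maximum value from the example array */

        uint32_t packed   = (value - offset) & mask;  /* High bits truncated */
        uint32_t restored = packed + offset;

        printf("original = %u, restored = %u\n", value, restored);
        /* original = 4929, restored = 1089 -- information has been lost */
        return 0;
    }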
It's been suggested that different regions of a dataset could retain different numbers of bits, but I think this would be very complex for users to specify in the general case and I wouldn't recommend it, at least initially.
Alternatively, we could give the application complete control over the scaling factor, the offset and the number of bits to retain, but a single, global set of those values may be too inflexible to suit all the data values in the dataset.
It's straightforward to determine the number of bits necessary to encode integer values, but floating-point values are quite a bit more difficult. After tossing around various ideas, I think it's more practical to adopt the GRiB data packing mechanism, outlined here: GRiB data packing method.
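For reference, my recollection of the core of GRiB's simple packing (a sketch only; see the linked document for the actual method) is that a value Y is stored as an integer X satisfying Y * 10^D = R + X * 2^E, where R is a reference value, D a decimal scale factor and E a binary scale factor:

    #include <math.h>
    #include <stdint.h>

    /* Sketch of GRiB-style simple packing for one floating-point value.
     * R = reference value (typically the scaled minimum of the field),
     * E = binary scale factor, D = decimal scale factor, X = packed integer.
     * Encode:  X = round((Y * 10^D - R) / 2^E)
     * Decode:  Y = (R + X * 2^E) / 10^D */
    uint32_t
    grib_pack(double y, double ref, int bin_scale, int dec_scale)
    {
        return (uint32_t)llround((y * pow(10.0, dec_scale) - ref) / pow(2.0, bin_scale));
    }

    double
    grib_unpack(uint32_t x, double ref, int bin_scale, int dec_scale)
    {
        return (ref + (double)x * pow(2.0, bin_scale)) / pow(10.0, dec_scale);
    }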
Automatic integer packing seems straightforward and shouldn't be terribly difficult to implement, even with the quirks necessary for handling fill-values properly. Allowing users to determine the number of bits to retain shouldn't add any significant complexity. I don't recommend having the application specify the scale, offset and number of bits, because those values would be global to the entire dataset and would not be able to adjust to local variations in the values for each chunk.
The GRiB data packing method looks complex, but many of our user communities are familiar with it and would benefit from our implementing it properly.
After we've got the basic algorithms working for integer and floating-point values, we might consider applying this form of compression to compound or array datatypes as well. This may be fairly complex, given that non-compressible datatypes (like strings, etc.) may be fields in a compound datatype, and the compression algorithm should not operate on them but just pass them along unmodified.