Storing Packed N-Bit Data in HDF5

Quincey Koziol
koziol@ncsa.uiuc.edu
October 25, 2004

  1. Document's Audience:

  2. Background Reading:

    The HDF5 reference manual sections for H5Tset_precision() and H5Tset_offset():
    H5Tset_precision
    H5Tset_offset
  3. Introduction:

    What is this document about?
    This document describes how HDF5 currently stores N-Bit datatypes on disk and explores methods for packing that data.

    How does a user create an N-Bit datatype?

    This sequence of calls creates a datatype describing a 12-bit bitfield stored in a 16-bit value:

                
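                    /* Start from a 16-bit little-endian bitfield type */
                    hid_t tid=H5Tcopy(H5T_STD_B16LE);
                    /* Keep only 12 significant bits */
                    H5Tset_precision(tid,12);
                    /* Place the lowest significant bit at bit 2 */
                    H5Tset_offset(tid,2);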
                
                

    The values for this example datatype are stored in memory and on disk as follows:

    First data value:
    Data        X    X    D011 D010 D09  D08  D07  D06  D05  D04  D03  D02  D01  D00  X    X
    Bit Offset  15   14   13   12   11   10   9    8    7    6    5    4    3    2    1    0

    Second data value:
    Data        X    X    D111 D110 D19  D18  D17  D16  D15  D14  D13  D12  D11  D10  X    X
    Bit Offset  15   14   13   12   11   10   9    8    7    6    5    4    3    2    1    0

    ...

    Here "Dnm" denotes the mth bit of the nth data value, and "X" denotes a bit where no data is stored (i.e. "padding").
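
    The layout above amounts to a shift and a mask. The following is a minimal sketch, not HDF5 library code: the helper names nbit_extract and nbit_insert are hypothetical, and the container is assumed to have already been loaded into a host integer, so byte order is not considered.

```c
#include <assert.h>
#include <stdint.h>

/* Return the data bits of an N-bit value held in a 16-bit container.
 * precision = number of significant bits, offset = position of the
 * lowest significant bit (hypothetical helper, not an HDF5 call). */
uint16_t nbit_extract(uint16_t container, unsigned precision, unsigned offset)
{
    assert(precision >= 1 && precision + offset <= 16);
    uint32_t mask = ((uint32_t)1 << precision) - 1;
    return (uint16_t)(((uint32_t)container >> offset) & mask);
}

/* Build a 16-bit container from a value, leaving the padding bits zero. */
uint16_t nbit_insert(uint16_t value, unsigned precision, unsigned offset)
{
    assert(precision >= 1 && precision + offset <= 16);
    uint32_t mask = ((uint32_t)1 << precision) - 1;
    return (uint16_t)(((uint32_t)value & mask) << offset);
}
```

    For the example datatype (precision 12, offset 2), nbit_insert(0xABC, 12, 2) gives 0x2AF0, and nbit_extract() recovers 0xABC from that container regardless of what the four padding bits hold.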


    How is "packed" N-Bit stored?

    For the example data above, packed data should be stored as:

    Data        D011 D010 D09  D08  D07  D06  D05  D04
    Bit Offset  7    6    5    4    3    2    1    0

    Data        D03  D02  D01  D00  D111 D110 D19  D18
    Bit Offset  7    6    5    4    3    2    1    0

    Data        D17  D16  D15  D14  D13  D12  D11  D10
    Bit Offset  7    6    5    4    3    2    1    0

    ...

    Note that the "padding" bits have been eliminated and multiple data values may occupy one byte of storage.
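
    The packed layout for the 12-bit example can be sketched with explicit shifts: every two values fill exactly three bytes. This is an illustrative sketch only; the function name pack12 is hypothetical, the input containers are assumed to be host integers, and a trailing odd value is assumed to be zero-padded in the low half of its last byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 12-bit values (stored at bit offset 2 of 16-bit containers) into
 * bytes, most significant bit first. Returns the number of bytes written.
 * out must have room for (count * 12 + 7) / 8 bytes. */
size_t pack12(const uint16_t *in, size_t count, uint8_t *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < count; i += 2) {
        int have_second = (i + 1 < count);
        /* Strip the padding bits from each container */
        unsigned a = (in[i] >> 2) & 0x0FFFu;
        unsigned b = have_second ? ((in[i + 1] >> 2) & 0x0FFFu) : 0u;

        out[n++] = (uint8_t)(a >> 4);                       /* D0[11:4]           */
        out[n++] = (uint8_t)(((a & 0x0Fu) << 4) | (b >> 8)); /* D0[3:0], D1[11:8] */
        if (have_second)
            out[n++] = (uint8_t)(b & 0xFFu);                /* D1[7:0]            */
    }
    return n;
}
```

    With input containers holding the values 0xABC and 0x123, the three output bytes are 0xAB, 0xC1 and 0x23, which matches the byte layout shown above.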


    What about packing compound datatypes with N-Bit fields?

    Ideally, compound datatypes with N-Bit fields would also pack into the minimum space required. For example, the following code creates a compound datatype with a 2-bit, a 3-bit and a 4-bit member, which would ideally pack into a 9-bit object on disk:

                
                    /* In-memory layout: one byte per member */
                    typedef struct {
                        unsigned char bit2;
                        unsigned char bit3;
                        unsigned char bit4;
                    } s1;

                    /* Each member starts as an 8-bit bitfield type */
                    hid_t bit2_tid=H5Tcopy(H5T_STD_B8LE);
                    hid_t bit3_tid=H5Tcopy(H5T_STD_B8LE);
                    hid_t bit4_tid=H5Tcopy(H5T_STD_B8LE);
                    hid_t compound_tid=H5Tcreate(H5T_COMPOUND,sizeof(s1));

                    /* Narrow each member to its precision and bit offset */
                    H5Tset_precision(bit2_tid,2);
                    H5Tset_offset(bit2_tid,4);
                    H5Tset_precision(bit3_tid,3);
                    H5Tset_offset(bit3_tid,2);
                    H5Tset_precision(bit4_tid,4);
                    H5Tset_offset(bit4_tid,1);

                    /* Add the members to the compound datatype */
                    H5Tinsert(compound_tid,"2bit",HOFFSET(s1,bit2),bit2_tid);
                    H5Tinsert(compound_tid,"3bit",HOFFSET(s1,bit3),bit3_tid);
                    H5Tinsert(compound_tid,"4bit",HOFFSET(s1,bit4),bit4_tid);
    
                
                

    Data for this datatype is currently stored like this:

    Data        X      X      2Bit01 2Bit00 X      X      X      X
    Bit Offset  7      6      5      4      3      2      1      0

    Data        X      X      X      3Bit02 3Bit01 3Bit00 X      X
    Bit Offset  7      6      5      4      3      2      1      0

    Data        X      X      X      4Bit03 4Bit02 4Bit01 4Bit00 X
    Bit Offset  7      6      5      4      3      2      1      0

    Data        X      X      2Bit11 2Bit10 X      X      X      X
    Bit Offset  7      6      5      4      3      2      1      0

    Data        X      X      X      3Bit12 3Bit11 3Bit10 X      X
    Bit Offset  7      6      5      4      3      2      1      0

    Data        X      X      X      4Bit13 4Bit12 4Bit11 4Bit10 X
    Bit Offset  7      6      5      4      3      2      1      0

    ...

    Ideally, this data would be stored like this:

    Data        2Bit01 2Bit00 3Bit02 3Bit01 3Bit00 4Bit03 4Bit02 4Bit01
    Bit Offset  7      6      5      4      3      2      1      0

    Data        4Bit00 2Bit11 2Bit10 3Bit12 3Bit11 3Bit10 4Bit13 4Bit12
    Bit Offset  7      6      5      4      3      2      1      0

    Data        4Bit11 4Bit10 2Bit21 2Bit20 3Bit22 3Bit21 3Bit20 4Bit23
    Bit Offset  7      6      5      4      3      2      1      0

    ...
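
    The ideal compound layout can be sketched as a bit writer that appends each member's significant bits in declaration order, so each record contributes 9 bits. This is an assumption-laden illustration, not HDF5 code: rec_t, put_bits and pack_records are hypothetical names, and the precisions and offsets are hard-coded from the example datatype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* In-memory record: one byte per member, as in the example datatype */
typedef struct {
    unsigned char bit2;  /* 2 significant bits at offset 4 */
    unsigned char bit3;  /* 3 significant bits at offset 2 */
    unsigned char bit4;  /* 4 significant bits at offset 1 */
} rec_t;

/* Append the low nbits bits of value to out, most significant bit first.
 * *bitpos counts bits from the top bit of out[0]; out must be zeroed. */
void put_bits(uint8_t *out, size_t *bitpos, unsigned value, unsigned nbits)
{
    for (unsigned b = nbits; b-- > 0; (*bitpos)++) {
        if ((value >> b) & 1u)
            out[*bitpos / 8] |= (uint8_t)(0x80u >> (*bitpos % 8));
    }
}

/* Pack count records at 9 bits each. Returns the number of bytes used.
 * out must have room for (count * 9 + 7) / 8 bytes. */
size_t pack_records(const rec_t *in, size_t count, uint8_t *out)
{
    size_t nbytes = (count * 9 + 7) / 8;
    size_t bitpos = 0;

    assert(in != NULL && out != NULL);
    memset(out, 0, nbytes);
    for (size_t i = 0; i < count; i++) {
        put_bits(out, &bitpos, (in[i].bit2 >> 4) & 0x3u, 2);
        put_bits(out, &bitpos, (in[i].bit3 >> 2) & 0x7u, 3);
        put_bits(out, &bitpos, (in[i].bit4 >> 1) & 0xFu, 4);
    }
    return nbytes;
}
```

    Two records occupy 18 bits, so pack_records() reports 3 bytes for them, compared with 6 bytes in the current unpacked layout.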


    What about packing array datatypes with N-Bit fields?
    They should also be packed together, in the same way as the N-Bit fields of compound datatypes above.

    The examples above are for packing data on disk. What about packed data in memory?
    Ideally, there would also be some way to specify that data is packed in memory, but that may be beyond the scope of this work.

  4. Ideas & Problems:

  5. Discussion:

    Given all the aspects of the problem above, I suggest implementing a "bitfield packing" filter for chunked datasets. This I/O filter would be a no-op for datatypes that are not bitfields, but would pack out the unused bits of data with a bitfield datatype.
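
    A filter of this kind needs a forward (pack) and a reverse (unpack) transform. The sketch below shows one possible pair for 16-bit containers. It is not the proposed implementation: bf_pack and bf_unpack are hypothetical names, the precision and offset are passed in directly, and the wiring into an HDF5 filter callback (registered through H5Zregister()) is omitted, as is the no-op path for non-bitfield datatypes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Forward direction: drop padding bits and pack the values MSB first.
 * Returns the packed size in bytes; out needs
 * (count * precision + 7) / 8 bytes. */
size_t bf_pack(const uint16_t *in, size_t count,
               unsigned precision, unsigned offset, uint8_t *out)
{
    size_t nbytes = (count * precision + 7) / 8;
    size_t bitpos = 0;

    assert(precision >= 1 && precision + offset <= 16);
    memset(out, 0, nbytes);
    for (size_t i = 0; i < count; i++) {
        for (unsigned b = precision; b-- > 0; bitpos++) {
            if (((unsigned)in[i] >> (offset + b)) & 1u)
                out[bitpos / 8] |= (uint8_t)(0x80u >> (bitpos % 8));
        }
    }
    return nbytes;
}

/* Reverse direction: rebuild the 16-bit containers, padding bits zeroed. */
void bf_unpack(const uint8_t *in, size_t count,
               unsigned precision, unsigned offset, uint16_t *out)
{
    size_t bitpos = 0;

    assert(precision >= 1 && precision + offset <= 16);
    for (size_t i = 0; i < count; i++) {
        unsigned v = 0;
        for (unsigned b = precision; b-- > 0; bitpos++) {
            if (in[bitpos / 8] & (0x80u >> (bitpos % 8)))
                v |= 1u << (offset + b);
        }
        out[i] = (uint16_t)v;
    }
}
```

    Note that the round trip preserves only the significant bits: whatever the padding bits held before packing comes back as zero, which is one design point a real filter would need to settle.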