Indexing in HDF5

Basics:

A bitmap index is a special kind of index that stores the bulk of its data as bitmaps and answers most queries by performing bitwise logical operations on these bitmaps. A short introduction to bitmap indexes can be found here. A more detailed description of bitmap indexes and the state of the art can be found in this paper. Section 2 of the paper covers the basics of bitmap indexes, and the rest of the paper describes multi-resolution bitmap indexes. This version of H5BIN has provisions for using multi-resolution bitmap indexes.
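
To make the idea concrete, the following is a minimal in-memory sketch of an equality-encoded bitmap index: one bit vector per distinct value, with a range query answered by OR-ing the bit vectors of the values in range. It is not taken from the H5BIN source. The names BitVector, BitmapIndex, build_index, bit_or and range_query are illustrative assumptions, and std::vector<bool> stands in for whatever bitmap representation a real index would use.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Illustrative types (assumptions, not H5BIN names).
// One bit vector per distinct value; bit i is set when row i holds that value.
using BitVector = std::vector<bool>;
using BitmapIndex = std::map<int, BitVector>;

// Build an equality-encoded bitmap index over a column of integers.
BitmapIndex build_index(const std::vector<int>& column) {
    BitmapIndex index;
    for (std::size_t row = 0; row < column.size(); ++row) {
        auto it = index.find(column[row]);
        if (it == index.end()) {
            it = index.emplace(column[row], BitVector(column.size(), false)).first;
        }
        it->second[row] = true;
    }
    return index;
}

// Bitwise OR of two equal-length bit vectors.
BitVector bit_or(const BitVector& a, const BitVector& b) {
    assert(a.size() == b.size());
    BitVector out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] || b[i];
    }
    return out;
}

// Answer lo <= value <= hi by OR-ing the bit vectors of every indexed value in range.
BitVector range_query(const BitmapIndex& index, std::size_t rows, int lo, int hi) {
    BitVector result(rows, false);
    for (auto it = index.lower_bound(lo); it != index.end() && it->first <= hi; ++it) {
        result = bit_or(result, it->second);
    }
    return result;
}
```

For the column {3, 1, 3, 2, 1}, the bit vector for value 3 has bits 0 and 2 set, and the query 1 <= value <= 2 sets bits 1, 3 and 4.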

Restrictions:

Indexing is currently limited to atomic datatypes. An extension to compound datatypes with one level of complexity is planned.

The response to a query is a bit vector, which can be traversed with an iterator provided by the API. For now, users have to convert the resulting positions into dataspaces themselves. A patch that converts bit vectors into dataspaces is in the works.
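
As a rough illustration of the step left to the user, the sketch below turns a bit vector into a list of matching positions and, alternatively, into (start, count) runs of consecutive positions, which is a form that is convenient to translate into a dataspace selection. It uses std::vector<bool> in place of the H5BIN bit vector and its iterator, and the function names set_positions and set_runs are assumptions made for this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Positions of all set bits, in increasing order.
std::vector<std::size_t> set_positions(const std::vector<bool>& bits) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) {
            out.push_back(i);
        }
    }
    return out;
}

// Collapse set bits into (start, count) runs of consecutive positions.
std::vector<std::pair<std::size_t, std::size_t>> set_runs(const std::vector<bool>& bits) {
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    std::size_t i = 0;
    while (i < bits.size()) {
        if (!bits[i]) {
            ++i;
            continue;
        }
        std::size_t start = i;
        while (i < bits.size() && bits[i]) {
            ++i;
        }
        runs.emplace_back(start, i - start);
    }
    return runs;
}
```

Runs are usually the more compact form when matches cluster together, while the plain position list is simpler when matches are scattered.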

The index currently supports integral datatypes only (byte, short, int, long long int). The bitmap has one bit vector per value in the integral range. In the future we plan to map floating point numbers to integers, by binning them and assigning one bit vector to each bin, so that floating point data can be indexed as well.
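
The planned binning could look roughly like the sketch below, which assumes equal-width bins over a known range [lo, hi] and clamps out-of-range values into the first or last bin. The bin layout, the function names bin_of and build_binned_index, and the clamping rule are all assumptions for illustration, since the planned design is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map x to one of num_bins equal-width bins over [lo, hi], numbered 0..num_bins-1.
// Values outside the range are clamped into the first or last bin (an assumption).
std::size_t bin_of(double x, double lo, double hi, std::size_t num_bins) {
    assert(num_bins > 0 && lo < hi);
    if (x <= lo) return 0;
    if (x >= hi) return num_bins - 1;
    double width = (hi - lo) / static_cast<double>(num_bins);
    std::size_t b = static_cast<std::size_t>((x - lo) / width);
    return b < num_bins ? b : num_bins - 1;  // guard against rounding at the top edge
}

// One bit vector per bin; bit i of bins[b] is set when column[i] falls in bin b.
std::vector<std::vector<bool>> build_binned_index(const std::vector<double>& column,
                                                  double lo, double hi,
                                                  std::size_t num_bins) {
    std::vector<std::vector<bool>> bins(num_bins,
                                        std::vector<bool>(column.size(), false));
    for (std::size_t i = 0; i < column.size(); ++i) {
        bins[bin_of(column[i], lo, hi, num_bins)][i] = true;
    }
    return bins;
}
```

With binning, a bit vector only says that a value lies somewhere in its bin, so a query whose bounds fall inside a bin would still need to check the raw values of the rows in that bin.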

Using Bitmap Indexes:

The bitmap index API, called the H5BIN API, is implemented as a C++ class with three public functions: the constructor, create and query. The constructor takes as its parameter the directory that contains the data and metadata files. The create function takes no parameters, while the query function takes a list of attributes and their associated bounds as input. The query function then builds a conjunctive query from the attributes and their bounds and executes it. To create the indexes, the create function needs to know which datasets to read and which indexes to build. This information is supplied through the metadata file. The default metadata file, which the user provides, consists of two datasets named Variables and Files. They contain:

Variables: This dataset consists of a table in which each row holds the name of a variable, the location inside a file where the variable is stored, its data type, and whether it is a nodal or elemental coordinate (only for meshes).
Files: This dataset consists of a list of the names of the files where the data to be indexed is stored.
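
The constructor, create and query pattern described above can be mimicked with a toy in-memory class, sketched below under several assumptions. ToyBitmapIndex and Bound are hypothetical names, not the H5BIN class or its signatures. The constructor takes the columns directly instead of a directory with data and metadata files, and bounds are treated as inclusive. The point is only to show how a conjunctive query ANDs together the per-attribute range results.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bound on one attribute: lo <= attribute <= hi (inclusive).
struct Bound {
    std::string attribute;
    int lo;
    int hi;
};

// Toy in-memory analogue with create() and query(); not the H5BIN class.
class ToyBitmapIndex {
public:
    // Takes the columns directly, where H5BIN takes a directory instead.
    explicit ToyBitmapIndex(std::map<std::string, std::vector<int>> columns)
        : columns_(std::move(columns)),
          rows_(columns_.empty() ? 0 : columns_.begin()->second.size()) {}

    // Build one bit vector per distinct value of every column.
    void create() {
        for (const auto& entry : columns_) {
            const std::vector<int>& values = entry.second;
            assert(values.size() == rows_);  // all columns must describe the same rows
            std::map<int, std::vector<bool>>& per_value = bitmaps_[entry.first];
            for (std::size_t row = 0; row < rows_; ++row) {
                auto it = per_value.find(values[row]);
                if (it == per_value.end()) {
                    it = per_value.emplace(values[row],
                                           std::vector<bool>(rows_, false)).first;
                }
                it->second[row] = true;
            }
        }
    }

    // Conjunctive query: AND together the range result of every bound.
    // An attribute that was never indexed matches no rows.
    std::vector<bool> query(const std::vector<Bound>& bounds) const {
        std::vector<bool> result(rows_, true);
        for (const Bound& b : bounds) {
            std::vector<bool> in_range(rows_, false);
            auto col = bitmaps_.find(b.attribute);
            if (col != bitmaps_.end()) {
                for (auto it = col->second.lower_bound(b.lo);
                     it != col->second.end() && it->first <= b.hi; ++it) {
                    for (std::size_t i = 0; i < rows_; ++i) {
                        if (it->second[i]) in_range[i] = true;
                    }
                }
            }
            for (std::size_t i = 0; i < rows_; ++i) {
                result[i] = result[i] && in_range[i];
            }
        }
        return result;
    }

private:
    std::map<std::string, std::vector<int>> columns_;
    std::size_t rows_;
    std::map<std::string, std::map<int, std::vector<bool>>> bitmaps_;
};
```

A caller would construct the object, call create() once, and then pass query() a list such as {{"a", 20, 40}, {"b", 2, 2}}, where "a" and "b" are placeholder attribute names. The returned bit vector has a bit set for every row that satisfies all of the bounds.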

Software:

The source code for the indexing software can be found here.