Indexing in HDF5


When Kepler formulated his famous laws of planetary motion, he relied on data painstakingly gathered by his mentor Tycho Brahe. With today's advances in sensor, automation, and computation technologies, gathering scientific data has become far easier. Large data sets produced by observation or simulation allow scientists to formulate new theories and to test existing theories against actual data. As a result, scientists now produce and consume very large amounts of data.

For example, the Sloan Digital Sky Survey currently offers 12 terabytes of data, which scientists across the world use to study astronomical phenomena and discover new celestial bodies. Looking at the earth instead of the sky, NASA's Earth Observing System produces 3 terabytes of data per day. Both of these are observational data sets, but simulation data sets do not lag far behind in size: each complex run at the Center for Simulation of Advanced Rockets at the University of Illinois at Urbana-Champaign produces a couple of terabytes of data.

Future scientific applications will be even more data-intensive, owing to improved observation-gathering technology and to larger and more complex simulation codes. To extract information from such large amounts of data, scientists need to be able to pull out arbitrary subsets of these data sets based on various constraints. This task requires efficient data management techniques, including a good indexing scheme.