Research |
Improved support for variable length datatypes. |
Variable length datatypes are very common in applications of HDF5, and their frequent use has turned up some shortcomings. Here is an explanation.
Data of variable length datatypes is currently stored in a "heap" in HDF5. The actual data is pointed to by a heap reference. This means that access to a variable length data element requires two accesses: one to read the heap reference and one to read the data from the heap. It also means that extra storage is used by the heap reference. Furthermore, data heaps currently cannot be compressed, so datasets with variable length data do not compress well. A number of improvements are possible. The heap reference size could be made smaller. Heap compression could be implemented. Alternate data structures could be put in place to improve access to variable length types. These improvements would be very valuable to a number of types of applications. Other notes on this:
- Use fractal heap code introduced in HDF5 1.8 to store VL raw data (VL and region reference datatypes).
- Allow users to control how large the fractal heap IDs are, so that "small enough" raw data can be encoded directly in the heap ID ("tiny" object storage, in fractal heap parlance), avoiding the overhead of the ID and the extra access into the heap. (This solution has been implemented in an ad hoc way, and has proved very effective.)
- "Fold" nested VL types into a single blob when storing them in the file, to make the size smaller and I/O much faster.
- Use single heap per dataset/chunk.
- Enable compression of heap blocks (already implemented, but not available to applications).
|
n/a |
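The trade-off behind the "tiny" object idea above can be illustrated with a toy model. This is only a sketch under stated assumptions, not HDF5 code and not the fractal heap's actual layout: `ToyHeap`, `store`, `load`, `heap_reads_for`, and the `inline_limit` parameter are all hypothetical names invented for illustration. It counts how many extra heap accesses are needed to read a set of variable length values when values at or below a size limit are kept inline in the element itself.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHeap:
    """Hypothetical stand-in for a heap: maps integer IDs to byte blobs."""
    blobs: dict = field(default_factory=dict)
    reads: int = 0  # number of heap accesses performed so far

    def insert(self, data: bytes) -> int:
        heap_id = len(self.blobs)
        self.blobs[heap_id] = data
        return heap_id

    def read(self, heap_id: int) -> bytes:
        self.reads += 1
        return self.blobs[heap_id]


def store(values, heap, inline_limit):
    """Build dataset elements: small values inline, larger ones in the heap."""
    elements = []
    for value in values:
        if len(value) <= inline_limit:
            # Assumed "tiny" case: the bytes live in the element itself.
            elements.append(("inline", value))
        else:
            # Otherwise the element only holds a reference into the heap.
            elements.append(("heap", heap.insert(value)))
    return elements


def load(elements, heap):
    """Read every value back; heap-stored values cost one extra access each."""
    return [
        payload if kind == "inline" else heap.read(payload)
        for kind, payload in elements
    ]


def heap_reads_for(values, inline_limit):
    """Return how many heap accesses a full read needs for a given limit."""
    heap = ToyHeap()
    elements = store(values, heap, inline_limit)
    assert load(elements, heap) == list(values)  # round trip is lossless
    return heap.reads


if __name__ == "__main__":
    sample = [b"ab", b"hello world", b"xyz", b"a" * 100]
    for limit in (0, 8, 1000):
        print(limit, heap_reads_for(sample, limit))
```

In this model, raising `inline_limit` reduces the number of second accesses at the cost of larger fixed-size elements, which is the balance a user-controlled heap ID size would let applications tune.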