Revising Links in HDF5 Groups

Quincey Koziol
koziol@ncsa.uiuc.edu
May 6, 2004

  1. Document's Audience:

  2. Background Reading:

    The HDF5 file format description:
    B-Tree Nodes
    Group Nodes
    Group Node Entries
    Local Heaps
  3. Introduction:

    What is this document about?
    This document describes some enhancements to the information stored for each link in an HDF5 group.

    How do links in HDF5 groups currently operate?

    Links in HDF5 groups are the method that indicates an object's inclusion in a group. Objects in HDF5 files don't have names themselves (due to being able to be included in multiple groups with different names), so the link to the object has a name for that particular reference to the object.

    Fundamentally, links store certain required information, such as the name of the link and either the object header address in the file (for hard links) or the target name (for soft links). Additionally, hard links can cache certain additional information about the target object (such as it's B-tree root and local heap addresses, for group objects) in an attempt to speed up operations on that object.

    Currently, the information for each link in a group is split into two pieces. A fixed size portion is stored in the group node entry, and the name is stored in the local heap for the group.

    Links in a group are indexed by a B-Tree, where the B-tree keys are the names of the objects (using the offset of the name in the local heap) and leaf nodes of the B-tree point to group nodes, which contain lists of group node entries.


    What are limitations of the current design?

    Over time, we've learned more about how HDF5 groups are used and have determined a few limitations in the current design. One difficulty is the way the B-tree and local heap addresses are cached for groups. Because the empty B-tree and local heap are allocated for each new group (in order to put them into the cached information for the new link to that group), if the group remains empty, the space used by the B-tree and local heap is wasted. It would be better to allocate the B-tree and local heap on demand, when links are created in an empty group.

    Additionally, the information about each link is split into two locations (the group node entry and the local heap) in a way which wastes space for linking to common objects, like datasets. The information about each link also includes "cached" information about the object linked to, which breaks the modularity of the design.

    Also, each time the B-tree is traversed, the local heap must be mapped into memory, in order to compare the names of the links. It would be possible to avoid this penalty by creating a hash-value of each name to store in the B-tree keys and avoid touching the local heap at all during B-tree traversals.


    How are object references and group links related?

    Object references in datasets point to other objects in an HDF5 file. Object references are stored in datasets as an address of the object referenced. They are similar in this respect to the way hard links are stored (hard links also directly store the address of the object linked to). However creating an object reference to an object doesn't increment the reference count of the target object, which allows an object reference to "dangle" like a soft link.


  4. New Requirements:

    There are several new requirements listed here, in order of decreasing importance:

  5. Additional Suggested Changes:

    Since we are planning to revise the format of the links, there are several other changes that would be beneficial to make at this time. Suggested changes are listed here: