Background: - Explain what groups are used for in HDF5 - Explain about how hard & soft links work - Try to explain how B-tree & symbol tables are implemented currently (make some pictures!) Feature Requests: - External: - Access links in a group according to the order they were created (netCDF4 & CGNS) - "Large" empty groups (Boeing) [Inefficient symbol table format (unneccesary/unused "cached info" field and other issues)] - Checksum file metadata (LANL) - non-ASCII link names (Intel?) - Internal: - Inefficient method for computing # of objects in group [must traverse entire B-tree for group, adding up the number of links in each symbol table] - Inefficient "local heap" storage method for names (heap can get large for atomic I/O operation to bring into metadata cache and heap data must be recopied in file when heap grows) - Entire B-tree must be traversed in order to reach link info, even for "early" exact matches of entries in the tree (inefficient) - References are allowed to dangle, which can cause bugs when objects they point to are deleted - Possibly other kinds of ordering of links in groups Summary of New Use Cases: - Open n'th object in link creation order (or by other order) - Iterate over links in creation order (or by other order) - Detect file corruption by checksumming file metadata - Efficiency concerns with current format - Correct errors with current reference implementation - Store non-ASCII link names New Requirements/Features: - From use cases: - Iterate over links in creation order - Open n'th object in link creation order - Open n'th object in alphabetic order - Track # of objects in group - Support non-ASCII character sets (ie. Unicode) - Checksum group metadata: (file format) - Object header - B-tree nodes - "Local" heap (or whereever link info is stored) - Make empty groups take much less space - Eliminate wasted space in each link - Revise references to be a form of link (or at least not allow them to dangle) - Add version # to B-tree nodes - Revise local heap format to allow multiple non-contigous blocks - Allow "early out" for exact matches during B-tree traversal - Allow indices on other aspects of links in the future - Interesting ideas not directly from use cases: - Add back-pointers to linked objects - Compress group info - Attributes on links - Add redundant copies of each piece of metadata, to possibly recover from metadata corruption "No Format Change" Implementation Option: - Why? - Forwardly compatible with older libraries (sorta) - Probably less work (since they capture less requirements) - Why Not? - Older libraries modifying the file can corrupt "high level" data structures used to implement - "High level" data structures used to implement are in "user space" - Requirements Captured - Probably just these: - Iterate over links in creation order - Open n'th object in link creation order - Open n'th object in alphabetic order - Track # of objects in group - With more contortions, we could capture more, though... - Open Questions/Issues - Should existing API functions update "high level" data structures created (see implementation options, below)? - Yes, probably, since we want existing API functions to keep everything in sync - Doesn't address inefficiencies/drawbacks of current group storage mechanisms. - "High level" data structures can get out of sync if file is modified by older version of the library. - All of the implementation options put their new information in places that are visible to the user and can be modified by them. (This could be good (in rare cases, probably) or bad (most of the time...) Also, the name(s) chosen for the objects to implement this method of implementing features could clash with names a user wanted to use. - Possible Implementation Options: - Attribute on each object linked to from a group - Implemented by storing "creation time" attribute on each object linked to from a group. - Pros: - Deleting a link in a group is handled easily and doesn't require extra maintenance of other data structures - Cons: - This won't work for soft links, since they are allowed to dangle and therefore might not have an object to hang the "creation time" attribute on. - This is a nightmare for objects in multiple groups. We'd have to create multiple "creation time" attributes on the object, each with some way to correlate back to the group the attribute belonged to. - Determining the n'th object in creation order would require accessing all the objects in the group, in order to gather and sort the "creation time" attributes. - Iterating over objects in the group will need to access all the objects in the group, in order to gather and sort the "creation time" attributes. - Attribute on group for each link in it. - Implemented by storing a "creation time" attribute on each group for each of the links in the group. - Pros: - Works for soft links - Works for objects in multiple groups - Cons: - Will need some sort of naming scheme for the attributes that involves the link name in either the name of the attribute or stored in the attribute data, so the creation time stored in the attribute can be correlated with the appropriate link name in the group. - Duplicating link names (in group info and in attribute info) makes storage space inefficiencies for groups worse - Would create potentially _lots_ of attributes on group objects. Since attributes are searched for in a linear fashion, this can make operations on attributes in general on groups very slow. It will also fragment up the object header for the group, making I/O on the it perform poorly. - Finding the appropriate attribute to delete requires a linear access of all the attributes, in order to find the creation time attribute which correlated to the link being deleted. Also, deleting a link from a group requires deleting its creation time attribute from the group, making object header fragmentation issues worse. - Determining the n'th object in creation order would require accessing all the attributes on the group, in order to gather and sort the "creation time" attributes. - Iterating over objects in the group will need to access all the attributes on the group, in order to gather and sort the "creation time" attributes. [Could pack all this information into one variable-length datatype attribute, which has its own set of pros & cons] - Add dataset to each group to maintain "creation time" index - Implemented with a unlimited length 1-D array of variable-length strings (the names of links in the group), stored in the order they are created in the group. - Pros: - If the chunk size is 1 element and we make a way to delete chunks from a dataset, deleting entries from the middle of a the dataset is relatively cheap and easy. (Otherwise, deleting entries from a group is very painful, as all the entries following the deleted one would need to be copied up one position in the dataset) - Works for soft links - Works for objects in multiple groups - Cons: - Duplicating link names (in group info and in dataset info) makes storage space inefficiencies for groups worse - Link names for dataset would be stored in "global" heap, which would compete for space in the metadata cache with the "local" heap info for the group's link name info - "Global" heaps can fragment the file badly in certain circumstances (This is true in general for "global" heap info though, and not limited to their use in this case) - 1 element chunks make for inefficient chunked storage B-trees (If API for deleting elements was implemented instead of API for deleting chunks, this could be mitigated by using larger chunks, but then more data would need to be shuffled around when an element was deleted) - Would need to implement user-level API for deleting chunks (or elements) from [the middle of] datasets. (Could be "pro" if desired by users, though) - Deleting a link from the group would mean a linear traversal of the dataset, to find the matching entry to remove for the link - Current B-tree implementation not fast for finding the n'th entry (since it needs to start at the beginning) (Still, this is faster than the attribute methods outlined earlier) - Add group to each group to maintain "creation time" index - Implemented with an "index" group within each "normal" group, which used a string form of "creation time" as the 'name' of each object in the index group and linked to the same object as the links in the normal group. - Pros: - Works for soft links with absolute path (soft link info for links in index group is same as soft link info for links in normal group) - Don't need new APIs for deleting elements from middle of dataset - Cons: - Storing creation times in text form is inefficient - Creation times as names is stored in another "local" heap which competes for space in the metadata cache with normal group info - Soft links with relative path in normal group would need to be converted to use absolute path name in index group (since a path name of the form "foo/bar" in the normal group would need to be something like "../foo/bar" in the index group (since it's a group hanging off the normal group) and we don't support "upward" traversals in the group hierarchy). This conversion could be difficult or impossible to perform. - Deleting a link by name doesn't have a conclusive way to remove correct link in index group, since the only way to locate the link removed is to perform a linear search of the index group for a link with the same object header address (or soft link name) and if the same object is linked to multiple times in the group (with different link names), there is no way to determine which is the correct entry to remove in the index group. [Could kludge this by including link name after creation time 'name' somehow] - Current B-tree implementation not fast for finding the n'th entry (since it needs to start at the beginning) (Still, this is faster than the attribute methods outlined earlier) [Syncronization overhead for parallel operation, for all these 'user mode' implementation] "Format Change" Implementation Option: - Why? - Should be able to capture all requirements (if desired) - Adding some of these features means making core change to group functionality and if we're updating the format, we might as well sweep as many changes in as possible to avoid many updates to the format. - "Aggressive" format change option (outlined below) moves us toward indexing on datasets. - Why Not? - Not forwardly compatible (ie. older libraries can't read) (could be good though, since they can't mangle the information) - Users might be reluctant to change due to what they percieve as a format change - Lots of work/time to get new features done - Lots of work/time to maintain new format as well as older one in source code (hard to tell if this would be more work/time than the "no format change" implementations) - Requirements Captured - Could be all, but probably will need to just choose highest priority ones. - My lists would be: - Definitely: - Iterate over links in creation order - Open n'th object in link creation order - Open n'th object in alphabetic order - Track # of objects in group - Checksum group metadata: - Object header - B-tree nodes - "Local" heap (or whereever link info is stored) - Make empty groups take much less space - Eliminate wasted space in each link - Revise references to be a form of link (or at least not allow them to dangle) - Revise local heap format to allow multiple non-contigous blocks - Add version # to B-tree nodes - Allow "early out" for exact matches during B-tree traversal - If possible: - Support non-ASCII character sets (ie. Unicode) - Add back-pointers to linked objects - Allow future indices on other aspects of links - Compress group info - Probably not: - Attributes on links - Open Questions/Issues - What other information about a link (besides the link name and its creation time) do we want to capture (and possibly index) for every link? - Are there other types of links besides hard & soft links? (since those two mostly focus on how we get to the linked-to object) - Back-pointer ideas: - Adding back-pointers to objects makes groups sort of "two way", by having links to objects within them and having links back to other objects that point to them. Does this have any interesting implications? (ie. could we traverse the group graph in reverse?) - If we add back-pointers to objects, should they include "normal" object & dataset region references, or just links from groups? - If we add back-pointers to objects, how should we "name" back-pointers? (ie. should they be named, like the links to objects from groups, or should they be unnamed and chosen by specifying the n'th back-pointer?) - Should we allow "old style" groups to be created? Should we allow mixing old & new style groups in the same file? (I can think of ways to stop this, but they aren't very nice to users so we should probably allow this as gracefully as possible) - Possible Implementation Options: - Revise format data structures: (the "Conservative" solution) - Keep general current idea of groups (B-trees on links), but update to enable features desired. - Format Changes: - Revise "Group" message to allow for new information - Revise format of links - Revise format of B-trees - Revise format of "local" heap - Create new format data structures: (the "Aggressive" solution) - Revise "group" to actually be a dataset of vlen link info elements with indices on fields of the link info (like the link name and creation time) - Notes: - It would be nice if the dataset used for storing the group's info didn't have coordinates associated with each element and was just a "bag" of elements (link info objects) that could be added to or subtracted from, and the indices on the dataset were the only way to locate elements. - Is there a useful way to give this sort of feature to users of datasets? Probably, if we implement a new "indexed" storage mechanism for datasets, in addition to contiguous, chunked, etc. We'd also have to think about how this would affect H5Dwrite/H5Dread, since they are so closely tied up with using coordinates for selecting the elements to perform I/O on. Assumption is to use B-trees for storing indices on the elements. This could be important for NCASSR indexing work... - Format Changes: - Revise "Group" message to [actually] be a dataset, with a dataspace(?), datatype, etc. (would use new "indexed storage" mechanism mentioned below) - Add new new storage mechanism for dataset's raw data: indexed storage. - Create new type of heap which would be more efficient for collections of variable-length info. (Breaking the information up into smaller blocks probably, etc.) - Update references so that they act like hard links [how to handle current idea of soft links in groups?] API Changes: - Depend on requirements captured and/or features added - Will probably require adding "group creation" and "group access" property lists, in order to control some of the new features. - Existing API routines may need to be changed or augmented (with new API functions) to handle operations on creation order as well as link name. Key Insights: - What we are really trying to do with groups is create different types of indices on link objects within the group, as if the group were a dataset whose elements were links and we are indexing on fields of each link. - Current datasets are like "bags" of elements, with an implicit "coordinate" field for each element, indexed by the "coordinate" field, and stored according to this implicit index. [Is there a useful way to take advantage of storing elements according to an type of implicit index?] [Packing elements this way makes it very painful to insert and/or delete elements in the "middle" of the index]