============================

  1. Why these issues need to be faced now


  2. Although we have been aware for some time of the differences between HDF4 and HDF5 in how storage space is allocated in a file and how fill-values are treated, this has not been a pressing problem that needed to be dealt with. Unfortunately, there is a bug causing memory for variable-length (VL) data to be leaked in the file when data elements are overwritten, and that bug is tied to these storage and fill-value issues.

    Currently, when VL data elements are overwritten in a dataset, the space occupied by the previous piece of VL data is not released to the file for reuse; it is leaked. Because the previous value of the VL data would need to be read from the dataset in order to be properly released, this ties in with the fill-values stored in the file. (In the current library design, the dataset stores a heap ID giving the location of the VL data, not the VL data itself, and a heap ID of all zeros indicates that there is no VL data for a particular location. So currently, the only valid fill-value for VL data is an all-zero value, indicating that no VL data has been stored in the heap.)

    If fill-values are not written to the file, there is the potential for junk to be read back from the file as the VL data to be released, causing errors. Currently, when no fill-value is set for a dataset, the library relies on the filesystem to zero-fill blocks allocated to the file. We have already seen this assumption break down under Win9x, where the OS does not zero-fill file blocks and users report "junk" in datasets which have been created, but not written to.

    So, VL data requires valid fill-values to be present in the file in order to be certain that reading the VL data about to be overwritten yields correct information: either to free the previous VL data (for non-NULL valued VL data) or to skip freeing it (for NULL valued VL data). Junk (or the potential for junk) in the data read from the file opens the possibility of corrupting the file if that junk is used to try to free the previous VL data.

  3. How are things handled currently in HDF4 vs. HDF5?
    1. HDF4

    2. These issues are specific to how the SD*() API functions operate in the latest version of HDF4; other portions of the HDF4 library may operate in different ways. Only the normal (i.e. "contiguous") and chunked storage methods are discussed; other storage methods are treated as normal storage in HDF4.
      1. Dataset Storage Allocation
        Allocating space to store a dataset is deferred until the space is needed. Space is only needed when non-fill-value data is written to a dataset. This allows for very large datasets to be defined, and if they are not written to, the file size can stay very small. This applies to both contiguous and chunked data.

      2. Fill-values
        1. Metadata
          Metadata documenting the fill-value is always written to a file. Either the default fill-value (of zero) or the user's fill value is written as an attribute of the dataset.
        2. Writing
          Fill-values are only written to the dataset or chunk when the entire dataset or chunk is not going to be written in a single I/O request. For example, in a contiguously stored dataset, if a hyperslab in the middle of the dataset is the first piece of data written by the user, fill-values are written to the dataset and then the user's data is written in the hyperslab location. However, if the entire dataset is going to be written in one write call, the fill-value writing step is skipped, since the fill-values would all be immediately overwritten with actual data. Note: fill-value writing in HDF4 can be turned off completely by a user who either "knows" that they will write the entire dataset in successive calls, or who does not care about data outside the region(s) they are writing to in the dataset.
        3. Reading
          If storage for the dataset or chunk is not allocated yet, the fill value is used to fill the buffer to return to the application and the file data is not read.

    3. HDF5


    4. These issues apply to all datasets in HDF5. Only the contiguous and chunked storage methods are discussed; other storage methods are treated as contiguous storage in HDF5.
      1. Dataset Storage Allocation

        Space for contiguously stored data is always allocated when the dataset is created. Space for chunked data is allocated as needed, when data is written to the portion of the dataset that a chunk occupies. (The exception is parallel I/O, where all the chunks for a dataset are also allocated at creation time.)
      2. Fill-values
        1. Metadata

          Metadata documenting the fill-value for a dataset is only written out if the user explicitly set a fill-value for the dataset during creation. Although an implicit fill-value of zero is assumed for the dataset, this is neither enforced nor recorded.
        2. Writing

          Fill-values are written to contiguously stored data only when the dataset is created (and only if the user has set a fill-value). This occurs regardless of whether the fill-values will be overwritten by future writes to the dataset. Fill-values for chunked data are somewhat more controlled: they are written only when data is actually written to a particular chunk. (The library may also be smart enough to notice when an entire chunk is being written and skip writing fill-values in that case; this has not been investigated.)
        3. Reading

          Fill-values are only used for chunked datasets, when an unallocated chunk is read. Because contiguously stored data always has space allocated in the file, the library assumes there is always valid data to read for contiguous datasets.

  4. Suggestions for improving HDF5's behavior


  5. We can provide the user with three properties to control the library's behavior: when to allocate space, when to write the fill value, and what fill value to write. Each property can take the following values (as used in the tables below):

      When to allocate space:    early (at dataset creation) or late (deferred until data is written)
      When to write fill value:  never, or at allocation time
      What fill value to write:  undefined, default, or user-defined


    Using these three properties, the library's fill-value writing behavior during the dataset create-write-close cycle is listed in the table below.

    When to allocate space | When to write fill value | What fill value to write | Library create-write-close behavior
    -----------------------+--------------------------+--------------------------+------------------------------------
    early                  | never                    | -----                    | Library allocates space when the dataset is created, but never writes the fill value to the dataset.
    late                   | never                    | -----                    | Library allocates space when the dataset is written to, but never writes the fill value to the dataset.
    -----                  | allocation               | undefined                | Error on creating the dataset; dataset is not created.
    early                  | allocation               | default or user-defined  | Allocate space when the dataset is created. Write the fill value (default or user-defined) to the entire dataset.
    late                   | allocation               | default or user-defined  | Do not allocate space until the user's data values are written to the dataset. Write the fill value to the entire dataset before writing the user's data values.

    ("-----" stands for any value.)

    During an H5Dread function call, the library's behavior depends on whether space has been allocated, whether the fill value has been written to storage, how the fill value is defined, and when the fill value is to be written.

    Is space allocated? | What is the fill value? | When to write fill value? | H5Dread behavior
    --------------------+-------------------------+---------------------------+-----------------
    No                  | undefined               | -----                     | Error. The dataset does not exist: no data has been written and no fill value is defined.
    No                  | default or user-defined | -----                     | Fill the user's buffer with the fill value.
    Yes                 | undefined               | -----                     | Return data from storage (the dataset); trash is possible.
    Yes                 | default or user-defined | never                     | Return data from storage (the dataset); trash is possible.
    Yes                 | default or user-defined | allocation                | Return data from storage (the dataset).

    ("-----" stands for any value.)


QAK:1/9/02