Modifying VFL 'flush' function in HDF5

Quincey Koziol
koziol@ncsa.uiuc.edu
May 13, 2002

  1. Document's Audience:

  2. Background Reading:

  3. Motivation:

    Why Modify the VFL 'flush' function?
    During investigations of parallel I/O performance issues, it was discovered that the MPI_File_sync call in the MPI-I/O VFL file driver's 'flush' implementation was slowing the benchmarks down by a large amount. Note the differences between the "Write Open-Close" times (as well as all the read times) in the MPI-I/O and PHDF5 cases in the following tables:

    Type of IO = MPIO
                        Minimum Throughput       Maximum Throughput       Average Throughput
    Raw Write           88.59 MB/s (1.445 s)     93.24 MB/s (1.373 s)     90.46 MB/s (1.415 s)
    Dataset Write       88.59 MB/s (1.445 s)     93.24 MB/s (1.373 s)     90.46 MB/s (1.415 s)
    Write Open-Close    88.35 MB/s (1.449 s)     92.95 MB/s (1.377 s)     90.04 MB/s (1.422 s)
    Raw Read            612.43 MB/s (0.209 s)    666.66 MB/s (0.192 s)    628.93 MB/s (0.204 s)
    Dataset Read        612.39 MB/s (0.209 s)    666.62 MB/s (0.192 s)    628.91 MB/s (0.204 s)
    Read Open-Close     606.01 MB/s (0.211 s)    659.11 MB/s (0.194 s)    622.20 MB/s (0.206 s)

    Type of IO = PHDF5
                        Minimum Throughput       Maximum Throughput       Average Throughput
    Raw Write           82.24 MB/s (1.557 s)     85.81 MB/s (1.492 s)     84.22 MB/s (1.520 s)
    Dataset Write       82.02 MB/s (1.561 s)     85.72 MB/s (1.493 s)     84.07 MB/s (1.522 s)
    Write Open-Close    8.70 MB/s (14.714 s)     10.06 MB/s (12.723 s)    9.44 MB/s (13.565 s)
    Raw Read            77.39 MB/s (1.654 s)     234.39 MB/s (0.546 s)    142.05 MB/s (0.901 s)
    Dataset Read        73.43 MB/s (1.743 s)     232.74 MB/s (0.550 s)    138.83 MB/s (0.922 s)
    Read Open-Close     72.87 MB/s (1.757 s)     229.44 MB/s (0.558 s)    137.23 MB/s (0.933 s)

    Table 1: Before modifying VFL 'flush' function, 128MB file


    Type of IO = MPIO
                        Minimum Throughput       Maximum Throughput       Average Throughput
    Raw Write           87.83 MB/s (1.457 s)     94.97 MB/s (1.348 s)     90.92 MB/s (1.408 s)
    Dataset Write       87.83 MB/s (1.457 s)     94.97 MB/s (1.348 s)     90.91 MB/s (1.408 s)
    Write Open-Close    87.52 MB/s (1.463 s)     93.58 MB/s (1.368 s)     90.44 MB/s (1.415 s)
    Raw Read            578.94 MB/s (0.221 s)    686.00 MB/s (0.187 s)    622.84 MB/s (0.206 s)
    Dataset Read        578.90 MB/s (0.221 s)    685.95 MB/s (0.187 s)    622.79 MB/s (0.206 s)
    Read Open-Close     572.88 MB/s (0.223 s)    677.37 MB/s (0.189 s)    615.97 MB/s (0.208 s)

    Type of IO = PHDF5
                        Minimum Throughput       Maximum Throughput       Average Throughput
    Raw Write           82.64 MB/s (1.549 s)     84.72 MB/s (1.511 s)     83.93 MB/s (1.525 s)
    Dataset Write       82.55 MB/s (1.551 s)     84.63 MB/s (1.512 s)     83.77 MB/s (1.528 s)
    Write Open-Close    80.62 MB/s (1.588 s)     83.93 MB/s (1.525 s)     82.66 MB/s (1.549 s)
    Raw Read            469.99 MB/s (0.272 s)    599.26 MB/s (0.214 s)    524.36 MB/s (0.244 s)
    Dataset Read        463.16 MB/s (0.276 s)    588.44 MB/s (0.218 s)    515.60 MB/s (0.248 s)
    Read Open-Close     451.87 MB/s (0.283 s)    569.75 MB/s (0.225 s)    500.64 MB/s (0.256 s)

    Table 2: After modifying VFL 'flush' function, 128MB file


    Type of IO = MPIO
                        Minimum Throughput       Maximum Throughput       Average Throughput
    Raw Write           80.63 MB/s (6.350 s)     83.41 MB/s (6.138 s)     82.52 MB/s (6.204 s)
    Dataset Write       80.63 MB/s (6.350 s)     83.41 MB/s (6.138 s)     82.52 MB/s (6.204 s)
    Write Open-Close    80.58 MB/s (6.354 s)     83.37 MB/s (6.141 s)     82.36 MB/s (6.217 s)
    Raw Read            617.25 MB/s (0.829 s)    658.48 MB/s (0.778 s)    634.87 MB/s (0.806 s)
    Dataset Read        617.23 MB/s (0.830 s)    658.47 MB/s (0.778 s)    634.86 MB/s (0.806 s)
    Read Open-Close     615.46 MB/s (0.832 s)    656.54 MB/s (0.780 s)    633.09 MB/s (0.809 s)

    Type of IO = PHDF5
                        Minimum Throughput       Maximum Throughput       Average Throughput
    Raw Write           76.89 MB/s (6.659 s)     86.79 MB/s (5.899 s)     82.27 MB/s (6.224 s)
    Dataset Write       76.83 MB/s (6.664 s)     86.76 MB/s (5.901 s)     82.23 MB/s (6.226 s)
    Write Open-Close    6.08 MB/s (84.213 s)     6.31 MB/s (81.087 s)     6.24 MB/s (82.012 s)
    Raw Read            23.12 MB/s (22.144 s)    78.41 MB/s (6.530 s)     31.45 MB/s (16.281 s)
    Dataset Read        23.12 MB/s (22.148 s)    78.04 MB/s (6.561 s)     31.43 MB/s (16.291 s)
    Read Open-Close     23.11 MB/s (22.156 s)    77.72 MB/s (6.588 s)     31.40 MB/s (16.306 s)

    Table 3: Before modifying VFL 'flush' function, 512MB file


    Type of IO = MPIO
                        Minimum Throughput       Maximum Throughput       Average Throughput
    Raw Write           71.18 MB/s (7.193 s)     87.93 MB/s (5.823 s)     83.43 MB/s (6.137 s)
    Dataset Write       71.18 MB/s (7.193 s)     87.93 MB/s (5.823 s)     83.43 MB/s (6.137 s)
    Write Open-Close    71.01 MB/s (7.210 s)     87.86 MB/s (5.827 s)     83.28 MB/s (6.148 s)
    Raw Read            575.98 MB/s (0.889 s)    648.56 MB/s (0.789 s)    618.10 MB/s (0.828 s)
    Dataset Read        575.97 MB/s (0.889 s)    648.55 MB/s (0.789 s)    618.09 MB/s (0.828 s)
    Read Open-Close     574.48 MB/s (0.891 s)    646.71 MB/s (0.792 s)    616.39 MB/s (0.831 s)

    Type of IO = PHDF5
                        Minimum Throughput       Maximum Throughput       Average Throughput
    Raw Write           79.10 MB/s (6.473 s)     81.75 MB/s (6.263 s)     80.44 MB/s (6.365 s)
    Dataset Write       78.40 MB/s (6.531 s)     81.73 MB/s (6.264 s)     80.27 MB/s (6.378 s)
    Write Open-Close    75.77 MB/s (6.757 s)     81.57 MB/s (6.277 s)     79.58 MB/s (6.434 s)
    Raw Read            540.19 MB/s (0.948 s)    613.93 MB/s (0.834 s)    585.09 MB/s (0.875 s)
    Dataset Read        537.91 MB/s (0.952 s)    610.90 MB/s (0.838 s)    582.37 MB/s (0.879 s)
    Read Open-Close     533.92 MB/s (0.959 s)    605.22 MB/s (0.846 s)    577.41 MB/s (0.887 s)

    Table 4: After modifying VFL 'flush' function, 512MB file

    All tests were performed on NCSA's SGI O2K (modi4) with 8 processors and 5 iterations of each test. Tables 1 & 2 used the following parameters: Transfer Buffer Size: 512KB, File size: 128MB, # of files: 1, # of dsets: 1, # of elmts per dset: 33554432 (i.e. pio_perf -m -H -i 5 -p 8 -P 8 -x 512K -X 512K -f 128M). Tables 3 & 4 used the following parameters: Transfer Buffer Size: 1MB, File size: 512MB, # of files: 1, # of dsets: 1, # of elmts per dset: 134217728 (i.e. pio_perf -m -H -i 5 -p 8 -P 8 -x 1M -X 1M -f 512M).

    Comparing Table 1 with Table 2 and Table 3 with Table 4, the "Write Open-Close" time is significantly improved with the change proposed in this document. Additionally, all the PHDF5 read times are improved as well.

    Note: Because modi4 does not have a true parallel filesystem, these investigations are being performed on the ASCI 'blue' machine as well and further results on that machine will be reported.

  4. Feature's Primary Users:

    Current & Future PHDF5 Applications
    This feature affects all applications that use the MPI-I/O file driver in HDF5.
  5. Design Goals & Requirements:

  6. Proposed Changes and Additions to Library Behavior:

    The proposed change to the VFL 'flush' function is to add a parameter indicating that the file will be closed immediately after this 'flush' call. This allows VFL drivers whose 'flush' and 'close' functions duplicate certain actions to omit those actions from the 'flush' function. In the case of the MPI-I/O VFL driver, the 'flush' function can then avoid calling MPI_File_sync, whose actions are duplicated by the MPI_File_close call in the 'close' function. Other VFL drivers may benefit from similar optimizations.

  7. VFL 'flush' Change Details:

    The current VFL 'flush' APIs have parameter lists (see the VFL background documents referenced at the top of this document for more details) in which only the VFL file driver information (H5FD_t *) is passed in:

        herr_t H5FDflush(H5FD_t *file);
    
        herr_t (*flush)(H5FD_t *file);
    
    The revised form of these functions would have these parameter lists:
        herr_t H5FDflush(H5FD_t *file, hbool_t closing);
    
        herr_t (*flush)(H5FD_t *file, hbool_t closing);
    
    The 'closing' parameter would be set by the library to indicate that the file referenced by the 'file' parameter will be closed immediately after this call to the 'flush' function.

  8. Alternate Approaches:

    This document describes adding a parameter to the VFL 'flush' function as a clean and maintainable method of passing information down to the VFL driver about the library's future actions on the file. Because the main target of this improvement is the performance of the MPI-I/O VFL driver, an alternate approach could be taken in the library that does not change the VFL 'flush' API, at the expense of thread-safety and maintainability.

    The alternate approach would be to add code to the internal flush function in the library which would only be compiled if the library was built to support parallel I/O. Additionally, the code would only be invoked if a file was using the MPI-I/O VFL driver. This code would make a custom call directly into the MPI-I/O driver to indicate that the file was being closed, so that the 'flush' function could avoid work that the 'close' function duplicates later. This method is not thread-safe, but thread-safety is not currently a requirement when operating on files with the MPI-I/O VFL driver, so there should be no negative impact from this aspect of the change. This method does carry a higher maintenance cost in the library, due to the diligence required when adding or removing internal flush calls: failing to flag a flush before a close would reintroduce these same performance problems, while flagging a flush without a close could leave the file out of sync at a point when an application had assumed it to be in sync.

    However, because of the large performance benefits of this change and our customers' strong desire for performance improvements in the v1.4 release branch, it is recommended that this somewhat unsafe, higher-maintenance code be added to the v1.4 branch. Adding the 'closing' parameter in the v1.5 branch, where API changes are better tolerated, is the optimal long-term course of action.

  9. Forward/Backward Compatibility Repercussions:

    Backward compatibility is the ability for applications using the HDF5 library to compile and link with future versions of the library. Forward compatibility is the ability for applications using the HDF5 library to compile and link with previous versions of the library.

    Forward compatibility has not been supported in the library APIs, and that issue is not addressed here. The change proposed above (in the Alternate Approaches section) has no backward compatibility issues and is proposed for the v1.4 branch of the library. Changing the VFL 'flush' API will be a backward compatibility break in the v1.5 branch of the library. However, the impact is limited to developers who are writing VFL drivers for the library and has no effect on applications that only use the VFL drivers shipped with the HDF5 library.

  10. File Format Changes:

    None.