Performing I/O on dataset elements is one of the more complex aspects of the HDF5 library (as you might expect from an I/O library) and has multiple components and layers, which are only partially documented here. Each storage method for a dataset (contiguous, chunked, etc.) is handled in a somewhat different manner, with some shared components or layers, so the common aspects are described in one section and the special aspects of each storage method are documented in separate sections.
When an MPI application performs I/O on dataset elements, the default action is to perform independent I/O operations from each process. When independent I/O is performed by an MPI application, the underlying MPI-IO or MPI-POSIX VFD is eventually used to perform the read or write operation, but until that point the I/O operation is handled identically to I/O for non-MPI applications. Since no special actions are taken to use the MPI interface until the appropriate VFD is reached in this case, independent I/O is not discussed further here, except to note that under certain circumstances (described in context in other locations in this document set) an application may request a collective I/O operation, but the HDF5 library "breaks" the collective access into independent access in order to perform the I/O operation.
When an MPI application has opened a file using the MPI-IO VFD, collective I/O operations on dataset elements may be performed. The application must collectively create a dataset transfer property list (DXPL), call H5Pset_dxpl_mpio on that DXPL to set the I/O transfer mode to H5FD_MPIO_COLLECTIVE, and then use that DXPL in calls to H5Dread or H5Dwrite in order to invoke collective I/O when accessing dataset elements. Collective read and write I/O operations on dataset elements are generally handled identically by the HDF5 library except for the eventual read or write call to the VFD layer and so are treated identically by this document set, unless noted in context.
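For concreteness, the following is a minimal sketch of the application-side calls described above, assuming a parallel (MPI-enabled) HDF5 build; the file name "file.h5", the dataset name "dset", and the 1-D decomposition are hypothetical, and error checking is omitted:

```c
#include <mpi.h>
#include <hdf5.h>

/* Minimal sketch of requesting collective dataset I/O.  Assumes a parallel
 * HDF5 build; "file.h5", "dset", and the 1-D decomposition are placeholders
 * and error checking is omitted. */
void collective_write(MPI_Comm comm, const int *buf, hsize_t count)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Open the file with the MPI-IO VFD */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fopen("file.h5", H5F_ACC_RDWR, fapl);
    hid_t dset = H5Dopen2(file, "dset", H5P_DEFAULT);

    /* Each process selects a disjoint hyperslab of the file dataspace */
    hid_t filespace = H5Dget_space(dset);
    hsize_t start = (hsize_t)rank * count;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t memspace = H5Screate_simple(1, &count, NULL);

    /* DXPL requesting collective transfer, passed to H5Dwrite() */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
}
```

Whether the transfer is actually performed collectively still depends on the run-time checks described below; if any of them fail, the library falls back to independent I/O.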
The actions for performing collective I/O are identical for all dataset storage types; the correct routines for each storage method are selected by setting function pointers in the "I/O information" for the I/O operation:
- <set up the I/O information> (H5D_ioinfo_init() in src/H5Dio.c)
- <adjust the I/O information for parallel I/O> (H5D_ioinfo_adjust() in src/H5Dio.c)
- <check if collective I/O is possible from the application's buffer directly to the file> (H5D_mpio_opt_possible() in src/H5Dmpio.c)
- The following conditions are evaluated by each process and the <local opinion> is set to false if any fail:
- The application must have chosen collective I/O in the DXPL used for this I/O operation
- There must not be any HDF5 datatype conversions necessary
- There must not be any HDF5 "data transforms" (set with H5Pset_data_transform) to invoke during the I/O operation
- The HDF5_MPI_OPT_TYPES environment variable must either be not set, or set to '1' (one)
- The MPI-IO VFD must be used (not the MPI-POSIX VFD)
- A simple or scalar dataspace must be used for both the memory and file dataspaces
- A point selection must not be used for either the memory or file dataspaces
- The dataset's storage method must be either contiguous or chunked
- [A configuration-dependent check determines whether the MPI implementation can handle "complex" derived MPI datatypes; if it can't, the selection for both the memory and file dataspaces must be "regular" (i.e. describable with a single call to H5Sselect_hyperslab())]
- If the dataset is chunked:
- There must be no I/O filters (like compression, etc) used
- [A configuration-dependent check determines whether the MPI implementation can perform collective I/O operations correctly when not all processes actually perform any I/O (i.e. when the buffer count parameter to MPI_File_[read|write]_at_all is set to 0); if it can't, elements must be accessed in at least one chunk]
- MPI_Allreduce() is called with the "logical and" (MPI_LAND) operator on the <local opinion> of each process to create the <consensus opinion> of all the processes, which is returned from this routine (a simplified sketch of this consensus-and-dispatch pattern follows this list).
- If the <consensus opinion> evaluates to FALSE, collective I/O is "broken" and the I/O proceeds as an independent I/O operation on all processes.
- If the <consensus opinion> evaluates to TRUE, collective I/O is possible and the function pointers for performing "multiple" I/O operations (i.e. I/O on vectors of <offset/length/buffer> tuples) are set to the appropriate parallel I/O operation callbacks for the dataset's storage type. The function pointers for "single" I/O operations are set to H5D_mpio_select_read or H5D_mpio_select_write (in src/H5Dmpio.c).
- For contiguous datasets, the "multiple" I/O operation callbacks are: H5D_contig_collective_read and H5D_contig_collective_write (in src/H5Dmpio.c)
- For chunked datasets, the "multiple" I/O operation callbacks are: H5D_chunk_collective_read and H5D_chunk_collective_write (in src/H5Dmpio.c)
- <invoke the "multiple" I/O operation callback to perform the I/O operation (set earlier)>
- <tear down the I/O information>
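The consensus-and-dispatch pattern described above can be summarized with the following simplified sketch. All type and function names here (io_info_t, local_io_possible(), adjust_for_collective_io(), and the callback stand-ins) are invented for illustration and do not correspond to the actual HDF5 source; the real checks and callbacks are the ones listed above.

```c
#include <mpi.h>

/* Hypothetical illustration of the consensus-and-dispatch pattern described
 * above; the names below are invented for this sketch and are not the actual
 * HDF5 internals. */

typedef enum { LAYOUT_CONTIGUOUS, LAYOUT_CHUNKED } layout_t;

typedef struct io_info_t {
    layout_t layout;
    int (*multi_read)(struct io_info_t *);   /* "multiple" I/O callbacks */
    int (*multi_write)(struct io_info_t *);
} io_info_t;

/* Stand-ins for H5D_contig_collective_read/write and
 * H5D_chunk_collective_read/write */
int contig_collective_read(io_info_t *io)  { (void)io; return 0; }
int contig_collective_write(io_info_t *io) { (void)io; return 0; }
int chunk_collective_read(io_info_t *io)   { (void)io; return 0; }
int chunk_collective_write(io_info_t *io)  { (void)io; return 0; }

/* Each process evaluates its local conditions (collective mode requested,
 * no datatype conversion, no data transform, supported layout, ...). */
static int local_io_possible(const io_info_t *io)
{
    /* ... the condition checks from the list above would go here ... */
    return (io->layout == LAYOUT_CONTIGUOUS || io->layout == LAYOUT_CHUNKED);
}

int adjust_for_collective_io(io_info_t *io, MPI_Comm comm)
{
    int local_opinion = local_io_possible(io);
    int consensus     = 0;

    /* Logical-AND reduction: collective I/O is used only if *every*
     * process agrees that it is possible. */
    MPI_Allreduce(&local_opinion, &consensus, 1, MPI_INT, MPI_LAND, comm);

    if (!consensus)
        return 0;   /* "break" collective I/O; fall back to independent I/O */

    /* Dispatch the "multiple" I/O callbacks by storage layout */
    if (io->layout == LAYOUT_CONTIGUOUS) {
        io->multi_read  = contig_collective_read;
        io->multi_write = contig_collective_write;
    } else {
        io->multi_read  = chunk_collective_read;
        io->multi_write = chunk_collective_write;
    }
    return 1;
}
```

The key design point is the MPI_Allreduce() with MPI_LAND: because every process must reach the same decision, a single process that cannot participate in collective I/O forces all processes to fall back to independent I/O.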
The "multiple" I/O operations for collective I/O on each type of dataset storage method are further described here: