Because contiguous datasets have all their data elements stored in a single block within the HDF5 file, the main task for performing collective I/O operations is to create an MPI datatype that corresponds to an HDF5 dataspace selection. Then, the MPI datatypes describing the elements selected in the file and in the application’s memory buffer are passed down to the MPI-IO VFD where the read or write callback is invoked.
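For orientation, the way those two datatypes are eventually consumed can be sketched with plain MPI-IO calls. This is a simplified illustration of the idea, not the MPI-IO VFD's actual code; the function and variable names (collective_write_sketch, mpi_off, mem_type, file_type) are invented for the example:

```c
#include <mpi.h>

/*
 * Simplified sketch: given an MPI datatype describing the selection in the
 * file (file_type) and one describing the selection in the application's
 * memory buffer (mem_type), a collective write amounts to setting the file
 * view and issuing a collective write call.  Illustration only; not the
 * actual VFD implementation.
 */
static int
collective_write_sketch(MPI_File fh, MPI_Offset mpi_off, const void *buf,
                        MPI_Datatype mem_type, MPI_Datatype file_type)
{
    MPI_Status status;

    /* View the file through the derived file datatype, starting at the
     * dataset's byte offset within the file. */
    MPI_File_set_view(fh, mpi_off, MPI_BYTE, file_type, "native",
                      MPI_INFO_NULL);

    /* Every rank participates; each one writes only the elements its own
     * selection describes, gathering them from buf via mem_type. */
    return MPI_File_write_at_all(fh, 0, buf, 1, mem_type, &status);
}
```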
There are four types of HDF5 dataspace selections (a short code sketch showing how each is created follows this list):
- All - Selects all the elements in the dataspace extent
- Hyperslab:
  - Regular - A set of regularly-sized, regularly-spaced blocks of elements in the dataspace extent. Regular selections can be described with a single set of <start>, <stride>, <block> and <count> parameters to H5Sselect_hyperslab().
  - Irregular - A set of elements in a dataspace that can be any size and any shape within the dataspace extent. Irregular selections are produced by multiple calls to H5Sselect_hyperslab().
- Point - A small set of elements in the dataspace extent, visited in a specified order.
- None - No elements in the dataspace extent are selected.
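The sketch below shows how an application produces each kind of selection; the dataspace dimensions, offsets, and coordinates are arbitrary values chosen for illustration:

```c
#include <hdf5.h>

/* Illustrative only: a 10x10 dataspace and one example of each selection
 * type (each H5S_SELECT_SET call replaces the previous selection). */
static void
make_example_selections(void)
{
    hsize_t dims[2] = {10, 10};
    hid_t   space   = H5Screate_simple(2, dims, NULL);

    /* All: every element in the extent */
    H5Sselect_all(space);

    /* Regular hyperslab: a single H5Sselect_hyperslab() call */
    hsize_t start[2]  = {1, 1};
    hsize_t stride[2] = {2, 2};
    hsize_t count[2]  = {4, 4};
    hsize_t block[2]  = {1, 1};
    H5Sselect_hyperslab(space, H5S_SELECT_SET, start, stride, count, block);

    /* Irregular hyperslab: OR a second, differently shaped region into the
     * first; the result can no longer be described by one
     * start/stride/count/block set */
    hsize_t start2[2] = {0, 6};
    hsize_t count2[2] = {3, 3};
    H5Sselect_hyperslab(space, H5S_SELECT_OR, start2, NULL, count2, NULL);

    /* Point: an explicit list of element coordinates, visited in order */
    hsize_t points[3][2] = {{0, 0}, {5, 7}, {9, 9}};
    H5Sselect_elements(space, H5S_SELECT_SET, 3, &points[0][0]);

    /* None: no elements selected */
    H5Sselect_none(space);

    H5Sclose(space);
}
```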
Each type of selection generates a different MPI datatype:
- All (H5S_mpio_all_type() in src/H5Smpio.c)
  - The base MPI datatype is set to MPI_BYTE and the MPI count is set to <HDF5 datatype size> * <# of elements in HDF5 dataspace extent>.
- Hyperslab
  - Regular (H5S_mpio_hyper_type() in src/H5Smpio.c; a simplified code sketch follows this list)
    - <build a structure which contains the <start>, <stride>, <block>, <count> and dataspace <extent> in each dimension>
    - <compute the <offset> in each dimension (in bytes), which is the number of bytes to advance to the next element in this dimension, holding all the other dimension offsets constant>
    - <compute the <max_extent> in each dimension (in elements), which is the total number of elements for “all the dimensions lower than this one”>
    - <use MPI_Type_contiguous() to create an <inner type> that is the same size as the HDF5 datatype>
    - <construct the final MPI datatype by working from the fastest changing dimension to the slowest; for each dimension:>
      - <use MPI_Type_vector() to create a new MPI datatype, <outer type>, with the MPI count = <count>, MPI blocklength = <block>, MPI stride = <stride>, and MPI base type = <inner type> >
      - <free the old <inner type> with MPI_Type_free()>
      - <retrieve the <extent_len> of the <outer type> with MPI_Type_extent()>
      - <create a <displacement> array for the particular dimension>
        - <displacement[1] = <byte offset of the start of the selection in this dimension> >
        - <displacement[2] = <byte size of the <max_extent> in this dimension> >
      - <if displacement[1] > 0 or <extent_len> is less than displacement[2]>
        - <use MPI_Type_struct() to construct a “wrapper” around <outer_type> that describes the portion of the dimension’s extent that is selected>
      - <move the <outer_type> to <inner_type> for the next pass through the loop>
    - <Commit the derived MPI datatype and use it as the base MPI datatype, with the MPI count set to 1>
  - Irregular (H5S_mpio_span_hyper_type() in src/H5Smpio.c; a simplified code sketch follows this list)
    - <recursively generate the MPI datatype for the selection> (using H5S_obtain_datatype() in src/H5Smpio.c):
      - <if operating on a span in the fastest changing dimension>
        - <Use MPI_Type_contiguous() to build a <base_type> that is composed of an MPI type of MPI_BYTE and a count of the size of the HDF5 datatype>
        - <Commit the <base_type> >
        - <Generate arrays that describe the displacement and block length of each span of elements selected in this span of this dimension>
        - <Use MPI_Type_hindexed() with the displacement & block length arrays and the <base_type> to create a <span_type> that describes all the bytes selected in this span of this dimension>
      - <else> [operating on a span in a dimension that is slower than the fastest changing dimension]
        - <for each span in this dimension>
          - <recursively generate a <temp_type> MPI datatype for the selection in this span in the next dimension down>
          - <use MPI_Type_commit() to commit the <temp_type> retrieved>
          - <use MPI_Type_hvector() to create a new <tempinner_type> that uses the number of elements in the current span as the count, 1 for the block length, the stride set to the total number of elements in the dataspace extent for the dimensions lower than this one, and the <temp_type> as the base MPI datatype>
          - <use MPI_Type_commit() to commit the <tempinner_type> created>
          - <use MPI_Type_free() to release the <temp_type> >
          - <save the <tempinner_type> created for this span in an array, <inner_type> >
        - <use MPI_Type_struct() to construct a <span_type> for this dimension, with the count set to the number of spans in this dimension, a block length array set to all 1’s, a displacement array with the byte offset of each span, and the array of <inner_type> datatypes created in the above loop>
        - <use MPI_Type_free() to release all <inner_type> datatypes created>
    - <Commit the MPI datatype generated and use it as the base MPI datatype, with the MPI count set to 1>
- Point - Collective I/O on point selections is not currently supported.
- None (H5S_mpio_none_type() in src/H5Smpio.c)
  - The base MPI datatype is set to MPI_BYTE and the MPI count is set to 0.
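The two hyperslab constructions are the interesting cases; the all and none selections need no construction at all, since they reduce to MPI_BYTE with a count of either every selected byte or zero. The sketches below restate the two hyperslab algorithms in code. Both are simplified illustrations rather than the library's actual implementation: the function and variable names are invented, error handling is omitted, and modern MPI calls (MPI_Type_create_struct(), MPI_Type_create_resized(), MPI_Type_create_hindexed(), MPI_Type_create_hvector()) stand in for the older MPI-1 routines (MPI_Type_struct(), MPI_Type_extent(), MPI_Type_hindexed(), MPI_Type_hvector()) named in the steps above.

For a regular hyperslab, the nested-vector construction looks roughly like this, assuming a row-major (C-ordered) dataspace with per-dimension <start>, <stride>, <count> and <block> values given in elements:

```c
#include <mpi.h>
#include <hdf5.h>

/* Simplified sketch of the vector-based construction for a regular
 * hyperslab on a row-major dataspace.  Not the library's actual code. */
static MPI_Datatype
regular_hyperslab_type(int ndims, const hsize_t *start, const hsize_t *stride,
                       const hsize_t *count, const hsize_t *block,
                       const hsize_t *dims, size_t elmt_size)
{
    MPI_Datatype inner_type, outer_type;

    /* <inner type>: one element's worth of bytes */
    MPI_Type_contiguous((int)elmt_size, MPI_BYTE, &inner_type);

    /* bytes covered by one element of the dimension being processed
     * (the "max_extent" of all the dimensions below it, in bytes) */
    MPI_Aint dim_bytes = (MPI_Aint)elmt_size;

    /* work from the fastest changing dimension to the slowest */
    for (int d = ndims - 1; d >= 0; d--) {
        MPI_Aint lb, extent_len, disp[2];

        /* <count> strided runs of <block> contiguous inner types */
        MPI_Type_vector((int)count[d], (int)block[d], (int)stride[d],
                        inner_type, &outer_type);
        MPI_Type_free(&inner_type);
        MPI_Type_get_extent(outer_type, &lb, &extent_len);

        disp[0] = (MPI_Aint)start[d] * dim_bytes; /* selection's byte offset */
        disp[1] = (MPI_Aint)dims[d] * dim_bytes;  /* full extent of this dim */

        /* wrap the vector so it sits at the right offset inside the
         * dimension and spans the dimension's full extent */
        if (disp[0] > 0 || extent_len < disp[1]) {
            MPI_Datatype shifted, resized;
            int blocklen = 1;

            MPI_Type_create_struct(1, &blocklen, &disp[0], &outer_type,
                                   &shifted);
            MPI_Type_create_resized(shifted, 0, disp[1], &resized);
            MPI_Type_free(&shifted);
            MPI_Type_free(&outer_type);
            outer_type = resized;
        }

        dim_bytes *= (MPI_Aint)dims[d];
        inner_type = outer_type;    /* base type for the next-slower dim */
    }

    MPI_Type_commit(&inner_type);   /* use with an MPI count of 1 */
    return inner_type;
}
```

For an irregular hyperslab, the recursion can be sketched against a hypothetical span-tree node type standing in for the library's internal representation of the selection:

```c
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

/* Hypothetical span-tree node: each node covers elements [low, high] of one
 * dimension, `next` chains the spans within that dimension, and `down`
 * points to the spans selected beneath it in the next dimension down. */
typedef struct span {
    hsize_t      low, high;
    struct span *next;
    struct span *down;
} span_t;

/* Recursively build an MPI datatype for the spans of one dimension.
 * Simplified: no error checking, and counts are assumed to fit in int. */
static MPI_Datatype
span_tree_type(const span_t *spans, int dim, int ndims,
               const hsize_t *dims, size_t elmt_size)
{
    MPI_Datatype span_type;
    int          nspans = 0;

    for (const span_t *s = spans; s != NULL; s = s->next)
        nspans++;

    if (dim == ndims - 1) {
        /* fastest changing dimension: one hindexed block per span,
         * in units of a <base_type> of elmt_size bytes */
        MPI_Datatype base_type;
        int         *blocklen = malloc((size_t)nspans * sizeof(int));
        MPI_Aint    *disp     = malloc((size_t)nspans * sizeof(MPI_Aint));
        int          i        = 0;

        MPI_Type_contiguous((int)elmt_size, MPI_BYTE, &base_type);
        MPI_Type_commit(&base_type);
        for (const span_t *s = spans; s != NULL; s = s->next, i++) {
            disp[i]     = (MPI_Aint)(s->low * elmt_size);
            blocklen[i] = (int)(s->high - s->low + 1);
        }
        MPI_Type_create_hindexed(nspans, blocklen, disp, base_type,
                                 &span_type);
        MPI_Type_free(&base_type);
        free(blocklen);
        free(disp);
    }
    else {
        /* slower dimension: build a datatype for each span's subtree, repeat
         * it once per element of the span, then glue the spans together with
         * a struct at their byte offsets */
        MPI_Datatype *inner     = malloc((size_t)nspans * sizeof(MPI_Datatype));
        int          *blocklen  = malloc((size_t)nspans * sizeof(int));
        MPI_Aint     *disp      = malloc((size_t)nspans * sizeof(MPI_Aint));
        MPI_Aint      row_bytes = (MPI_Aint)elmt_size;
        int           i         = 0;

        /* bytes covered by one element of this dimension */
        for (int d = dim + 1; d < ndims; d++)
            row_bytes *= (MPI_Aint)dims[d];

        for (const span_t *s = spans; s != NULL; s = s->next, i++) {
            MPI_Datatype temp_type =
                span_tree_type(s->down, dim + 1, ndims, dims, elmt_size);

            MPI_Type_commit(&temp_type);
            MPI_Type_create_hvector((int)(s->high - s->low + 1), 1, row_bytes,
                                    temp_type, &inner[i]);
            MPI_Type_commit(&inner[i]);
            MPI_Type_free(&temp_type);

            disp[i]     = (MPI_Aint)s->low * row_bytes;
            blocklen[i] = 1;
        }
        MPI_Type_create_struct(nspans, blocklen, disp, inner, &span_type);
        for (i = 0; i < nspans; i++)
            MPI_Type_free(&inner[i]);
        free(inner);
        free(blocklen);
        free(disp);
    }

    return span_type;
}
```

The datatype returned for dimension 0 is then committed and used as the base MPI datatype with an MPI count of 1, matching the final step in the list above.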