Suggestions about improvements to how HDF5 uses the MPI interface.
Attendees:
ANL: Rob Latham, Dave Goodell, Rob Ross, Dries Kimpe
THG: Quincey Koziol, John Mainzer, Neil Fortner
From a visit to Argonne on Oct. 21, 2009
- 'open' VFD callback:
- ORNL suggests staggering the POSIX open() call across groups of processes, but the general consensus of the ANL developers is that it's probably not necessary.
- Should be able to combine the information currently sent with two MPI_Bcast() calls into a single call.
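- A minimal sketch of one way to do this (the actual values broadcast by the open callback aren't spelled out in these notes, so the struct fields and variable names below are hypothetical): pack the values into one small struct and broadcast it once as bytes.

        /* Hypothetical sketch: replace two MPI_Bcast() calls with one.
         * 'mpi_rank', 'comm', 'local_eof' and 'local_rc' are illustrative,
         * not the actual VFD variables. */
        struct open_bcast_info {
            MPI_Offset eof;        /* e.g., file size discovered by rank 0 */
            int        open_rc;    /* e.g., success/failure of the open on rank 0 */
        } info;

        if (mpi_rank == 0) {
            info.eof     = local_eof;
            info.open_rc = local_rc;
        }
        /* one broadcast of the whole struct instead of two separate calls */
        MPI_Bcast(&info, (int)sizeof(info), MPI_BYTE, 0, comm);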
- 'close' VFD callback:
- Should be able to eliminate MPI_Barrier() from here by pushing it into the open callback and only invoking it when a file is created/truncated.
- 'read' & 'write' VFD callbacks:
- Someday, it should be possible to eliminate tracking the current seek offset, but it's useful currently, particularly on BG/* machines.
- 'truncate' VFD callback:
- If operating on a non-NFS file system, the truncate() call could be shifted to occur only on process 0, which can then MPI_Bcast() the result. It should be possible to detect when a file is on an NFS file system with statfs(). We should look at the ROMIO configure script to see which variant of statfs() is supported on a given OS. The code in MPICH for dealing with this is at: mpich2/src/mpi/romio/adio/common/ad_fstype.c
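- A minimal, Linux-only sketch of the statfs() check (0x6969 is NFS_SUPER_MAGIC from <linux/magic.h>; other OSes need the statfs()/statvfs() variants that ROMIO's configure script probes for):

        #include <sys/vfs.h>     /* statfs() on Linux */

        /* Return 1 if 'path' lives on NFS, 0 if not, -1 on error. */
        static int file_is_on_nfs(const char *path)
        {
            struct statfs fs;

            if (statfs(path, &fs) < 0)
                return -1;
            return (fs.f_type == 0x6969);    /* NFS_SUPER_MAGIC */
        }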
- 'read' & 'write' VFD callbacks (continued):
- We don't have to memset() the MPI_Status struct to zero.
- We should cache the "using special MPI view" flag across I/O operations, to avoid up to 1/2 of the MPI_File_set_view() calls.
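- A minimal sketch of the cached flag, assuming a hypothetical 'view_is_default' field in the VFD's file struct. The idea is to skip the "reset to the default byte view" call when the previous operation never installed a derived-datatype view. (MPI_File_set_view() is collective, so this only works if every rank keeps the flag consistent, which it will if all ranks follow the same I/O path.)

        /* 'use_derived_view', 'disp' and 'file_type' come from the current
         * I/O request; 'file->view_is_default' is the hypothetical cached flag. */
        if (use_derived_view) {
            MPI_File_set_view(file->f, disp, MPI_BYTE, file_type, "native", MPI_INFO_NULL);
            file->view_is_default = 0;
        }
        else if (!file->view_is_default) {
            /* only reset the view if a previous operation actually changed it */
            MPI_File_set_view(file->f, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
            file->view_is_default = 1;
        }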
- Avoiding “all collective” metadata operations in HDF5
- The ANL developers can see the problem, and the new MPI “RMA” operations will probably give a reasonable way to do this.
- Need the MPI RMA “passive target” operations in order to guarantee that progress is made, however.
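- A minimal sketch of MPI-2 passive-target access (MPI_Win_lock/MPI_Get/MPI_Win_unlock), assuming each process exposes a small piece of metadata state in a window; 'owner', 'flag' and 'comm' are illustrative. The point of passive target is that the owning rank does not have to make a matching call for the access to complete:

        int flag = 0;                       /* memory exposed by every rank */
        int remote_flag;
        MPI_Win win;

        MPI_Win_create(&flag, (MPI_Aint)sizeof(int), (int)sizeof(int),
                       MPI_INFO_NULL, comm, &win);

        /* passive target: no matching call is needed on 'owner' */
        MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
        MPI_Get(&remote_flag, 1, MPI_INT, owner, 0, 1, MPI_INT, win);
        MPI_Win_unlock(owner, win);         /* the MPI_Get is complete after the unlock */

        MPI_Win_free(&win);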
- Cache Improvement #1: Distribute writes across the processes
- Would probably help some, but likely not very much.
- Cache Improvement #2: Do all metadata I/O from process 0
- Yes, please, do this whenever possible! :-)
- It's possible for the HDF5 library to infer that certain metadata reads are collective [when they are the result of a [collective] HDF5 API call that could modify or create metadata].
- Otherwise, we'd have to add a [new?] "collective operation" property to the group/dataset/named datatype access property lists for use with HDF5 API calls that only read metadata [so that the application could indicate that this optimization could be invoked]
- There are at least two stages to this improvement:
- Perform individual I/O accesses from process 0 and MPI_Bcast() the data to the other processes, as the I/O accesses occur (see the sketch after this list).
- Bundle all the individual I/O reads from a metadata operation (which are only made from process 0) into a single buffer and MPI_Bcast() that to the other processes. This would be a more intrusive change to the HDF5 library's algorithms, but might be smoothed over with some sort of "collective I/O buffer", which "tricked" the other processes into thinking that individual I/O operations were still occurring.
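- A minimal sketch of the first stage, assuming 'addr'/'size' describe the metadata being read and 'fh'/'comm' are the file's MPI handles:

        /* only rank 0 touches the file; everyone else gets the bytes via broadcast */
        if (mpi_rank == 0)
            MPI_File_read_at(fh, (MPI_Offset)addr, buf, (int)size, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_Bcast(buf, (int)size, MPI_BYTE, 0, comm);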
- Cache Improvement #3: Bundle up the metadata writes at each sync point in an MPI derived type
- Yes, we should definitely do this, in order to pass as much information down to the MPI implementation as possible.
- We should create an MPI datatype and perform the MPI_File_write_at() from rank 0.
- It may be useful to go ahead and create an MPI datatype describing all the dirty metadata in the file (not just the entries that should be evicted) and flush all the dirty metadata out in one call to MPI_File_write_at(). However, if the MPI implementation/file system/OS doesn't buffer these [well], the repeated writing of "hot" cache entries will probably be a drag on the performance. So, we should probably default to only performing I/O on the entries to evict and give the application a property/metadata cache config knob to indicate that they want all the dirty entries flushed.
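- A minimal sketch of the rank-0 write, assuming the number of entries 'n', their file offsets ('disps[]', as MPI_Aint), lengths ('lens[]'), and a packed image of the entries ('write_buf', 'total' bytes, in 'disps' order) are already in hand:

        MPI_Datatype file_type = MPI_BYTE;     /* ranks other than 0 keep a plain byte view */

        if (mpi_rank == 0) {
            MPI_Type_create_hindexed(n, lens, disps, MPI_BYTE, &file_type);
            MPI_Type_commit(&file_type);
        }

        /* MPI_File_set_view() is collective, so every rank participates ... */
        MPI_File_set_view(fh, 0, MPI_BYTE, file_type, "native", MPI_INFO_NULL);

        /* ... but the write itself is independent and issued only from rank 0 */
        if (mpi_rank == 0) {
            MPI_File_write_at(fh, 0, write_buf, total, MPI_BYTE, MPI_STATUS_IGNORE);
            MPI_Type_free(&file_type);
        }

        /* collectively restore the default byte view */
        MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);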
- Cache Improvement #4: Use AIO for metadata writes
- Definitely the right idea, but since async I/O ends up blocking on most current systems, it's probably not the time yet.
- Cache Improvement #5: Allow writes between sync points
- This is a clever idea and we should do it. (But do optimizations #2 & #3 first)
- Would really benefit from asynchronous write of individual metadata cache entries. If AIO on the platform isn't up for it, maybe we could spawn a background thread?
- Cache Improvement #6: Journaling
- General discussion, but since it's not implemented in parallel yet, no particular directions/improvements were suggested.
Allocating Space for Dataset Elements:
- Chunked datasets would really benefit from an “implicit” chunk index (in multiple places in the library, including here and when performing I/O). This would be a way to know algorithmically where each chunk of a chunked dataset is located, eliminating the need to store an index in the file. The chunks would be allocated in one big block (as large as a contiguous dataset) and located by indexing into that block using the coordinates of the chunk within the overall dataset, similar to how we locate elements in datasets (see the arithmetic sketch after this list). However, this would only work for chunked datasets that are:
- Non-sparse - we are allocating space for all the chunks
- Non-extensible - the algorithm would only work if the chunks were all stored in one contiguous section of the file
- However, it would be possible to start with an “implicit” index and then create a real index data structure if/when the dataset’s dimensions were extended
- No-filters - the chunks must all have the same size and not move around in the file
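- A minimal sketch of the implicit-index arithmetic, mapping a chunk's coordinates to its file address in row-major order (HDF5's haddr_t/hsize_t types; 'base_addr', 'chunks_per_dim[]' and 'chunk_size_bytes' are illustrative):

        static haddr_t
        implicit_chunk_addr(haddr_t base_addr, unsigned ndims,
                            const hsize_t chunk_idx[],        /* chunk's index in each dimension */
                            const hsize_t chunks_per_dim[],   /* number of chunks in each dimension */
                            hsize_t chunk_size_bytes)         /* allocated size of one chunk */
        {
            hsize_t linear = 0;
            unsigned u;

            /* linearize the chunk coordinates in row-major (C) order ... */
            for (u = 0; u < ndims; u++)
                linear = linear * chunks_per_dim[u] + chunk_idx[u];

            /* ... and index into the one contiguous block of chunks */
            return base_addr + linear * chunk_size_bytes;
        }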
- Distributing the writing of a chunked dataset’s fill value across many processes would be good.
- We could build an MPI datatype that uses a stride of 0 in memory to replicate a fill buffer containing the fill value across many chunks in one MPI_File_write_at() call. (Works for both chunked and contiguous dataset storage; see the sketch below.)
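- A minimal sketch of the stride-0 trick for one process, assuming 'fill_buf' holds one chunk's worth of fill values and 'chunk_addr[]' (MPI_Aint file offsets), 'blocklens[]' (each equal to 'chunk_bytes') and 'nchunks' describe the chunks this process is responsible for. (When the work is distributed, every rank makes the collective MPI_File_set_view() call for its own subset of chunks.)

        MPI_Datatype mem_type, file_type;

        /* memory type: the same fill buffer, repeated 'nchunks' times (stride 0) */
        MPI_Type_create_hvector(nchunks, chunk_bytes, (MPI_Aint)0, MPI_BYTE, &mem_type);
        MPI_Type_commit(&mem_type);

        /* file type: one block at each chunk's address */
        MPI_Type_create_hindexed(nchunks, blocklens, chunk_addr, MPI_BYTE, &file_type);
        MPI_Type_commit(&file_type);

        MPI_File_set_view(fh, 0, MPI_BYTE, file_type, "native", MPI_INFO_NULL);
        MPI_File_write_at(fh, 0, fill_buf, 1, mem_type, MPI_STATUS_IGNORE);

        MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
        MPI_Type_free(&mem_type);
        MPI_Type_free(&file_type);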
- Bulk allocation of chunks and bulk loading of chunk addresses into the indexing data structure would be good.
Reading or Writing Dataset Elements:
- Should try to increase the number of situations in which collective I/O is supported:
- Support point selections
- May be able to support collective I/O from datatype conversion buffer
- Collective I/O on Contiguous Datasets
- Regular Hyperslab Selection I/O
- Don’t commit intermediate/temporary MPI datatypes, only the final one that will be used for I/O
- Irregular Hyperslab I/O:
- Replace use of MPI_Type_hindexed() with MPI_Type_create_hindexed()
- Replace use of MPI_Type_hvector() with MPI_Type_create_hvector()
- Don’t commit intermediate/temporary MPI datatypes, only the final one that will be used for I/O (see the sketch after this list)
- We should reuse MPI datatypes between identical spans.
- It is possible [and desirable] to make the MPI datatypes correspond exactly to the span tree describing the HDF5 dataspace selection. So, when there are shared spans in the span tree, we should only create one MPI datatype describing them.
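- A minimal sketch of the datatype construction for a regular 2-D pattern, using the non-deprecated constructors and committing only the final type ('nspans', 'span_lens[]', 'span_disps[]', 'nrows' and 'row_stride_bytes' are illustrative):

        MPI_Datatype row_type, sel_type;

        /* intermediate type: one "row" of spans; never committed */
        MPI_Type_create_hindexed(nspans, span_lens, span_disps, MPI_BYTE, &row_type);

        /* final type: replicate the row down the selection; this is the only
         * type that is committed and handed to MPI-IO */
        MPI_Type_create_hvector(nrows, 1, row_stride_bytes, row_type, &sel_type);
        MPI_Type_commit(&sel_type);

        /* the intermediate type can be freed once it has been used */
        MPI_Type_free(&row_type);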
- Collective I/O on Chunked Datasets
- For multi-chunk I/O w/no optimizations (H5D_multi_chunk_collective_io_no_op() case)
- Should compute the max. # of collective chunks with MPI_Allreduce() (as well as the min. # of collective chunks)
- When performing independent I/O on a chunk, we should create an MPI datatype for the HDF5 dataspace selection and call MPI_File_write_at() to perform the I/O (instead of performing serial I/O!)
- For case when determining I/O mode for each chunk (H5D_multi_chunk_collective_IO() case)
- Need to find a more efficient algorithm that scales effectively when the number of processes or chunks is large.
- We should be able to at least compute the number of processes performing I/O on a given chunk with an MPI_Allreduce, instead of iterating over the number of processes.
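- A minimal sketch of both MPI_Allreduce() uses, assuming 'local_collective_chunks' is the number of chunks this process wants to handle collectively, and 'touches[]' / 'procs_per_chunk[]' are int arrays of length 'num_chunks' (touches[i] is 1 if this process selects chunk i, else 0):

        int max_chunks, min_chunks;

        /* max./min. number of collective chunks over all processes */
        MPI_Allreduce(&local_collective_chunks, &max_chunks, 1, MPI_INT, MPI_MAX, comm);
        MPI_Allreduce(&local_collective_chunks, &min_chunks, 1, MPI_INT, MPI_MIN, comm);

        /* number of processes selecting each chunk, computed in one call
         * instead of iterating over the processes */
        MPI_Allreduce(touches, procs_per_chunk, num_chunks, MPI_INT, MPI_SUM, comm);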