Parallel HDF5 Design

1. Design Overview

This section describes the functional requirements of the Parallel HDF5 (PHDF5) software and the assumed system requirements. Section 2 describes the programming model of the PHDF5 interface. Section 3 presents several sample PHDF5 programs.

1.1 Functional requirements

Parallel HDF5 is designed to meet the following functional requirements:

1.2. Design Specification

2. Programming Model

PHDF5 supports parallel access to HDF5 files in the MPI environment. MPI is a standard interface for the distributed memory parallel computing environment in which inter-process communication is done by message passing. The MPI standard documents are available at http://www.mpi-forum.org. Other related MPI information, such as tutorials and implementations, can be found at http://www.mcs.anl.gov/Projects/mpi/.

The following discussion describes the programming model of HDF5 specific to the MPI environment. For a general and more complete understanding of the HDF5 library, one may consult the documents and user guide at http://hdf.ncsa.uiuc.edu/HDF5.

HDF5 uses property lists to control the file access mechanism. The general model for accessing an HDF5 file in parallel consists of the following steps:

2.1. Set up the access property list

Each process of the MPI communicator creates an access property list via H5Pset_mpi and sets it up with MPI information (communicator, info object), as required by MPI_File_open as defined in MPI-2. Note that H5Pset_mpi does not duplicate the communicator or the info object; the PHDF5 library duplicates them when an HDF5 file is opened. Therefore, any changes to the communicator or info object affect the H5Fcreate/H5Fopen calls made after the changes. Users are advised not to modify the communicator or the info object after the H5Pset_mpi call.

(From this point on, processes are limited to those that are members of the communicator defined in the H5Pset_mpi call.)

Example:

/* Set up the file access property list with parallel I/O access. */
acc_pl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_mpi(acc_pl, comm, info);

2.2. File create/open

All processes of the MPI communicator open an HDF5 file by a collective call (H5Fcreate or H5Fopen) with the access property list. The call must be collective because the underlying MPI_File_open() is a collective call.

Example:

/* Create the file collectively. */
fid = H5Fcreate("filexyz", H5F_ACC_TRUNC, H5P_DEFAULT, acc_pl);

2.3. Dataset create/open

All processes of the MPI communicator open a dataset by a collective call (H5Dcreate or H5Dopen). This version supports only collective dataset open. The call must be collective because all processes need common knowledge of the dataset object being accessed; this allows cooperative changes to the dataset object later. A future version may support datasets opened by a subset of the processes that have opened the file.

Example:

/* Create a 512x1024 dataset. */
hsize_t dims[2] = {512, 1024};
sid = H5Screate_simple(2, dims, NULL);
dataset = H5Dcreate(fid, "dataset1", H5T_NATIVE_INT, sid, H5P_DEFAULT);

2.4. Dataset access

2.4.1. Independent dataset access

Each process may make an arbitrary number of independent data I/O accesses via independent calls (H5Dread or H5Dwrite) to the dataset, with the transfer property list set for independent access. (The default transfer mode is independent.)

If the dataset has an unlimited dimension and if the H5Dwrite is writing data beyond the current dimension size of the dataset, all processes that have opened the dataset must make a collective call (H5Dallocate) to allocate more space for the dataset before the independent H5Dwrite call. The reason is that when data is written beyond the current dimension size, that dimension size must be increased to hold the new data. Changing the dimension size of a dataset is a structural change of the object and must be done by all processes.
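The allocate-then-write sequence described above can be sketched as follows. This is only a sketch: H5Dallocate is the collective allocation call named in this design (its exact signature is assumed here), and the dataset handle, dataspaces, sizes, and buffer are hypothetical.

```
/* Sketch: collectively grow an unlimited-dimension dataset, then write
 * independently.  The new_dims values are hypothetical. */
hsize_t new_dims[2] = {1024, 1024};   /* beyond the current first-dim size */

/* Collective: every process that has opened the dataset participates. */
H5Dallocate(dataset, new_dims);

/* Independent: each process writes its own portion (selections assumed
 * set up beforehand in mem_sid/file_sid) with the default transfer mode. */
H5Dwrite(dataset, H5T_NATIVE_INT, mem_sid, file_sid, H5P_DEFAULT, my_data);
```

The collective call establishes the new dimension size in all processes before any of them writes past the old boundary.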

2.4.2. Collective dataset access

All processes that have opened the dataset may do collective data I/O access by collective calls (H5Dread or H5Dwrite) to the dataset, with the transfer property list set for collective access. Pre-allocation (H5Dallocate) is not needed for unlimited-dimension datasets, since the H5Dallocate call, if needed, is done internally by the collective data access call.

Though all collective accesses can be replaced with independent accesses by each process, collective accesses can provide better performance when the equivalent independent accesses would result in small fragments. A simple example is a two-dimensional dataset stored in row-major order. When each process needs to access the data by columns, individual independent access by each process results in many uncoordinated accesses to the dataset, each access segment only the size of the column width. But if all processes access the dataset with one collective call, the library, with the extra information about the access pattern, can combine the small accesses into bigger I/O accesses and use gather/scatter to transfer data among the processes.
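The column-access pattern above might look as follows. This is a sketch only: the selections and buffer are hypothetical, and the collective-transfer property call shown (H5Pset_dxpl_mpio with H5FD_MPIO_COLLECTIVE) is the name used in later HDF5 releases, assumed here in place of whatever this design finally adopts.

```
/* Sketch: each process reads one column of a row-major 2-D dataset with a
 * single collective call.  The hyperslab selection picks column mpi_rank. */
hsize_t start[2] = {0, mpi_rank};     /* begin at row 0, my column */
hsize_t count[2] = {512, 1};          /* whole column, one element wide */

H5Sselect_hyperslab(file_sid, H5S_SELECT_SET, start, NULL, count, NULL);

xfer_pl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer_pl, H5FD_MPIO_COLLECTIVE);   /* later-API name */

/* Collective: all processes call H5Dread together; the library can
 * coalesce the per-column fragments into larger I/O requests. */
H5Dread(dataset, H5T_NATIVE_INT, mem_sid, file_sid, xfer_pl, col_buf);
```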

2.4.3. Dataset attributes access

Changes to attributes can occur only at the main process (process 0). Read-only access to attributes can occur independently in each process that has opened the dataset.

2.5. Dataset close

All processes that have opened the dataset must close the dataset by a collective call (H5Dclose). The call must be collective so that all processes have the same knowledge that the dataset is no longer being accessed.

2.6. File close

All processes that have opened the file must close the file by a collective call (H5Fclose). The call must be collective because the underlying MPI_File_close() is a collective call.

3. Parallel HDF5 Example

The following are examples of code using the parallel HDF5 API, drawn from the PHDF5 test program (its main program and the testphdf5.h header).

3.1. Opening multiple HDF5 files with different communicators

This example shows how to open two HDF5 files with two different communicators containing two groups of processes.

Example: Multi-open
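A minimal sketch of this pattern follows. It assumes two groups formed by splitting MPI_COMM_WORLD on rank parity; the file names and variable names are hypothetical.

```
/* Sketch: split MPI_COMM_WORLD into two groups; each group opens its own
 * HDF5 file with its own communicator. */
MPI_Comm split_comm;
int      mpi_rank, color;

MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
color = mpi_rank % 2;                  /* two groups: even and odd ranks */
MPI_Comm_split(MPI_COMM_WORLD, color, mpi_rank, &split_comm);

acc_pl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_mpi(acc_pl, split_comm, MPI_INFO_NULL);

/* Collective within split_comm only: each group creates a different file. */
fid = H5Fcreate(color ? "file_odd" : "file_even",
                H5F_ACC_TRUNC, H5P_DEFAULT, acc_pl);
```

Because the communicator recorded in the access property list is split_comm, H5Fcreate is collective only over that group, so the two opens proceed concurrently.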

3.2. Accessing a dataset via independent transfer mode

This example shows how to create a fixed dimension dataset. Each process then writes and reads data to and from part of the dataset independent of other processes.

Example: Independent access
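The independent-access pattern might be sketched as follows, assuming a 512x1024 dataset divided evenly into row blocks; mpi_size, the dataspaces, and the buffer are hypothetical.

```
/* Sketch: each process writes its own block of rows of a fixed-size
 * dataset independently. */
hsize_t start[2] = {mpi_rank * (512 / mpi_size), 0};   /* my slab of rows */
hsize_t count[2] = {512 / mpi_size, 1024};

mem_sid = H5Screate_simple(2, count, NULL);
H5Sselect_hyperslab(file_sid, H5S_SELECT_SET, start, NULL, count, NULL);

/* Independent: the default transfer property list implies independent
 * access, so no coordination with other processes occurs here. */
H5Dwrite(dataset, H5T_NATIVE_INT, mem_sid, file_sid, H5P_DEFAULT, my_rows);
```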

3.3. Accessing a dataset via collective transfer mode

This example shows how to create a fixed dimension dataset. All processes then write and read data to and from the dataset in the collective mode.

Example: Collective access
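The collective counterpart differs from the independent case only in the transfer property list; a sketch, with the selections assumed set up as in the independent example and H5Pset_dxpl_mpio used as a stand-in later-API name for requesting collective transfer.

```
/* Sketch: all processes write their slabs with one collective call.
 * Only the transfer property list differs from the independent case. */
xfer_pl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer_pl, H5FD_MPIO_COLLECTIVE);   /* later-API name */

H5Dwrite(dataset, H5T_NATIVE_INT, mem_sid, file_sid, xfer_pl, my_rows);
H5Pclose(xfer_pl);
```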

3.4. Accessing an extendible dimension dataset

This example shows how to create an extendible dimension dataset. All processes then collectively extend the size of the dataset. Then each process writes and reads data to and from part of the dataset independent of other processes.

Example: Independent access to extendible dataset
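The extendible-dataset pattern can be sketched as follows. H5Dallocate is the collective allocation call named in this design (signature assumed); create_pl stands for a creation property list set up for an extendible layout, and all sizes and buffers are hypothetical.

```
/* Sketch: create an extendible dataset, extend it collectively, then
 * write independently. */
hsize_t dims[2]     = {0, 1024};
hsize_t max_dims[2] = {H5S_UNLIMITED, 1024};
hsize_t new_dims[2] = {512, 1024};

sid = H5Screate_simple(2, dims, max_dims);
dataset = H5Dcreate(fid, "extendible", H5T_NATIVE_INT, sid, create_pl);

/* Collective: all processes agree on the new size before anyone writes. */
H5Dallocate(dataset, new_dims);

/* Independent: each process writes its own rows as in Section 3.2. */
H5Dwrite(dataset, H5T_NATIVE_INT, mem_sid, file_sid, H5P_DEFAULT, my_rows);
```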


Comments and questions: hdfparallel@ncsa.uiuc.edu
Last modified: 29 Dec 1998