Implementation Design of the Parallel HDF5 Diff Tool

(Revision: December 15, 2004 by Leon Arber, Albert Cheng)

Purpose

This presents the design to implement the Parallel HDF5 Diff tool (ph5diff) in an MPI parallel environment.

Overview

We have chosen the Parallel Processing approach from the initial design consideration[1]. The parallelization will come from doing multiple datasets simultaneously, instead of splitting up a single dataset among multiple tasks. ph5diff will take two hdf5 files, find datasets common to both files, and then find the differences, if any, in these datasets in both files.

This is done by having two sets of tasks, a Manager task (task 0) and Worker tasks (task 1...k).

Worker tasks:

Given a dataset and the names of two hdf5 files, find the differences in the corresponding dataset in the 2 files and report the result. The dataset is guaranteed to exist in both files.

Manager task:

Given two hdf5 files, walk through both files to generate a list of datasets to compare. As soon as a comparable dataset is identified, its name, along with the names of the two files, is passed along to a Worker task for actual comparison.

Implementation details

Output

The way h5diff is currently implemented, as soon as a difference in two datasets is found, a print statement is executed. Although this approach works well on a single processor, with multiple processors this would result in a jumble of mismatched outputs being printed to the screen and there would be no way to separate out which differences belong to which dataset.

To address this, an Output Token (OT) is being introduced. A task can only print when it has possession of the OT. The Manager is the initial owner of the OT. When a Worker task completes processing of a dataset and needs to print its output, it requests the OT from the Manager. The Manager, upon receipt of a request for printing, will pass the OT to that Worker task. The Worker will then print and return the OT upon completion.

Further complications with output

Since a difference in the two datasets causes a print statement to be executed immediately, a task that does not have the OT will be blocked until it has acquired it in order to print. If a pair of files has numerous scattered changes across multiple datasets, this will essentially cause ph5diff to run serially as each Worker task waits for the OT instead of looking for the rest of the differences in its dataset. A Worker task will only continue looking for differences after it has obtained the OT and printed the difference it was waiting to print.

To address this, the original h5diff code will be modified so that, based on a compile-time parameter, a print statement will either print immediately or buffer its output. In the parallel case, this allows a Worker task to continue processing the differences in a dataset after it finds the first difference, as it will simply buffer the output instead of waiting for the OT. When a Worker task finishes processing a dataset, it will then wait for the OT to print its accumulated list of differences. Once it gets the token and prints its differences, it will notify the Manager that it is free and can be assigned another dataset to compare.

The initial implementation will use a buffer of the size of 10KB. With the average of 50 characters per line, the buffer will hold up to 200 lines of output. This size should work well for files that are similar and have only a few differences on each dataset. In the event that the pending output exceeds the buffer capacity, the Worker task will immediately request the OT for printing. The Worker is blocked until it has acquired the OT. It then prints and compares the rest of the dataset, printing as soon as it finds any differences. When it has completed comparing the dataset, it releases the OT back to the Manager and informs the Manager that it is free for another dataset comparison. This will serialize the Workers if the files have many differences on most datasets. A command option is provided for the user to specify a different buffer size such that a larger buffer size can be used in anticipation of files of the preceding situation.

Scheduling of work

Because some datasets might take longer to process than others, there needs to be some scheduling system so that the Manager knows which tasks are free and which are busy. The Manager will maintain the state of all Worker tasks. Initially, each Worker task will be free, meaning ready to accept an assignment. When the Manager finds a dataset that needs comparison, he will assign it to the first free Worker task and mark that task as busy. When the Worker task completes the dataset and finishes printing the results, it will notify the Manager that it is free again.

H5diff modes

Non-verbose mode:

In non-verbose mode, h5diff simply prints the number of differences in the files. Non-verbose mode removes the need for a print token, and when ph5diff is running in this mode, each Worker task will simply report back the number of differences it found to the Manager task, instead of buffering them and waiting for a print token.

Other print modes:

Other print modes, such as printing only up to a certain number of differences or printing only if the number of differences is greater than some number will be implemented by notifying the Worker tasks of the printing policy associated with the dataset. The Worker tasks will then print accordingly.

Protocol between the Manager and Workers

Manager computes set of datasets to diff. If this set has more than one item, the Manager sends an MPI_TAG_PARALLEL message to all the Workers. The payload of this message contains the filenames that are going to be diff’ed. If the set contains only one item, the Manager sends a MPI_TAG_SINGLE message to all the Workers.

The Workers initially lie in wait for either the MPI_TAG_PARALLEL or MPI_TAG_SINGLE message. Upon receipt of MPI_TAG_PARALLEL the Workers open the two filenames provided and proceed to wait for an MPI_TAG_ARGS message. Upon receipt of a MPI_TAG_SINGLE, the Worker tasks gracefully exits.

If there are datasets to diff, at this point all the Workers are blocked waiting to receive an MPI_TAG_ARGS message which will contain the names of the datasets to diff and the options for the diff. The Manager maintains a list of the status of all the Workers (which are all initially free) and proceeds to distribute datasets for diffing to them. Once a Worker task receives an MPI_TAG_ARGS, it proceeds to diff the two datasets. Then, it sends either an MPI_TAG_DONE or an MPI_TAG_TOK_REQUEST message back to the Manager.

If there are busy tasks, the Manager periodically checks for responses from them. If all the tasks are busy, the Manager blocks waiting for a response.

The response can be either MPI_TAG_DONE, signifying that the Worker task has finished diffing the dataset and is ready to receive another MPI_TAG_ARGS message, or PI_TAG_TOK_REQUEST. The payload of MPI_TAG_DONE is the number of differences found in the dataset. Upon receipt of an MPI_TAG_DONE, the Manager marks that task as free and will eventually, if there is more work available, give that task another pair of datasets to diff. Upon receipt of an MPI_TAG_TOK_REQUEST, the Manager checks to see if he has the token. If so, then the Manager sends this token to the task that requested it in a MPI_TAG_PRINT_TOK message and marks himself as not in possession of the print token. The Manager will proceed to periodically check for an MPI_TAG_TOK_RETURN, which signifies that the Worker task has finished printing and is ready for an MPI_TAG_ARGS message. The payload of the MPI_TAG_TOK_RETURN message is the number of differences found in the dataset. If the Manager receives an MPI_TAG_TOK_REQUEST but does not currently have the token, it will wait for an MPI_TAG_TOK_RETURN before sending the Worker task an MPI_TAG_PRINT_TOK.

Upon completion of all the diffs of the datasets, the Manager will send an MPI_TAG_END to all of the Worker tasks. This will tell the Worker tasks to gracefully exit. The Manager then exits himself.

PH5DIFF Protocol Definitions

MPI tags are used to implement the protocol. The tag definitions follow:

MPI_TAG_ARGS
Sent by the Manager to the Worker tasks. It contains the dataset and options for the diff call that they need to make.

MPI_TAG_PRINT_TOK
Send by the Manager to a Worker task. This tells the task that they have the print token and can proceed to print their output.

MPI_TAG_TOK_REQUEST
Sent by the Worker to the Manager. This is a request for a print token.

MPI_TAG_DONE
Sent by the Worker to the Manager. This informs the Manager that the Worker has finished diffing the dataset AND doesn't need to print anything and is ready to receive another piece of work.

MPI_TAG_TOK_RETURN
Send by the Worker to the Manager. This informs the Manager that the Worker has finished printing their output and is returning the print token. It signifies that they are ready to receive another piece of work.

MPI_TAG_END
Send by the Manager to tell the Workers that all the datasets have been diff'ed and that the Workers can exit.

MPI_TAG_PARALLEL
Sent by the Manager to tell the Workers that a parallel diff is about to start. This tag also carries a payload which provides the Workers with the filenames that need to be diff’ed so that the Workers can open them.

MPI_TAG_SINGLE
Sent by the Manager to tell the Workers that their services will not be required for this diff and they can all exit.

[1] Design specification of the Parallel HDF5 Diff Tool.