Design specification of the Parallel HDF5 Diff Tool

(Revision: December 14, 2004 by Albert Cheng)

Purpose

An analysis of two designs for a parallel HDF5 diff tool (ph5diff).

Requirements of Ph5diff

The h5diff tool compares two HDF5 files and reports the differences.  It provides a great means for users to manage their HDF5 files and is widely used.  The tool is implemented for serial execution and has its limits when it processes large HDF5 files.  This design explores how to add a parallel version of the tool that takes advantage of parallel processing machines to shorten the wall clock time of comparing large HDF5 files.

Two possible designs

Parallel Data

All processes recursively traverse down the HDF5 file, working on one dataset at a time together.  Each dataset is divided evenly among all processes.  Each process reads the data of its portion of the dataset, compares it with the corresponding portion of the dataset in the other file, reports any differences found and then returns the result code.  The processes then repeat with the next dataset.
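
As an illustration, below is a minimal sketch (in C, against the current HDF5 C API) of how each process might select, read and compare its even share of a one-dimensional integer dataset using a hyperslab selection.  The function name, the element type, the use of already-open file handles, and the assumption that the dataset has at least as many elements as there are processes are all simplifications for this sketch; the real tool must handle arbitrary ranks, datatypes and sizes.

    /* Sketch: each process compares its even share of a 1-D integer
     * dataset.  file1/file2 are assumed to be already-open file handles;
     * the real tool would print each difference (subject to the OT)
     * instead of merely counting them. */
    #include <stdlib.h>
    #include "hdf5.h"

    static int diff_dataset_portion(hid_t file1, hid_t file2,
                                    const char *dset_name,
                                    int mpi_rank, int mpi_size)
    {
        hid_t   dset1 = H5Dopen2(file1, dset_name, H5P_DEFAULT);
        hid_t   dset2 = H5Dopen2(file2, dset_name, H5P_DEFAULT);
        hid_t   fspace1 = H5Dget_space(dset1);
        hid_t   fspace2 = H5Dget_space(dset2);
        hsize_t dims[1];

        H5Sget_simple_extent_dims(fspace1, dims, NULL);

        /* Divide the dataset evenly; the last process takes the remainder.
         * Assumes dims[0] >= mpi_size. */
        hsize_t per_proc = dims[0] / mpi_size;
        hsize_t start[1] = { per_proc * mpi_rank };
        hsize_t count[1] = { (mpi_rank == mpi_size - 1)
                             ? dims[0] - start[0] : per_proc };

        H5Sselect_hyperslab(fspace1, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Sselect_hyperslab(fspace2, H5S_SELECT_SET, start, NULL, count, NULL);

        hid_t mspace = H5Screate_simple(1, count, NULL);
        int  *buf1 = malloc(count[0] * sizeof(int));
        int  *buf2 = malloc(count[0] * sizeof(int));

        H5Dread(dset1, H5T_NATIVE_INT, mspace, fspace1, H5P_DEFAULT, buf1);
        H5Dread(dset2, H5T_NATIVE_INT, mspace, fspace2, H5P_DEFAULT, buf2);

        /* Compare element by element and count the differences found. */
        int ndiff = 0;
        for (hsize_t i = 0; i < count[0]; i++)
            if (buf1[i] != buf2[i])
                ndiff++;

        free(buf1); free(buf2);
        H5Sclose(mspace); H5Sclose(fspace1); H5Sclose(fspace2);
        H5Dclose(dset1); H5Dclose(dset2);
        return ndiff;
    }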

Parallel Processing

Process 0 is designated as the Manager and all remaining processes as Workers.  The Manager recursively traverses down the HDF5 file, assigning each dataset to one of the Workers to do the comparison.  Each Worker receives an assignment of a dataset, opens it in both files, compares them, reports any differences found and then returns the result code.  It then repeats with another assignment from the Manager.
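
A minimal sketch of the Manager/Worker assignment loop, written with plain MPI point-to-point messages, is shown below.  The message tags and the helpers next_dataset_name() and diff_dataset() are assumptions made only for this sketch.

    /* Sketch of the Manager/Worker assignment loop.  A Worker announces
     * that it is idle by sending its last result code (initially 0); the
     * Manager replies with either a dataset path (TAG_WORK) or a
     * termination message (TAG_DONE). */
    #include <string.h>
    #include <mpi.h>

    #define TAG_WORK 1              /* message carries a dataset path    */
    #define TAG_DONE 2              /* no more work; Worker may exit     */

    int next_dataset_name(char *buf, size_t len);  /* assumed: next path */
    int diff_dataset(const char *name);            /* assumed: open both
                                                      files and compare  */

    void manager(int nworkers)
    {
        char       name[256];
        int        ndiff;
        MPI_Status st;

        /* Hand out one dataset per idle Worker, collecting result codes. */
        while (next_dataset_name(name, sizeof(name))) {
            MPI_Recv(&ndiff, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            MPI_Send(name, (int)strlen(name) + 1, MPI_CHAR,
                     st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
        }
        /* Drain the last result from every Worker and tell it to stop. */
        for (int w = 1; w <= nworkers; w++) {
            MPI_Recv(&ndiff, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            MPI_Send(NULL, 0, MPI_CHAR, st.MPI_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD);
        }
    }

    void worker(void)
    {
        char       name[256];
        int        ndiff = 0;        /* first send only signals idleness */
        MPI_Status st;

        for (;;) {
            MPI_Send(&ndiff, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(name, (int)sizeof(name), MPI_CHAR, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE)
                break;
            ndiff = diff_dataset(name);
        }
    }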

Issues facing each design

A common issue

The processes report the differences they find by printing them to the output file.  Since there is only one output file (stdout), if each process prints the differences as it finds them, the results of all processes will be intermixed and impossible to comprehend.

A simple solution is for each process to have its own output file.  But this results in numerous files, and for the case of Parallel Data, the differences of each dataset are fragmented into multiple files.  For the case of Parallel Processing, the differences of all datasets within a group will be spread across multiple files.  Either way, it is harder to read than just one output file.

A better solution is to pass around an output token.  A process needs to acquire the Output Token (OT) before doing any printing and releases the OT when it is done with printing.  This allows output to be produced in a synchronized manner, though there is no guarantee that the underlying parallel operating system will display it in exactly that order, since the system may hold the output of each process in some system buffer before “flushing” it to the one output file.  But that is beyond the control of this tool.

Another solution that guarantees synchronized output is to funnel all output back to the main process, which prints it to the output file in a coordinated manner.

Parallel Data

All processes run in a lock-step fashion.  The OT will be passed in a round-robin fashion from process 0 to process 1 to process 2 and so on.  Eventually, process n-1 passes it back to process 0.  When a process finds a difference and needs to print something, it gets the OT from its previous neighbor.  When it has it, it prints the differences.  When it has completed its own portion of the dataset, it gets the OT from its previous neighbor IF it does not have it already.  Then it passes the OT to its next neighbor and proceeds to the next dataset.
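
The round-robin passing can be implemented with a trivial MPI token message.  The sketch below is simplified: it assumes every process has already collected its differences for the current dataset and that a hypothetical print_my_diffs() prints them, whereas the actual tool would request the OT as soon as it has something to print.

    /* Sketch of round-robin Output Token passing at the end of each
     * dataset.  Rank 0 is assumed to start with the token. */
    #include <stdio.h>
    #include <mpi.h>

    #define TAG_OT 99

    void print_my_diffs(void);   /* assumed: prints this rank's differences */

    void print_in_rank_order(int mpi_rank, int mpi_size)
    {
        int token = 1;           /* the token itself carries no data */

        /* Every rank except 0 must receive the OT from its previous
         * neighbor before it may print anything. */
        if (mpi_rank != 0)
            MPI_Recv(&token, 1, MPI_INT, mpi_rank - 1, TAG_OT,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        print_my_diffs();
        fflush(stdout);

        /* Pass the OT to the next neighbor; the last rank returns it to
         * rank 0 so the next dataset starts from a known state. */
        if (mpi_size > 1) {
            MPI_Send(&token, 1, MPI_INT, (mpi_rank + 1) % mpi_size, TAG_OT,
                     MPI_COMM_WORLD);
            if (mpi_rank == 0)
                MPI_Recv(&token, 1, MPI_INT, mpi_size - 1, TAG_OT,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }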

This works well if most datasets are big enough that it is worthwhile for each process to open the dataset just to process a portion of it.  If very few differences are found, or if all differences are concentrated in very few portions of the dataset, the processes will run in a parallel manner.  But if there are differences in many portions of the dataset, the overall execution becomes essentially serial since each process has to wait for the OT.

Parallel Processing

All processes run in a distributed computing fashion.  The OT will be controlled by the Manager and passed between it and the Workers.  When a Worker finds a difference and needs to print something, it gets the OT from the Manager.  When it has the OT, it prints all the differences.  When it has completed its assigned dataset, it returns the OT to the Manager IF it has it. Then it asks for the next assignment.
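
The exchange between a Worker and the Manager can be expressed as a small request/grant/release protocol.  The sketch below uses assumed tag values, and the Manager side is reduced to serving a single request; in the real tool the Manager would interleave OT requests with handing out assignments.

    /* Sketch of the Manager-held Output Token. */
    #include <stdio.h>
    #include <mpi.h>

    #define TAG_OT_REQUEST 10   /* Worker asks the Manager for the token */
    #define TAG_OT_GRANT   11   /* Manager hands the token to one Worker */
    #define TAG_OT_RELEASE 12   /* Worker gives the token back           */

    /* Worker side: block until the Manager grants the OT, print, then
     * return the OT. */
    void worker_print_with_ot(const char *text)
    {
        int dummy = 0;

        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_OT_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&dummy, 1, MPI_INT, 0, TAG_OT_GRANT,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        fputs(text, stdout);     /* safe: this Worker holds the OT */
        fflush(stdout);

        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_OT_RELEASE, MPI_COMM_WORLD);
    }

    /* Manager side: grant the OT to one requester at a time and wait for
     * it to come back before serving the next request. */
    void manager_serve_one_ot_request(void)
    {
        int        dummy;
        MPI_Status st;

        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_OT_REQUEST,
                 MPI_COMM_WORLD, &st);
        MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_OT_GRANT,
                 MPI_COMM_WORLD);
        MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_OT_RELEASE,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }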

This has a shortcoming: the Manager process is not doing “real” work.  Most of the time, it sits idle waiting for the Workers to do their assignments.  But if there are many Workers, the Manager represents an acceptably small percentage of idle processing power.  If one is really concerned about this, the design can be modified to have the Manager double as a Worker too.

This works well if there are many datasets of similar sizes to keep each Worker busy most of the time.  (It would be bad if there is one 100GB dataset and a few more of only 10KB; one process then takes a long time to compare the huge dataset while the others have finished and sit idle.)  The execution will run in a fairly parallel manner if most datasets have no differences.  But if most datasets have differences, the overall execution again becomes essentially serial since the Workers have to wait for the OT.

One solution to get around the OT serialization problem is for each Worker to print the differences of one dataset to a holding place such as a temporary file.  It then asks for the OT only at the end of processing that dataset and prints the content of the holding place to the output file.  This allows each Worker to keep processing its own assignment rather than getting stuck at the first sight of a difference, and reduces the severity of the serialization.  One more optimization is for the Worker to tell the Manager the name of the temporary file and let the Manager do the printing of the file content.  Since the Manager is often idle, this is a good use of the Manager.
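
A rough sketch of the temporary-file holding place is shown below.  The acquire_ot() and release_ot() routines are assumed wrappers around the token messages sketched earlier, and the comparison itself is elided.

    /* Sketch: collect one dataset's differences in a scratch file and
     * print them only once the OT has been acquired. */
    #include <stdio.h>

    void acquire_ot(void);   /* assumed: blocks until the Manager grants the OT */
    void release_ot(void);   /* assumed: returns the OT to the Manager          */

    void diff_dataset_via_tmpfile(const char *dset_name)
    {
        char  line[512];
        FILE *tmp = tmpfile();   /* scratch file holds this dataset's output */

        /* ... the comparison loop would fprintf each difference to tmp ... */
        fprintf(tmp, "dataset %s: difference details go here\n", dset_name);

        /* Acquire the OT once per dataset, copy the scratch file to the
         * output, then hand the OT back so other Workers may print. */
        acquire_ot();
        rewind(tmp);
        while (fgets(line, sizeof(line), tmp) != NULL)
            fputs(line, stdout);
        fflush(stdout);
        release_ot();

        fclose(tmp);             /* tmpfile() is removed automatically */
    }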

More on Output Token

There are two more solutions to deal with the serialization issue of the OT.  The first is for each Worker to send all output to a temporary file and then print it (or send it to the Manager to print) after acquiring the OT.  When done printing its output file, it releases the OT.  A potential enhancement is for the Worker to tell the Manager where the output file is and let the Manager print the file directly.  This requires a file system that is shared and writable between the Manager and the Workers, which is not always the case.  One cannot assume it is okay to write to the file system where the files being compared reside.  The compared files could very well be read only.

A second solution is for each Worker to have an output buffer that holds difference output temporarily.  When the Worker has completed the comparison of a dataset, it acquires the OT, sends the content of the output buffer to the Manager, and then releases the OT.  If the buffer fills before a dataset comparison is completed, the Worker may either flush the content to a temporary file just like the first solution above, or acquire the OT, flush the buffer content to the Manager, continue the comparison, and send the Manager any further differences found.  When it finishes the comparison of the current dataset, it releases the OT.  This allows all Workers to continue in parallel as far as their local buffers can hold.
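
The buffered variant might look roughly like the sketch below, where the buffer size and the TAG_OUTPUT tag are assumptions and the overflow fallback (spilling to a temporary file or flushing early while holding the OT) is omitted.

    /* Sketch: a Worker accumulates its difference text in a local buffer
     * and ships it to the Manager (rank 0) in one message per dataset. */
    #include <string.h>
    #include <mpi.h>

    #define OUTBUF_SIZE 65536
    #define TAG_OUTPUT  20

    static char   outbuf[OUTBUF_SIZE];
    static size_t outlen = 0;

    /* Append one formatted difference; a real implementation would handle
     * overflow by spilling to a temporary file or flushing early. */
    void buffer_difference(const char *text)
    {
        size_t n = strlen(text);

        if (outlen + n <= sizeof(outbuf)) {
            memcpy(outbuf + outlen, text, n);
            outlen += n;
        }
    }

    /* At the end of a dataset, send the whole buffer to the Manager,
     * which prints it on the Worker's behalf. */
    void flush_buffer_to_manager(void)
    {
        MPI_Send(outbuf, (int)outlen, MPI_CHAR, 0, TAG_OUTPUT,
                 MPI_COMM_WORLD);
        outlen = 0;
    }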

More on Parallel Processing

The Parallel Processing design can be extended so that an assignment covers only a subset of a dataset.  The Manager can then schedule multiple Workers to process different parts of the same dataset in parallel, which has the same effect as the Parallel Data approach.
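
One way to express such an assignment is sketched below; the structure layout, the portion size and the queueing helper are assumptions for illustration only.

    /* Sketch: an assignment that covers only a portion of a dataset, so
     * the Manager can split one large dataset across several Workers. */
    #include <string.h>
    #include "hdf5.h"

    typedef struct {
        char    name[256];   /* path of the dataset inside both files   */
        hsize_t start;       /* first element of the portion to compare */
        hsize_t count;       /* number of elements in this portion;     */
                             /* 0 could mean "the whole dataset"        */
    } diff_assignment_t;

    void enqueue_assignment(const diff_assignment_t *a);  /* assumed queue */

    /* The Manager splits a dataset of total_elems elements into portions
     * of at most `portion` elements and queues one assignment for each. */
    void split_dataset(const char *name, hsize_t total_elems, hsize_t portion)
    {
        diff_assignment_t a;

        strncpy(a.name, name, sizeof(a.name) - 1);
        a.name[sizeof(a.name) - 1] = '\0';
        for (hsize_t s = 0; s < total_elems; s += portion) {
            a.start = s;
            a.count = (s + portion <= total_elems) ? portion
                                                   : total_elems - s;
            enqueue_assignment(&a);
        }
    }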

Summary

The biggest issues for this parallel tool are the coordination of work assignment and the coordination of output, with the latter being the more dominant issue.  The Parallel Processing design is the better one to implement.