NARA - Phase 2


High Performance Storage and Access to Electronic Federal Records with Scientific and Engineering Data

Scalability of HDF5 for geospatial data.
Phase II work in the area of "High-Performance Storage and Access to Geospatial Collections" expanded beyond the study of the suitability and performance of HDF5 for storing federal geospatial data to scalability, and in particular the I/O efficiency of handling this data on Terascale facilities.

A technical report describing the results of this work is:
"Factors Affecting I/O Performance when Accessing Large Arrays in HDF5 on NCSA's TeraGrid Cluster."

HDF5 and engineering data.
During 2006, our research focused primarily on investigating the implications of using HDF5 for federal product data using EXPRESS and STEP. Over the past 12 years, many organizations have used HDF for product model data. The ISO standard format (STEP) for product model data has some shortcomings, particularly in its ability to handle very large datasets, a task for which HDF is especially well suited. The NARA work complements a collaboration between the HDF Group and a European Union (EU) project to assess the viability of using HDF5 as a binary alternative to the text-based STEP format. Much of this work involves mapping the data models for product model data, which are described using the data modeling language EXPRESS, into HDF5. To test the viability of HDF5, data from a number of different domains were converted to HDF5.

If this work is successful, HDF might not only provide an ISO standard for storing STEP data, but the flexibility of HDF may make it possible to supplement this data in ways that would make it more usable for the NARA constituency. For the NARA component of the EXPRESS/STEP activities, the HDF Group used an unclassified collection of files relating to a US naval vessel called the Torpedo Weapons Retriever. This collection contained files in many formats, including STEP files, photos, graphical renderings, schematics, and readme files. The project focused on two areas, referred to below as efforts (a) and (b).

In FY 2006, effort (a) examined structures for handling B-spline and Cartesian point data. This involved mapping EXPRESS to HDF5, converting data from STEP files to HDF5, and assessing the results. HDF5 was shown to be an effective format for storing this kind of data, and was quite efficient when compression was used.
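
As a rough illustration of the kind of storage assessed in effort (a), the sketch below writes a set of Cartesian points to a chunked, gzip-compressed HDF5 dataset with h5py and compares the stored size with the raw size. The (N, 3) layout, the dataset path, the `express_entity` attribute, and the synthetic grid of points are all assumptions made for this example; they are not the EXPRESS-to-HDF5 mapping used in the project.

```python
import h5py
import numpy as np


def write_points(h5file, name, points, gzip_level=4):
    """Store an (N, 3) array of x, y, z coordinates as one compressed dataset."""
    pts = np.asarray(points, dtype=np.float64)
    if pts.ndim != 2 or pts.shape[1] != 3:
        raise ValueError("expected an (N, 3) array of x, y, z coordinates")
    dset = h5file.create_dataset(
        name,
        data=pts,
        chunks=True,                 # let h5py pick a chunk shape
        shuffle=True,                # byte-shuffle filter, often helps gzip on floats
        compression="gzip",
        compression_opts=gzip_level,
    )
    # Hypothetical attribute recording which EXPRESS entity the rows stand for.
    dset.attrs["express_entity"] = "cartesian_point"
    return dset


# Synthetic data: a regular 50 x 50 grid of points in the z = 0 plane.
xs = np.linspace(0.0, 1.0, 50)
grid = np.array([(x, y, 0.0) for x in xs for y in xs])

# In-memory file (core driver, no backing store) so nothing is written to disk.
with h5py.File("points_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = write_points(f, "geometry/cartesian_points", grid)
    f.flush()                                    # push cached chunks into the file image
    stored_bytes = dset.id.get_storage_size()    # bytes used after compression
    raw_bytes = grid.nbytes                      # bytes of the uncompressed array
    roundtrip = dset[...]                        # read everything back
```

Highly regular data such as this grid compresses far better than measured geometry would, so the ratio of `stored_bytes` to `raw_bytes` here says nothing about the figures in the technical report.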

Results of this investigation are found in the technical report:
"Investigations into using HDF5 for product model data -- B-Spline and Cartesian Point data in HDF5."

Effort (b), which was completed in FY 2006, investigated the use of HDF5 as an aggregate container for the entire collection, including image and text files. This required mapping a number of different file types to HDF5, including JPEG, GIF, TIFF, text, and STEP. A container file was created in which records were preserved in their original formats alongside the corresponding HDF5 structures.
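
One way to picture such a container is sketched below with h5py: each record gets its own group holding the original bytes, unchanged, next to a parsed HDF5 representation. The group layout (`records/<name>/original` and `records/<name>/hdf5_form`), the `media_type` attribute, and the toy whitespace-separated point file are hypothetical and chosen only for this example; the container layout actually used in the project is not described here.

```python
import h5py
import numpy as np


def add_record(container, record_name, original_bytes, media_type, parsed=None):
    """Store a record's original bytes and, optionally, a parsed HDF5 form."""
    grp = container.create_group(f"records/{record_name}")
    # Original file content kept byte-for-byte as a 1-D uint8 dataset.
    raw = grp.create_dataset(
        "original", data=np.frombuffer(original_bytes, dtype=np.uint8)
    )
    raw.attrs["media_type"] = media_type
    if parsed is not None:
        # Native HDF5 structure derived from the same record.
        grp.create_dataset("hdf5_form", data=parsed)
    return grp


# Toy stand-in for a source file: three points as plain text (not real STEP).
original_bytes = b"0.0 0.0 0.0\n1.0 0.0 0.0\n1.0 2.0 0.5\n"
parsed = np.array(
    [[float(v) for v in line.split()]
     for line in original_bytes.decode("ascii").splitlines()]
)

# In-memory file so the example leaves nothing on disk.
with h5py.File("container_demo.h5", "w", driver="core", backing_store=False) as f:
    add_record(f, "points_txt", original_bytes, "text/plain", parsed)
    grp = f["records/points_txt"]
    recovered_bytes = grp["original"][...].tobytes()   # original, bit-identical
    recovered_array = grp["hdf5_form"][...]            # structured form
    media_type = grp["original"].attrs["media_type"]
```

Keeping both forms trades storage space for the ability to hand back the exact original file while still offering array-style access to its contents.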

Currently, our work focuses on the effectiveness of using HDF5 for finite element data from engineering records, this time from another collection provided to us by one of our collaborators on the EU project.

HDF5 and the Storage Resource Broker (SRB).
We also investigated performance implications involving access to HDF5 data where the data is stored in a Storage Resource Broker. This was part of a project to investigate the implementation of object-level access to HDF-based collections that are stored in the SRB, an approach designed to improve access to complex data from distributed repositories. Results indicate that the HDF5-SRB configuration can be very efficient in reducing subsampling time as compared to transferring whole files and then subsampling.
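
The idea behind object-level subsampling can be sketched locally with h5py: a strided selection is requested from a dataset, and only the selected elements are returned to the caller. This example does not involve the SRB or any network transfer; the dataset name, array size, chunk shape, and step are arbitrary choices for illustration. In an HDF5-SRB setting the selection would presumably be evaluated where the data is stored, so that only the subsample crosses the network.

```python
import h5py
import numpy as np


def read_subsample(dset, step):
    """Read every `step`-th element along both axes of a 2-D dataset."""
    return dset[::step, ::step]


# Synthetic 100 x 200 array standing in for a large stored grid.
full = np.arange(100 * 200, dtype=np.float32).reshape(100, 200)

# In-memory file so the example leaves nothing on disk.
with h5py.File("subsample_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("grid", data=full, chunks=(10, 20))
    sub = read_subsample(dset, step=10)
    full_bytes = dset.size * dset.dtype.itemsize   # size of the whole dataset
    sub_bytes = sub.nbytes                         # size of what the caller receives
```

With a step of 10 on each axis the caller receives one hundredth of the elements, which is the quantity that matters when the alternative is transferring the whole file before subsampling.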

See the report "Integration of HDF5 and the SRB for Object-level Data Access."