RFC: h5copy and h5repack

Peter Cao, Pedro Vicente, August 18, 2006

Introduction

This RFC describes the tool h5copy, a command line tool that uses the HDF5 API H5Gcopy function to copy an object (group, dataset or named datatype) from one location to another location within a file or across files. It also describes the inclusion of this function in the h5repack tool.

H5Gcopy Function

The H5Gcopy() function is an HDF5 API function to be introduced in the coming release of HDF5 1.8.0. This function copies one or more HDF5 objects from a source location to a destination location within a file or across files. It is described below.

Name: H5Gcopy

Signature:

herr_t H5Gcopy( hid_t src_loc_id, const char *src_name, hid_t dst_loc_id, const char *dst_name, hid_t cplist_id )

Purpose:

Copies a group, dataset, or named datatype.

Description:

H5Gcopy copies the object specified by src_name from the file or group specified by src_loc_id to the destination location dst_loc_id. The new copy will be created with the name dst_name.

The object being copied can be a group, dataset or named datatype.

The destination location, as specified in dst_loc_id, may be a group in the current file or a location in a different file. If dst_loc_id is a file identifier, the copy will be placed in that file’s root group. H5Gcopy() will fail if an object named dst_name already exists in the destination group. For example, H5Gcopy(fid_src, "/dset", fid_dst, "/dset", ...) will fail if "/dset" exists in the destination file. Note: In this case, we may consider a flag that forces the destruction of the old dataset.

The new copy of the object is created with the creation property list specified by cplist_id.

Several flags are available to govern the behavior of H5Gcopy. These flags are set in the creation property list cplist_id with H5Pset_copy_object.

Parameters:

hid_t src_loc_id        Object identifier indicating the location of the source object
const char *src_name    Name of the object to be copied
hid_t dst_loc_id        Location identifier specifying the destination
const char *dst_name    Name to be assigned to the new copy
hid_t cplist_id         Creation property list of the copy

Returns:

Returns a non-negative value if successful; otherwise returns a negative value.

Fortran90 Interface:

None.

History:

Release    C
1.8.0      Function introduced in this release.

 

Several flags are available to govern object copying behavior and are described in the following table. These flags are set via H5Pset_copy_object.

 

H5G_COPY_SHALLOW_HIERARCHY_FLAG

If this flag is specified, only immediate members of the group are copied. Otherwise (default), all objects below the group are copied recursively.

H5G_COPY_EXPAND_SOFT_LINK_FLAG

If this flag is specified, the objects pointed to by soft links are copied. Otherwise (default), soft links are copied as they are.

H5G_COPY_EXPAND_EXT_LINK_FLAG

If this flag is specified, external links are expanded into new objects. Otherwise (default), external links are kept as they are.

H5G_COPY_EXPAND_REFERENCE_FLAG

There are two separate cases

  1. Copy object between two different files: When this flag is specified, H5Gcopy() copies the objects pointed to by the references and updates the values of the references in the destination file. Otherwise (default), the values of references in the destination are set to zero. The current implementation does not handle references nested inside other datatype structures. For example, if a member of a compound datatype is a reference, H5Gcopy() will copy that field as it is. It will neither set the value to zero when the default is used nor copy the object pointed to by that field when the flag is set.
  2. Copy object within the same file: This flag has no effect on H5Gcopy(). Datasets or attributes of references are copied as they are, i.e., the values of references in the destination object are the same as in the source object.

H5G_COPY_WITHOUT_ATTR_FLAG

If this flag is specified, the object is copied without its attributes. Otherwise (default), the object is copied along with all its attributes.

H5G_COPY_ALL

Note: see the note on the corresponding h5copy flag (allflags) below.

Switches all flags from the default to the non-default setting.

h5copy command line tool

h5copy is a command line tool that parses several input parameters that translate to the above H5Gcopy function parameters. The usage of the tool is:

h5copy [OPTIONS] [OBJECTS...]

OBJECTS

A pair of HDF5 file names (input and output) and a pair of HDF5 object names (input and output). Each name is given with its own switch, and each switch has a short and a long form; either form can be used. The syntax is:

Switch (short name)   Switch (long name)    Meaning
-i name               -input name           Input HDF5 file name
-o name               -output name          Output HDF5 file name (existing or non-existing)
-s name               -source name          Source object name
-d name               -destination name     Destination object name


OPTIONS

Switch (short name)   Switch (long name)    Meaning
-h                    -help                 Print a usage message and exit
-v                    -verbose              Print information about OBJECTS and OPTIONS
-f                    -flag                 Flag type
-V                    -Version              Version information


Flag type is one of the following strings:

String     Meaning                                                          Corresponding API symbol in H5Gcopy()
shallow    Copy only immediate members for groups                           H5G_COPY_SHALLOW_HIERARCHY_FLAG
soft       Expand soft links into new objects                               H5G_COPY_EXPAND_SOFT_LINK_FLAG
ext        Expand external links into new objects                           H5G_COPY_EXPAND_EXT_LINK_FLAG
ref        Copy objects that are pointed to by object references            H5G_COPY_EXPAND_REFERENCE_FLAG
noattr     Copy object without copying attributes                           H5G_COPY_WITHOUT_ATTR_FLAG
allflags   Switches all flags from the default to the non-default setting   H5G_COPY_ALL

Note on allflags: This flag would also switch on any flags added in the future, which may not be desirable. It may not be implemented for this reason.
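The mapping from flag strings to copy flags could be implemented as a small lookup table. The following is a minimal sketch, not the actual h5copy code: the COPY_* names, their numeric bit values, and the helper flag_from_string are assumptions made for illustration, since this RFC does not define the values of the H5G_COPY_* symbols. A real tool would use the symbols from the HDF5 headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder bit values (assumptions for this sketch only); they stand in
 * for the H5G_COPY_* symbols listed in the table. */
#define COPY_SHALLOW_HIERARCHY 0x01u
#define COPY_EXPAND_SOFT_LINK  0x02u
#define COPY_EXPAND_EXT_LINK   0x04u
#define COPY_EXPAND_REFERENCE  0x08u
#define COPY_WITHOUT_ATTR      0x10u
/* "All" is the OR of the flags known today; any flag added later would
 * silently change what "allflags" means. */
#define COPY_ALL (COPY_SHALLOW_HIERARCHY | COPY_EXPAND_SOFT_LINK | \
                  COPY_EXPAND_EXT_LINK | COPY_EXPAND_REFERENCE |   \
                  COPY_WITHOUT_ATTR)

typedef struct {
    const char *name;  /* string accepted as the flag type */
    unsigned    flag;  /* corresponding copy flag */
} flag_entry;

static const flag_entry flag_table[] = {
    { "shallow",  COPY_SHALLOW_HIERARCHY },
    { "soft",     COPY_EXPAND_SOFT_LINK  },
    { "ext",      COPY_EXPAND_EXT_LINK   },
    { "ref",      COPY_EXPAND_REFERENCE  },
    { "noattr",   COPY_WITHOUT_ATTR      },
    { "allflags", COPY_ALL               },
};

/* Stores the flag for string s in *out and returns 0,
 * or returns -1 if s is not a known flag type. */
int flag_from_string(const char *s, unsigned *out)
{
    size_t i;

    assert(s != NULL && out != NULL);
    for (i = 0; i < sizeof flag_table / sizeof flag_table[0]; i++) {
        if (strcmp(s, flag_table[i].name) == 0) {
            *out = flag_table[i].flag;
            return 0;
        }
    }
    return -1;
}
```

Defining the "all" value as the OR of the currently known flags makes the concern in the note concrete: adding a flag later changes what allflags does for existing users.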

Parsing of the file and object names

To parse the file and object names, several options are possible. We review here the methods used by current HDF5 command line tools that have to parse a pair of file names and a pair of object names. These are h5diff (compares two HDF5 files) and h5repack (regenerates one HDF5 file to a new one using filter and layout options). We also consider the parsing method used by h5ls and h5dump, which parse only one file name and object name pair, but which can be adapted to two pairs.

h5ls

In h5ls, each file name is followed by a slash and an object name within the file. For example:

./h5ls test1.h5/array

The split between file name and object name is determined by calling H5Fopen repeatedly until it succeeds. The first call uses the entire name and each subsequent call chops off the last component. If we reach the beginning of the name, then there must have been something wrong with the file (perhaps it doesn't exist). For example, if /a/b/c is a valid HDF5 file and /d/e is the path of the object to be copied, the full string “/a/b/c/d/e” will be tried as the file name. After it fails, “/a/b/c/d” will be tested as the file name, and so on, until it finds the file name /a/b/c.
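The chopping procedure can be sketched as follows. This is an illustration under stated assumptions, not the h5ls source: the function split_file_object and the predicate demo_is_file are hypothetical names, and the predicate stands in for "H5Fopen succeeds on this name" so that the sketch does not depend on the HDF5 library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for "H5Fopen succeeds on this name" (assumption of the sketch). */
typedef int (*file_test_fn)(const char *name);

/* Demo predicate: pretends that /a/b/c is the only valid HDF5 file. */
int demo_is_file(const char *name)
{
    return strcmp(name, "/a/b/c") == 0;
}

/*
 * Try the whole string as a file name, then chop off the last
 * '/'-separated component and try again, until is_file accepts a prefix.
 * On success the file name is copied into file_out (capacity cap) and a
 * pointer to the remaining object path inside `path` is returned.
 * Returns NULL if no prefix is accepted or the path does not fit.
 */
const char *split_file_object(const char *path, file_test_fn is_file,
                              char *file_out, size_t cap)
{
    size_t len;

    assert(path != NULL && is_file != NULL && file_out != NULL);
    len = strlen(path);
    if (len >= cap)
        return NULL;

    while (len > 0) {
        memcpy(file_out, path, len);
        file_out[len] = '\0';
        if (is_file(file_out))
            return path + len;          /* rest of the string is the object */

        /* Back up to the previous '/' ... */
        while (len > 0 && path[len - 1] != '/')
            len--;
        /* ... and drop the slash itself. */
        if (len > 0)
            len--;
    }
    return NULL;                        /* reached the beginning of the name */
}
```

With the demo predicate, the string "/a/b/c/d/e" is tried as "/a/b/c/d/e", then "/a/b/c/d", then "/a/b/c", at which point the file name is "/a/b/c" and the object path is "/d/e".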

h5diff

h5diff takes a sequence of two file names followed by two object names:

./h5diff file1 file2 object1 object2

h5repack

h5repack uses yet another parsing method. It uses -i and -o for the input and output files, and the object names are parsed from a special grammar that includes the filter names and parameters:

./h5repack -i file1 -o file2 -f dset1:SZIP=8,NN

Object names are comma separated, so this method can be used for multiple objects.

h5dump

The h5dump tool uses a -d flag to specify the object name (in this case only one object name and one file name):

./h5dump -d dset1 file.h5

h5copy

The proposed syntax for parsing in h5copy is

./h5copy -i input_file -o output_file -s source_object -d destination_object
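A minimal sketch of how the four name switches might be parsed is shown below. It is not the h5copy implementation: the struct h5copy_args and the function parse_h5copy_args are hypothetical names, and the OPTIONS switches (-h, -v, -f, -V) are left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *input_file;   /* -i / -input */
    const char *output_file;  /* -o / -output */
    const char *src_object;   /* -s / -source */
    const char *dst_object;   /* -d / -destination */
} h5copy_args;

/* True if arg equals either the short or the long form of a switch. */
static int matches(const char *arg, const char *short_name,
                   const char *long_name)
{
    return strcmp(arg, short_name) == 0 || strcmp(arg, long_name) == 0;
}

/* Returns 0 when all four names were supplied, -1 otherwise. */
int parse_h5copy_args(int argc, const char *const argv[], h5copy_args *out)
{
    int i;

    assert(argv != NULL && out != NULL);
    out->input_file = out->output_file = NULL;
    out->src_object = out->dst_object = NULL;

    for (i = 1; i < argc; i++) {
        const char **slot;

        if (matches(argv[i], "-i", "-input"))
            slot = &out->input_file;
        else if (matches(argv[i], "-o", "-output"))
            slot = &out->output_file;
        else if (matches(argv[i], "-s", "-source"))
            slot = &out->src_object;
        else if (matches(argv[i], "-d", "-destination"))
            slot = &out->dst_object;
        else
            return -1;            /* unknown switch in this reduced sketch */

        if (++i >= argc)
            return -1;            /* switch given without a name */
        *slot = argv[i];
    }

    return (out->input_file && out->output_file &&
            out->src_object && out->dst_object) ? 0 : -1;
}
```

Because every name is tagged by its switch, the four names can be given in any order, which is not the case for the positional h5diff style.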

Testing

Testing is done through a UNIX shell script that executes several runs. These will include a subset of the cases in the existing C program that tests the H5Gcopy function (currently in /test/objcopy). The correctness of the copied object will be verified by calling h5diff in the test script, the same way the h5repack test script does.

1. Simple dataset
2. Chunked dataset
3. Compact dataset
4. Compound dataset
5. Compressed dataset
6. Dataset with named VLEN datatype
7. Dataset with nested VLEN datatype
8. Dataset with VLEN datatype
9. Dataset that uses named datatype
10. Named datatype
11. Nested groups with loop
12. Expand object reference. Note: when references are expanded, the later file comparison by h5diff needs a special case, because h5diff will detect that the files are different.
13. Expand soft link
14. Shallow group copy
15. Without attributes
16. Deep group compare

A benchmark comparing h5copy with the H5Dread/H5Dwrite functions will also be run. Items that could be tested are:

1. Large number of empty or small groups
2. Large number of attributes
3. Large dataset with different compressions
4. Chunked datasets
5. Datasets with variable-length datatype
6. Dataset that is too large to fit into memory

Inclusion of H5Gcopy in h5repack

h5repack is a command line tool that regenerates one HDF5 file to a new one using filter and layout options. All the objects from the original file are read into memory using the API function H5Dread and saved to a new file using H5Dwrite.

Using H5Gcopy in h5repack will take advantage of fast low level object copy, especially in:

• Copying metadata such as attributes, data types, data spaces, and property lists. Because the object header information is copied directly from the source to the destination, I/O operations are saved. For example, copying a simple attribute the traditional, higher-level way requires creating a data type and a data space. With H5Gcopy this is all done in one operation.

• Copying compressed raw data. Bytes of compressed raw data are copied directly from the source to the destination. With the high-level H5Dread and H5Dwrite routines, the data must be uncompressed when it is read from the source file and compressed again when it is written to the destination. These uncompress/compress operations can be very time consuming.

h5repack options


h5repack allows the user to change data layout and compression properties. H5Gcopy does not support such options. When a different data layout or compression is requested in h5repack for a dataset, H5Gcopy cannot be used to copy that dataset. In such a case, the existing h5repack method (H5Dread/H5Dwrite) will be used.
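The per-dataset choice described above could look like the following sketch. The types repack_request and copy_method and the function choose_copy_method are hypothetical names introduced for illustration; how h5repack actually records the requested options is not specified in this RFC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-dataset request, as parsed from the h5repack options. */
typedef struct {
    int layout_change;  /* non-zero: a new data layout was requested */
    int filter_change;  /* non-zero: a new filter/compression was requested */
} repack_request;

typedef enum {
    USE_H5GCOPY,        /* fast low-level object copy */
    USE_READ_WRITE      /* existing H5Dread/H5Dwrite path */
} copy_method;

/* H5Gcopy cannot change layout or compression, so it is chosen only when
 * the dataset is to be copied unchanged. */
copy_method choose_copy_method(const repack_request *req)
{
    assert(req != NULL);
    if (req->layout_change || req->filter_change)
        return USE_READ_WRITE;
    return USE_H5GCOPY;
}
```

Under this scheme a single h5repack run may mix both paths, using H5Gcopy for untouched objects and the read/write path only for datasets named in the filter or layout options.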
 

Benchmarks


We will need a set of tests to determine how much H5Gcopy improves h5repack. Here is a list of interesting cases for a benchmark study:

1. Large number of empty or small groups
2. Large number of attributes
3. Large dataset with different compressions
4. Chunked datasets
5. Datasets with variable-length datatype
6. Dataset that is too large to fit into memory


Last updated: 09/08/2006

Send comments to hdfhelp@ncsa.uiuc.edu