RFC: h5copy and h5repack
Peter Cao, Pedro Vicente, August 18, 2006
This RFC describes the tool h5copy, a command line tool that uses the HDF5 API H5Gcopy function to copy an object (group, dataset or named datatype) from one location to another location within a file or across files. It also describes the inclusion of this function in the h5repack tool.
H5Gcopy() is an HDF5 API function to be introduced in the coming HDF5 1.8.0 release. It copies one or more HDF5 objects from a source location to a destination location within a file or across files. It is described below.
Signature:
herr_t H5Gcopy(hid_t src_loc_id, const char *src_name, hid_t dst_loc_id, const char *dst_name, hid_t cplist_id)
Purpose:
Copies a group, dataset, or named datatype.
Description:
H5Gcopy copies the object specified by src_name from the file or group specified by src_loc_id to the destination location dst_loc_id. The new copy will be created with the name dst_name. The object being copied can be a group, dataset or named datatype.
The destination location, as specified in dst_loc_id, may be a group in the current file or a location in a different file. If dst_loc_id is a file identifier, the copy will be placed in that file’s root group. H5Gcopy() will fail if an object named dst_name already exists in the destination group. For example, H5Gcopy(fid_src, "/dset", fid_dst, "/dset", ...) will fail if "/dset" exists in the destination file.
Note: In this case, we may consider a flag that forces the destruction of the old dataset.
The new copy of the object is created with the creation property list specified by cplist_id. Several flags are available to govern the behavior of H5Gcopy. These flags are set in the creation property list cplist_id with H5Pset_copy_object.
Parameters:
hid_t src_loc_id | Object identifier indicating the location of the source object
const char *src_name | Name of the object to be copied
hid_t dst_loc_id | Location identifier specifying the destination
const char *dst_name | Name to be assigned to the new copy
hid_t cplist_id | Creation property list of the copy
Returns:
Returns a non-negative value if successful; otherwise returns a negative value.
Fortran90 Interface:
None.
History:
Release | C
1.8.0 | Function introduced in this release.
Several flags are available to govern object copying behavior and are described in the following table. These flags are set via H5Pset_copy_object.
H5G_COPY_SHALLOW_HIERARCHY_FLAG | If this flag is specified, only immediate members of the group are copied. Otherwise (default), all objects below the group are copied recursively.
H5G_COPY_EXPAND_SOFT_LINK_FLAG | If this flag is specified, the objects pointed to by soft links are copied. Otherwise (default), soft links are copied as they are.
H5G_COPY_EXPAND_EXT_LINK_FLAG | If this flag is specified, external links are expanded into new objects. Otherwise (default), external links are kept as they are.
H5G_COPY_EXPAND_REFERENCE_FLAG | There are two separate cases
H5G_COPY_WITHOUT_ATTR_FLAG | If this flag is specified, the object is copied without its attributes. Otherwise (default), the object is copied along with all its attributes.
H5G_COPY_ALL | Switches all flags from the default to the non-default setting. Note: see the note on the corresponding h5copy flag below.
h5copy is a command line tool that parses several input parameters that translate to the above H5Gcopy function parameters. The usage of the tool is:
h5copy [OPTIONS] [OBJECTS...]
OBJECTS
A pair of HDF5 file names (input and output) and a pair of HDF5 object names (input and output). Each name is given by its own switch, which has a short and a long form; either form can be used. The syntax is:
Switch (short name) | Switch (long name) | Meaning
-i name | -input name | Input HDF5 file name
-o name | -output name | Output HDF5 file name (existing or non-existing)
-s name | -source name | Source object name
-d name | -destination name | Destination object name
OPTIONS
Switch (short name) | Switch (long name) | Meaning
-h | -help | Print a usage message and exit
-v | -verbose | Print information about OBJECTS and OPTIONS
-f | -flag | Flag type
-V | -Version | Version information
Flag type is one of the following strings:
String | Meaning | Corresponding API symbol in H5Gcopy()
shallow | Copy only immediate members for groups | H5G_COPY_SHALLOW_HIERARCHY_FLAG
soft | Expand soft links into new objects | H5G_COPY_EXPAND_SOFT_LINK_FLAG
ext | Expand external links into new objects | H5G_COPY_EXPAND_EXT_LINK_FLAG
ref | Copy objects that are pointed to by object references | H5G_COPY_EXPAND_REFERENCE_FLAG
noattr | Copy object without copying attributes | H5G_COPY_WITHOUT_ATTR_FLAG
allflags | Switches all flags from the default to the non-default setting. Note: this flag would also switch any flags added in the future, which may not be desirable. It may be left unimplemented for that reason. | H5G_COPY_ALL
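The mapping from flag-type strings to copy flags can be sketched in C as follows. This is only an illustration under stated assumptions: the DEMO_ constants and their bit values are placeholders invented here (this RFC names the API symbols but does not give their values), and flag_from_string is a hypothetical helper, not part of h5copy.

```c
#include <string.h>

/* Placeholder bit values, invented for this sketch; they stand in for the
 * API symbols listed in the table above. */
enum {
    DEMO_COPY_SHALLOW = 0x01,  /* H5G_COPY_SHALLOW_HIERARCHY_FLAG */
    DEMO_COPY_SOFT    = 0x02,  /* H5G_COPY_EXPAND_SOFT_LINK_FLAG  */
    DEMO_COPY_EXT     = 0x04,  /* H5G_COPY_EXPAND_EXT_LINK_FLAG   */
    DEMO_COPY_REF     = 0x08,  /* H5G_COPY_EXPAND_REFERENCE_FLAG  */
    DEMO_COPY_NOATTR  = 0x10,  /* H5G_COPY_WITHOUT_ATTR_FLAG      */
    DEMO_COPY_ALL     = 0x1F   /* H5G_COPY_ALL: every bit above   */
};

/*
 * Hypothetical helper: OR the bit for the flag-type string `s` into *mask.
 * Returns 0 on success, -1 if the string is not a known flag type
 * (in which case *mask is left unchanged).
 */
static int flag_from_string(const char *s, unsigned *mask)
{
    static const struct {
        const char *name;
        unsigned    bit;
    } table[] = {
        { "shallow",  DEMO_COPY_SHALLOW },
        { "soft",     DEMO_COPY_SOFT    },
        { "ext",      DEMO_COPY_EXT     },
        { "ref",      DEMO_COPY_REF     },
        { "noattr",   DEMO_COPY_NOATTR  },
        { "allflags", DEMO_COPY_ALL     }
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(s, table[i].name) == 0) {
            *mask |= table[i].bit;
            return 0;
        }
    }
    return -1;
}
```

Because the bits are ORed into the mask, the helper can be called once per -f switch, so several flag types can be combined in one run.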
There are several possible ways to parse the file and object names. We review here the methods used by current HDF5 command line tools that have to parse a pair of file names and a pair of object names. These are h5diff (compares two HDF5 files) and h5repack (regenerates one HDF5 file to a new one using filter and layout options). We also consider the parsing method used by h5ls and h5dump, which parse only one file name and one object name but can be adapted to two pairs.
In h5ls each file name is followed by a slash and an object name within the file. For example
./h5ls test1.h5/array
The split between file name and object name is determined by calling H5Fopen repeatedly until it succeeds. The first call uses the entire name and each subsequent call chops off the last component. If we reach the beginning of the name then there must have been something wrong with the file (perhaps it doesn't exist). For example, if /a/b/c is a valid HDF5 file and /d/e is the path of the object to be copied, the full string “/a/b/c/d/e” will be tried as the file name. After it fails, “/a/b/c/d” will be tested as the file name, and so on, until it finds the file name /a/b/c.
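The search described above can be sketched in C as follows. This is an illustration, not the h5ls source: split_file_and_object is a hypothetical helper, and the can_open callback stands in for the H5Fopen call so that the sketch does not depend on the HDF5 library. demo_can_open is a stand-in that pretends only /a/b/c is a valid HDF5 file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: nonzero if `path` can be opened as an HDF5 file.
 * In h5ls this role is played by H5Fopen. */
typedef int (*can_open_fn)(const char *path);

/* Stand-in for the example in the text: only /a/b/c "opens". */
static int demo_can_open(const char *path)
{
    return strcmp(path, "/a/b/c") == 0;
}

/*
 * Try ever-shorter prefixes of `name` as the file name. On success, the
 * file name is stored in `file` (capacity `cap`), *object points at the
 * rest of `name` (or at "/" when the whole name is the file), and 0 is
 * returned. Returns -1 if no prefix opens or `file` is too small.
 */
static int split_file_and_object(const char *name, can_open_fn can_open,
                                 char *file, size_t cap, const char **object)
{
    size_t len = strlen(name);

    if (len + 1 > cap)
        return -1;
    memcpy(file, name, len + 1);

    while (len > 0) {
        if (can_open(file)) {
            *object = (name[len] != '\0') ? name + len : "/";
            return 0;
        }
        /* Chop off the last component: back up to the previous '/'. */
        while (len > 0 && file[len - 1] != '/')
            len--;
        /* Drop the separator (or separators) as well. */
        while (len > 0 && file[len - 1] == '/')
            len--;
        file[len] = '\0';
    }
    return -1; /* reached the beginning of the name */
}
```

With the stand-in predicate, the name /a/b/c/d/e is split into the file /a/b/c and the object /d/e after two failed attempts, matching the walk-through in the text.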
h5diff takes a positional sequence of two file names followed by two object names
./h5diff file1 file2 object1 object2
h5repack uses yet another parsing method. It uses the -i and -o switches for the input and output files, and the object names are parsed from a special grammar that includes the filter names and parameters
./h5repack -i file1 -o file2 -f dset1:SZIP=8,NN
Object names are comma separated, so this method can be used for multiple objects.
The h5dump tool uses a -d flag to specify the object name (in this case only one object and file name)
./h5dump -d dset1 file.h5
The proposed syntax for parsing in h5copy is
./h5copy -i input_file -o output_file -s source_object -d destination_object
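A minimal sketch of how the four name switches could be parsed is shown below. It is an assumption-laden illustration rather than the tool's implementation: struct h5copy_args, is_switch and parse_h5copy_args are hypothetical names, only the -i, -o, -s and -d switches (and their long forms from the OBJECTS table) are handled, and the OPTIONS switches are left out.

```c
#include <stddef.h>
#include <string.h>

/* Names collected from the command line; NULL means "not given". */
struct h5copy_args {
    const char *input_file;
    const char *output_file;
    const char *source_object;
    const char *destination_object;
};

/* Nonzero if `arg` matches the short or the long spelling of a switch. */
static int is_switch(const char *arg, const char *short_name,
                     const char *long_name)
{
    return strcmp(arg, short_name) == 0 || strcmp(arg, long_name) == 0;
}

/*
 * Parse "-i name -o name -s name -d name" in any order.
 * Returns 0 when all four names are present; returns -1 on an unknown
 * switch, a switch without a name, or a missing name.
 */
static int parse_h5copy_args(int argc, const char *const argv[],
                             struct h5copy_args *out)
{
    int i;

    out->input_file = NULL;
    out->output_file = NULL;
    out->source_object = NULL;
    out->destination_object = NULL;

    for (i = 1; i < argc; i++) {
        const char **slot;

        if (is_switch(argv[i], "-i", "-input"))
            slot = &out->input_file;
        else if (is_switch(argv[i], "-o", "-output"))
            slot = &out->output_file;
        else if (is_switch(argv[i], "-s", "-source"))
            slot = &out->source_object;
        else if (is_switch(argv[i], "-d", "-destination"))
            slot = &out->destination_object;
        else
            return -1;      /* unknown switch */

        if (i + 1 >= argc)
            return -1;      /* switch without a name */
        *slot = argv[++i];
    }

    return (out->input_file && out->output_file &&
            out->source_object && out->destination_object) ? 0 : -1;
}
```

Since every name carries its own switch, the four arguments can appear in any order, which is the main difference from the positional h5diff style.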
Testing is done through a UNIX shell script that executes several runs. These will include a subset of the cases in the existing C program that tests the H5Gcopy function (currently in /test/objcopy). The correctness of each copied object will be verified by calling h5diff from the test script, the same way the h5repack test script does. The runs cover the following cases:
1. Simple dataset
2. Chunked dataset
3. Compact dataset
4. Compound dataset
5. Compressed dataset
6. Dataset with named VLEN datatype
7. Dataset with nested VLEN datatype
8. Dataset with VLEN datatype
9. Dataset that uses named datatype
10. Named datatype
11. Nested groups with loop
12. Expand object reference
13. Expand soft link
14. Shallow group copy
15. Without attributes
16. Deep group compare
A benchmark comparing h5copy with a copy done through the H5Dread/H5Dwrite functions will also be made. Items that could be tested are:
1. Large number of empty or small groups
2. Large number of attributes
3. Large dataset with different compressions
4. Chunked datasets
5. Datasets with variable-length datatype
6. Dataset too large to fit into memory
h5repack is a command line tool that regenerates one HDF5 file to a new one using filter and layout options. All the objects from the original file are read into memory using the API function H5Dread and saved to a new file using H5Dwrite.
Using H5Gcopy in h5repack will take advantage of fast low-level object copy, especially in:
• Copying metadata such as attributes, data type, data space, and property lists. Since the object header information is copied directly from the source to the destination, it saves I/O operations. For example, copying a simple attribute the traditional, higher-level way requires creating a data type and a data space first. With H5Gcopy this is all done in one operation.
• Copying compressed raw data. Bytes of compressed raw data are copied directly from the source to the destination. The high-level H5Dread and H5Dwrite routines must uncompress the data to read it from the source file and compress it again to write it into the destination. These uncompress/compress operations can be very time consuming.
h5repack allows the user to change data layout and compression properties. H5Gcopy does not support such options. When a different data layout or compression is requested for a dataset in h5repack, H5Gcopy cannot be used to copy that dataset. In such a case, the existing H5Dread/H5Dwrite path of h5repack will be used.
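The choice between the two copy paths can be sketched in C as follows. The struct repack_request type, its fields, and choose_copy_path are hypothetical names introduced for illustration; how h5repack actually records the requested filter and layout changes is not specified in this RFC.

```c
#include <stddef.h>

/* Hypothetical summary of what the user requested for one dataset. */
struct repack_request {
    int change_filter;  /* nonzero if a new compression/filter was requested */
    int change_layout;  /* nonzero if a new data layout was requested */
};

enum copy_path {
    COPY_WITH_H5GCOPY,      /* fast low-level object copy */
    COPY_WITH_READ_WRITE    /* existing H5Dread/H5Dwrite path */
};

/*
 * Pick the copy path for one object. H5Gcopy is usable only when neither
 * the layout nor the compression changes; a NULL request means "nothing
 * was requested for this object".
 */
static enum copy_path choose_copy_path(const struct repack_request *req)
{
    if (req != NULL && (req->change_filter || req->change_layout))
        return COPY_WITH_READ_WRITE;
    return COPY_WITH_H5GCOPY;
}
```

Making the decision per object lets a single h5repack run mix both paths: unchanged objects take the fast copy while re-filtered datasets go through the read/write route.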
We will need a set of tests to determine how much H5Gcopy improves h5repack. Here is a list of interesting cases for a benchmark study.
1. Large number of empty or small groups
2. Large number of attributes
3. Large dataset with different compressions
4. Chunked datasets
5. Datasets with variable-length datatype
6. Dataset too large to fit into memory
Last updated 09/08/2006
Send comments to hdfhelp@ncsa.uiuc.edu