1 TOOL NAME:
    h5import

2 SYNTAX:
    h5import -h[elp], OR
    h5import [ ...] -o[utfile]

3 PURPOSE:
    To convert data stored in one or more ASCII or binary files into one
    or more datasets (in accordance with the user-specified type and
    storage properties) in an existing or new HDF5 file.

4 DESCRIPTION:
    The primary objective of the utility is to convert floating point or
    integer data stored in ASCII text or binary form into a data-set
    according to the type and storage properties specified by the user.
    The utility can also accept ASCII text files and store the contents
    in a compact form as a one-dimensional array of strings (not
    implemented in this version).

    The input data to be written as a data-set can be provided to the
    utility in one of the following forms:
      1. ASCII text file with numeric data (floating point or integer
         data).
      2. Binary file with native floating point data (32-bit or 64-bit).
      3. Binary file with native integer (signed or unsigned) data
         (8-bit, 16-bit, 32-bit or 64-bit).
      4. ASCII text file containing strings (text data). (Not implemented)

    Every input file can be associated with options that specify the
    data type and storage properties. There are two ways of specifying
    the options:
      1. command-line arguments, or
      2. a configuration file.

    For simple data-sets, command-line arguments may be used. With
    command-line arguments, only the class, size and dimensions of the
    input data and the path of the output data-set can be specified. The
    advanced storage features can only be specified in a configuration
    file, which is also provided as an input to the utility (see Section
    5.B. "CONFIGURATION FILE" for how it is to be organised). The
    configuration file is the recommended way of specifying options.

    The dimension sizes are the only required parameters in both ways of
    specifying options. Defaults for all other parameters exist in both
    forms.
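    As an illustration of the first two input forms listed above, the
    following Python sketch writes the same 2x3x4 array once as an ASCII
    text file in fixed notation and once as a native 64-bit floating
    point binary file. The file names, the whitespace-separated text
    layout and the row-by-row element order are assumptions made for
    this example; they are not prescribed by this help text.

```python
import array
import os
import tempfile


def write_text_fp(path, values, per_line=4):
    """Write floats as ASCII text in fixed notation (e.g. 323.56).

    Whitespace-separated values are an assumption of this sketch.
    """
    with open(path, "w") as f:
        for i in range(0, len(values), per_line):
            row = values[i:i + per_line]
            f.write(" ".join(f"{v:.2f}" for v in row) + "\n")


def write_native_fp64(path, values):
    """Write floats as native binary C doubles and return the byte count."""
    data = array.array("d", values)
    with open(path, "wb") as f:
        data.tofile(f)
    return data.itemsize * len(data)


workdir = tempfile.mkdtemp()
values = [i * 0.25 for i in range(2 * 3 * 4)]  # a 2x3x4 array, flattened

text_path = os.path.join(workdir, "infile.txt")  # hypothetical file name
bin_path = os.path.join(workdir, "infile.bin")   # hypothetical file name

write_text_fp(text_path, values)
nbytes = write_native_fp64(bin_path, values)
print(text_path, bin_path, nbytes)
```

    Under these assumptions the text file would be imported with an
    input class of TEXTFP and the binary file with an input class of FP
    and a size of 64, in both cases with the dimensions 2,3,4.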
    It should be noted that exactly one of the two ways of specifying
    the options may be used for every input file, and any number of
    input files can be specified in one command.

    The floating point data in the ASCII text file may be organized in
    the fixed floating form (for example 323.56) or in scientific
    notation (for example 3.23E+02). A different input-class
    specification is to be used for each form. (Note: Only the fixed
    form floating point version has been implemented in this version.)

    The utility extracts the input data from the input file according to
    the specified parameters and saves it into an H5 dataset.

    The user can specify the output type and storage properties in the
    configuration file. The user can also specify the path of the
    dataset. If the groups in the path leading to the data-set do not
    exist, they will be created by the utility. If no group is
    specified, the dataset will be created under the root group. If no
    path is specified, the dataset is created as 'dataset1'.

    The user can also provide the class and size of the output data to
    be written to the dataset, as well as the output-architecture and
    the output-byte-order. If the output-architecture is not specified,
    the default is NATIVE. The output byte order is fixed for some
    architectures and is relevant only if the output-architecture is
    IEEE, UNIX or STD.

    Also, layout and other storage properties such as compression,
    external storage and extendible data-sets may be optionally
    specified. The layout and storage properties denote how raw data is
    to be organized on the disk. If these options are not specified, the
    default is contiguous layout and storage.

    The dataset can be organized in any of the following ways:
      1. Contiguous.
      2. Chunked.
      3. External Storage File (has to be contiguous)
      4. Extendible data sets (has to be chunked)
      5. Compressed. (has to be chunked)
      6. Compressed & Extendible (has to be chunked)

    If the user wants to store raw data in a non-HDF file, then the
    external storage file option is to be used and the name of the file
    is to be specified.

    If the user wants the dimensions of the data-set to be unlimited,
    the extendible data set option can be chosen.

    The user may also specify the type of compression and the level to
    which the data set must be compressed by setting the compressed
    option.

5 SYNOPSIS:
    h5import -h[elp], OR
    h5import [ ...] -o[utfile]

    -h[elp]:
        Prints this summary of usage, and exits.

    Input file(s):
        Name of the input file(s), each containing a single
        n-dimensional floating point or integer array in either ASCII
        text, native floating point (32-bit or 64-bit) or native integer
        (8-bit, 16-bit, 32-bit or 64-bit). Data to be specified in the
        order of fastest changing dimensions first.

    Options:
        There are two ways of specifying options: command-line arguments
        (see Section 5.A. "COMMAND-LINE ARGUMENTS") or a configuration
        file (see Section 5.B. "CONFIGURATION FILE"). Every input file
        should be associated with exactly one of those two ways.

    -o[utfile]:
        Name of the HDF5 output file. Data from one or more input files
        are stored as one or more data sets in the output file. The
        output file may be an existing file or it may be new, in which
        case it will be created.

  A. COMMAND-LINE ARGUMENTS

    The options can be specified on the command line in the following
    format.

    h5import -d[ims] [-p[ath] pathname] [-t[ype] ] [-s[ize] ...] -o[utfile]

    -d[ims]:
        A string with no spaces and only numbers separated by commas,
        describing the dimensions of the input data. For example, a
        50 x 100 2D array is to be specified with
            -dims 50,100
        If the configuration file (-c option on the command line) is not
        being used, then this is a required command-line argument.

    -p[ath]:
        A string consisting of one or more strings separated by '/' to
        represent the path of the data-set in the output file. If the
        groups in the path do not exist, they will be created.
        This is an optional argument. If not used, the default path will
        be '/dataset1'.

    -t[ype]:
        A string denoting the class of the input data. It also
        determines the class of the output data. See Section 5.C. "More
        on INPUT-CLASS". This is an optional argument. If not used, the
        default value is 'FP'.

    -s[ize]:
        A string denoting the size of the input data. It also determines
        the size of the output data. See Section 5.D. "More on
        INPUT-SIZE". This is an optional argument. If not used, the
        default value is 32.

    It should be noted that the arguments, although optional, must
    appear in the definite order presented in the syntax when they are
    used. For example, the -p option can only be used immediately
    following the -d option. The -t option immediately follows the -p
    option (or the -d option if the -p option is absent) and comes
    immediately before the -s option (if present).

    Examples using command-line arguments:

      1. h5import infile -dims 2,3,4 -type TEXTIN -size 32 -o out1

         This command will create a file 'out1' with a 2x3x4 32-bit
         integer dataset having the path '/dataset1'.

      2. h5import infile -dims 20,50 -path bin1/dset1 -type FP -size 64 -o out2

         This command will create a file 'out2' with a 20x50 64-bit
         float dataset having the path '/bin1/dset1'.

  B. CONFIGURATION FILE:

    A configuration file can be specified using the -c option. The
    options in the syntax are then replaced by -c followed by the name
    of the configuration file. For example, the syntax of the command
    will be

    h5import -c [...} ] -o[utfile]

    The configuration file is an ASCII text file and must be organized
    as "CONFIG-KEYWORD VALUE" pairs, one pair on each line.

    The configuration file may have the following keywords, each
    followed by an acceptable value.
    Required KEYWORDS:
        RANK
        DIMENSION-SIZES

    Optional KEYWORDS:
        PATH
        INPUT-CLASS
        INPUT-SIZE
        OUTPUT-CLASS
        OUTPUT-SIZE
        OUTPUT-ARCHITECTURE
        OUTPUT-BYTE-ORDER
        CHUNKED-DIMENSION-SIZES
        COMPRESSION-TYPE
        COMPRESSION-PARAM
        EXTERNAL-STORAGE
        MAXIMUM-DIMENSIONS

    Values for keywords:

    PATH:
        Strings separated by '/' to represent the path of the data-set.
        If the groups in the path do not exist, they will be created.
        If this keyword is not specified, the default value is
        '/dataset1'. For example,
            PATH grp1/grp2/dataset1
        PATH: keyword
        grp1: group under the root. If non-existent, it will be created.
        grp2: group under grp1. If non-existent, it will be created
              under grp1.
        dataset1: the name of the data-set to be created.

    INPUT-CLASS:
        String denoting the type of input data. See Section 5.C. "More
        on INPUT-CLASS". If INPUT-CLASS is "STR", then RANK,
        DIMENSION-SIZES, OUTPUT-CLASS, OUTPUT-SIZE, OUTPUT-ARCHITECTURE
        and OUTPUT-BYTE-ORDER will be ignored. If this keyword is not
        specified, the default value is 'FP'.

    INPUT-SIZE:
        Integer denoting the size of the input data. See Section 5.D.
        "More on INPUT-SIZE". If this keyword is not specified, the
        default value is 32.

    RANK:
        This is a required keyword. Integer denoting the number of
        dimensions.

    DIMENSION-SIZES:
        This is a required keyword. Integers separated by spaces to
        denote the dimension sizes for the number of dimensions
        determined by rank.

    OUTPUT-CLASS:
        String denoting the data type of the dataset to be written
        ("IN", "FP", "UIN").
        If this keyword is not specified and the keyword INPUT-CLASS is
        specified, then
          if INPUT-CLASS is "IN" or "TEXTIN", OUTPUT-CLASS is "IN";
          if INPUT-CLASS is "FP", "TEXTFP" or "TEXTFPE", OUTPUT-CLASS
          is "FP";
          if INPUT-CLASS is "UIN" or "TEXTUIN", OUTPUT-CLASS is "UIN".
        If this keyword is not specified and the keyword INPUT-CLASS is
        also not specified, then the default value of OUTPUT-CLASS is
        "FP".

    OUTPUT-SIZE:
        Integer denoting the size of the data in the output dataset to
        be written.
        If OUTPUT-CLASS is "FP", OUTPUT-SIZE can be 32 or 64.
        If OUTPUT-CLASS is "IN" or "UIN", OUTPUT-SIZE can be 8, 16, 32
        or 64.
        If this keyword is not specified and the keyword INPUT-CLASS is
        specified, then OUTPUT-SIZE is the same as INPUT-SIZE.
        If this keyword is not specified and the keyword INPUT-CLASS is
        also not specified, then the default value for OUTPUT-SIZE is
        32.

    OUTPUT-ARCHITECTURE:
        String denoting the type of output architecture. Can accept the
        following values:
            STD
            IEEE
            INTEL
            CRAY
            MIPS
            ALPHA
            NATIVE (default)
            UNIX
        Refer to section 6 Predefined Atomic Types in the H5T (datatype
        interface) in the HDF5 User Guide to know more about these
        architectures.
        (http://hdf.ncsa.uiuc.edu/HDF5/doc/Datatypes.html)
        (Only STD, IEEE and NATIVE are implemented in this version.
        The extensibility for implementing other architectures has been
        provided for.)

    OUTPUT-BYTE-ORDER:
        String denoting the output-byte-order. Ignored if the
        OUTPUT-ARCHITECTURE is not specified or if it is not IEEE, UNIX
        or STD. Can accept the following values:
            BE (default)
            LE

    By default, all of the following options are disabled, i.e., the
    default storage properties are no chunking, no compression, no
    external storage and no extensible dimensions.

    CHUNKED-DIMENSION:
        Integers separated by spaces to denote the dimension sizes of
        the chunk for the number of dimensions determined by rank.
        Required field to denote that the dataset will be stored with
        chunked storage. If this field is absent, the dataset will be
        stored with contiguous storage.

    COMPRESSION-TYPE:
        String denoting the type of compression to be used with the
        chunked storage. Requires the CHUNKED-DIMENSION to be specified.
        The only currently supported compression method is GZIP.
        Will accept the following value:
            GZIP

    COMPRESSION-PARAM:
        Integer used to denote the compression level. This option must
        always be specified when the COMPRESSION-TYPE option is
        specified. The values are applicable only to GZIP compression.
        Value 1-9: The level of compression.
        1 will result in the fastest compression while 9 will result in
        the best compression ratio. The default level of compression
        is 6.

    EXTERNAL-STORAGE:
        String to denote the name of the non-HDF5 file to store data to.
        Cannot be used if CHUNKED-DIMENSION, COMPRESSION-TYPE or
        MAXIMUM-DIMENSIONS is specified.
        Value: the name of the external file, as a string.

    MAXIMUM-DIMENSIONS:
        Integers separated by spaces to denote the maximum dimension
        sizes of all the dimensions determined by rank. Requires the
        CHUNKED-DIMENSION to be specified. A value of -1 for any
        dimension implies UNLIMITED DIMENSION size for that particular
        dimension.

    Examples of configuration file:

      1. Configuration File may look like:

            PATH work h5 pkamat First-set
            INPUT-CLASS TEXTFP
            RANK 3
            DIMENSION-SIZES 5 2 4
            OUTPUT-CLASS FP
            OUTPUT-SIZE 64
            OUTPUT-ARCHITECTURE IEEE
            OUTPUT-BYTE-ORDER LE
            CHUNKED-DIMENSION 2 2 2
            MAXIMUM-DIMENSIONS 8 8 -1

         The above configuration will accept a floating point array
         (5 x 2 x 4) in an ASCII file with the rank and dimension sizes
         specified and will save it in a chunked data-set (of pattern
         2 x 2 x 2) of 64-bit floating point in the little-endian order
         and IEEE architecture. The dataset will be stored at
         "/work/h5/pkamat/First-set".

      2. Another configuration could be:

            PATH Second-set
            INPUT-CLASS IN
            RANK 5
            DIMENSION-SIZES 6 3 5 2 4
            OUTPUT-CLASS IN
            OUTPUT-SIZE 32
            CHUNKED-DIMENSION 2 2 2 2 2
            COMPRESSION-TYPE GZIP
            COMPRESSION-PARAM 7

         The above configuration will accept an integer array
         (6 x 3 x 5 x 2 x 4) in a binary file with the rank and
         dimension sizes specified and will save it in a chunked
         data-set (of pattern 2 x 2 x 2 x 2 x 2) of 32-bit integer in
         native format (as output-architecture is not specified). The
         first and the third dimension will be defined as unlimited.
         The data-set will be compressed using GZIP and a compression
         level of 7. The dataset will be stored at "/Second-set".

  C. More on INPUT-CLASS

    The input-class can have any of the following values.
    (TEXTIN, TEXTFP, TEXTFPE, FP, IN, STR, TEXTUIN, UIN).

        TEXTIN   denotes an ASCII text file with signed integer data in
                 ASCII form,
        TEXTUIN  denotes an ASCII text file with unsigned integer data
                 in ASCII form,
        TEXTFP   denotes an ASCII text file containing floating point
                 data in the fixed notation (325.34),
        TEXTFPE  denotes an ASCII text file containing floating point
                 data in the scientific notation (3.2534E+02) (not
                 implemented in this version),
        FP       denotes a floating point binary file,
        IN       denotes a signed integer binary file,
        UIN      denotes an unsigned integer binary file, and
        STR      denotes an ASCII text file, the contents of which
                 should be stored as a 1-D array of strings (not
                 implemented in this version).

  D. More on INPUT-SIZE

    The input-size can have any of the following values.
    (8, 16, 32, 64).

        For floating point (TEXTFP, TEXTFPE, FP), INPUT-SIZE can be 32
        or 64.
        For integers, signed and unsigned (TEXTIN, TEXTUIN, IN, UIN),
        INPUT-SIZE can be 8, 16, 32 or 64.
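
    The keyword rules from Section 5.B. and the size rules above can be
    checked before a configuration file is written. The Python sketch
    below is only an illustration: build_config and its parameters are
    hypothetical helpers, not part of h5import, and the keyword order in
    the generated text simply follows the examples in Section 5.B. The
    chunking keyword appears as CHUNKED-DIMENSION-SIZES in the keyword
    list and as CHUNKED-DIMENSION in the descriptions and examples, so
    its spelling is left as a parameter.

```python
FLOAT_CLASSES = {"TEXTFP", "TEXTFPE", "FP"}
INT_CLASSES = {"TEXTIN", "TEXTUIN", "IN", "UIN"}


def build_config(dims, input_class="FP", input_size=32, path=None,
                 chunk=None, gzip_level=None, max_dims=None, external=None,
                 chunk_keyword="CHUNKED-DIMENSION"):
    """Return configuration-file text, or raise ValueError if a rule
    described in this help text is violated (hypothetical helper)."""
    if not dims:
        raise ValueError("DIMENSION-SIZES is required")
    # Section 5.D: allowed sizes depend on the input class.
    if input_class in FLOAT_CLASSES:
        allowed = (32, 64)
    elif input_class in INT_CLASSES:
        allowed = (8, 16, 32, 64)
    else:
        # STR is documented as not implemented, so it is rejected here.
        raise ValueError(f"unsupported INPUT-CLASS: {input_class}")
    if input_size not in allowed:
        raise ValueError(f"INPUT-SIZE {input_size} invalid for {input_class}")
    # Compression and maximum dimensions both require chunked storage.
    if chunk is None and (gzip_level is not None or max_dims is not None):
        raise ValueError("compression and MAXIMUM-DIMENSIONS require chunking")
    # External storage excludes chunking, compression and maximum dimensions.
    if external is not None and any(
            x is not None for x in (chunk, gzip_level, max_dims)):
        raise ValueError("EXTERNAL-STORAGE cannot be combined with chunking, "
                         "compression or MAXIMUM-DIMENSIONS")
    for name, seq in ((chunk_keyword, chunk), ("MAXIMUM-DIMENSIONS", max_dims)):
        if seq is not None and len(seq) != len(dims):
            raise ValueError(f"{name} needs one value per dimension")
    if gzip_level is not None and not 1 <= gzip_level <= 9:
        raise ValueError("GZIP level must be between 1 and 9")

    def row(keyword, seq):
        return keyword + " " + " ".join(str(v) for v in seq)

    lines = []
    if path is not None:
        lines.append(f"PATH {path}")
    lines.append(f"INPUT-CLASS {input_class}")
    lines.append(f"INPUT-SIZE {input_size}")
    lines.append(f"RANK {len(dims)}")
    lines.append(row("DIMENSION-SIZES", dims))
    if chunk is not None:
        lines.append(row(chunk_keyword, chunk))
    if gzip_level is not None:
        lines += ["COMPRESSION-TYPE GZIP", f"COMPRESSION-PARAM {gzip_level}"]
    if max_dims is not None:
        lines.append(row("MAXIMUM-DIMENSIONS", max_dims))
    if external is not None:
        lines.append(f"EXTERNAL-STORAGE {external}")
    return "\n".join(lines) + "\n"


config_text = build_config([6, 3, 5, 2, 4], input_class="IN",
                           path="Second-set", chunk=[2, 2, 2, 2, 2],
                           gzip_level=7)
print(config_text)
```

    Rejecting invalid combinations up front (for example GZIP
    compression without chunking) is a design choice of this sketch; how
    h5import itself reports such a configuration is not described here.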