SZIP Support-- Proposals for Handling "Read Only" Libraries.
April, 2004
1. Overview
The HDF libraries are required to include SZIP compression as a standard
filter. The SZIP library has some restrictions on its use for commercial
purposes. Specifically, the decoder is free for all to use, but the encoder
may be used only for non-commercial purposes.
The SZIP library has been modified so that it can be compiled in two versions:
- the full library, and
- the library with the encoder disabled/removed.
The former may require a license for commercial use. The latter is free for all
use.
The overall approach will be to have one version of the HDF libraries, which
can be linked to either version of SZIP, depending on the user's preference
and rights. We will distribute two versions of the SZIP binaries, full and
decode only; the user may download and use either.
In order to realize this goal the HDF libraries must be modified to behave
reasonably in the case when the SZIP encoder is not available. E.g.,
in this configuration, a dataset previously compressed with SZIP can be read,
but datasets cannot be created with SZIP, nor can data be written compressed
with SZIP.
In addition to the changes to the libraries, miscellaneous tools will need
to be modified to provide meaningful feedback to the user, e.g., "this dataset
cannot be modified because you do not have the SZIP license".
This document proposes required changes to the HDF libraries.
2. Challenges for the HDF Libraries
The SZIP library presents a new and unprecedented case for HDF: it is a filter
that may be configured to be "one-way." In the current libraries, a filter
is either present or absent. If present, it is always applied (although
it may be silently skipped in some cases).
The SZIP library now has three configurations: absent, present read/write,
and present read-only. The fundamental goal for the changes to the library
is to handle the third case in a reasonable way, and in a way that the calling
program can understand.
In the future, there may be other filters with similar 'read-only'
configurations, so the solutions should be applicable to any filter.
3. Required Changes
3.1 Format Changes
No changes to either the HDF4 or HDF5 file format are required.
3.2 Filter Operations
A new error must be defined, i.e., "filter present, but writes not allowed".
E.g., if an H5Dwrite fails because SZIP is required but encoding is disabled,
the error should report the reason.
In HDF5, the semantics of the H5Z_FLAG_OPTIONAL must be refined. Currently, this flag is defined:
"If the filter fails [...] during an H5Dwrite operation then the filter is
just excluded from the pipeline for the chunk for which it failed...This is
commonly used for compression filters: if the filter result would be larger
than the input, then the compression filter returns failure and the
uncompressed data is stored in the file."
If this bit is not set (i.e., the filter is required), the operation will fail.
When SZIP encoding is enabled, it should work as described above. However,
when encoding is disabled, all reads should succeed, but all writes should fail
(rather than silently writing the data uncompressed).
Note that, while this behavior is new, it does not contradict the current
documentation, nor change the behavior of existing code or files. Therefore,
this is considered a "refinement" to the current library, which applies to
a new case.
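The refined rule can be summarized as a small decision function. The following is a minimal, self-contained sketch and not HDF5 library code: the type write_outcome_t, the function filter_write_outcome, and its parameters are hypothetical names that only model the behavior described above.

```c
#include <stdbool.h>

/* Hypothetical outcomes of passing one chunk through one filter on write. */
typedef enum {
    WRITE_FILTERED,    /* filter applied, filtered chunk stored */
    WRITE_UNFILTERED,  /* filter skipped, raw chunk stored      */
    WRITE_FAILED       /* the write operation returns an error  */
} write_outcome_t;

/*
 * encoder_enabled: false models a decode-only SZIP build.
 * optional:        models the H5Z_FLAG_OPTIONAL bit on the filter.
 * filter_ok:       whether the encoder succeeded on this chunk
 *                  (ignored when the encoder is disabled).
 */
write_outcome_t
filter_write_outcome(bool encoder_enabled, bool optional, bool filter_ok)
{
    if (!encoder_enabled)
        return WRITE_FAILED;   /* new case: never fall back silently */
    if (filter_ok)
        return WRITE_FILTERED;
    /* existing behavior: an optional filter is skipped, a required one fails */
    return optional ? WRITE_UNFILTERED : WRITE_FAILED;
}
```

Note that in this model the optional flag has no effect once the encoder is disabled, which is the refinement proposed here.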
In HDF4, the semantics of filters do not change. If encoding is disabled, the write will fail.
4. User Visible Changes (HDF5)
There are user visible cases where the HDF5 library should recognize the read-only case.
4.1. Create Dataset with SZIP
When SZIP is configured read-only, a request to create a dataset with SZIP
encoding should fail. There are three ways this may happen in HDF5.
1. Call H5Pset_szip to add SZIP to a Dataset Creation Property List
The library should detect that SZIP encoding is not enabled, and return a failure code.
2. Copy the Dataset Creation Properties from another dataset, try to create a new dataset.
In this scenario, a dataset in a file was created with another version of
the library using SZIP. The program calls H5Pcopy to copy the dataset creation
properties, and then tries to create a new dataset, calling H5Dcreate.
In this case, the library must detect that SZIP encoding is not enabled, and H5Dcreate should fail.
3. Extend a dataset that is compressed with SZIP
In this scenario, a dataset in a file was created
with another version of the library using SZIP. The dataset is extendible,
has a fill value defined, and has a fill policy that requires writing the
fill values when space is allocated.
This file is opened with SZIP encoding disabled, and H5Dextend is called to extend the dataset.
In this case, the H5Dextend should fail.
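The three scenarios above reduce to the same check: any operation that would have to encode data through SZIP must be refused when the encoder is absent. The sketch below models that check in plain C. It does not call HDF5; model_dcpl_t, the MODEL_FILTER_* IDs, and all function names are hypothetical stand-ins chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_FILTERS 8

/* Hypothetical filter IDs for this sketch (not the HDF5 values). */
enum { MODEL_FILTER_DEFLATE = 1, MODEL_FILTER_SZIP = 2 };

/* Hypothetical stand-in for a dataset creation property list:
 * only the ordered list of filter IDs in its pipeline. */
typedef struct {
    int    filters[MODEL_MAX_FILTERS];
    size_t nfilters;
} model_dcpl_t;

/* True if every filter in the pipeline can encode.
 * In this model only SZIP can be decode-only. */
bool model_pipeline_can_encode(const model_dcpl_t *dcpl, bool szip_encoder_enabled)
{
    for (size_t i = 0; i < dcpl->nfilters; i++)
        if (dcpl->filters[i] == MODEL_FILTER_SZIP && !szip_encoder_enabled)
            return false;
    return true;
}

/* Scenario 1: adding SZIP to a property list.
 * Returns 0 on success, -1 on failure (list left unchanged). */
int model_set_szip(model_dcpl_t *dcpl, bool szip_encoder_enabled)
{
    if (!szip_encoder_enabled || dcpl->nfilters >= MODEL_MAX_FILTERS)
        return -1;
    dcpl->filters[dcpl->nfilters++] = MODEL_FILTER_SZIP;
    return 0;
}

/* Scenarios 2 and 3: creating a dataset from a copied property list, or
 * extending a dataset whose fill values must be written on allocation.
 * Both need the encoder. Returns 0 on success, -1 on failure. */
int model_create_or_extend(const model_dcpl_t *dcpl, bool szip_encoder_enabled)
{
    return model_pipeline_can_encode(dcpl, szip_encoder_enabled) ? 0 : -1;
}
```

The point of the model is that the check cannot live only in the property-list setter: a copied property list already contains SZIP, so creation and extension must test the pipeline again.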
4.2. Write Data to an SZIP Compressed Dataset
It is possible for data to be created by one program compressed with SZIP,
and later read by another program with the encoder disabled. In this case,
reading the data will succeed as expected, but an attempt to write back cannot
be re-compressed, i.e., the attempt to compress will fail.
In this case, the library must do one of two actions:
- fail the write, or
- write without compression.
The proposed default is to 'fail', i.e., return an error from the write
operation. See the discussion of the H5Z_FLAG_OPTIONAL flag, above.
We could support the second behavior with a new transfer property to override
the default. This is discussed in section 6 below.
4.3. Discover Whether Encoding is Enabled
The HDF library has functions to discover the settings for compression and
other filters. These facilities need to be enhanced so the calling program
can discover whether SZIP encoding is enabled.
While a program can discover that SZIP is disabled by attempting to create
or write using SZIP, it is highly desirable to provide inquiry functions
so a program can easily determine whether SZIP encoding is enabled. This
can be used by tools to behave gracefully when SZIP is read-only, e.g., to
inform the user that this dataset cannot be compressed with this version
of the library.
1. Filter availability
The availability of filters is a feature of the library (how it was linked),
so there should be a new API call to test any filter.
We propose a new API function, e.g.:
H5Zfilter_is_available(H5Z_filter_t filter_id)
which returns: READ, WRITE, BOTH or NONE.
2. Filter properties
Currently, there are several functions that retrieve the settings for a filter,
e.g., the parameters to the compression algorithm. These are retrieved from
a dataset creation property list. It is desirable that the inquiry functions,
H5Pget_filter and so on, should be extended to report whether writing is enabled.
The proposed extension is to add another returned value that reports the availability of the filter (READ, WRITE, BOTH, or NONE).
For example:
herr_t H5Pget_filter_by_id(hid_t plist_id, H5Z_filter_t filter,
unsigned int *flags, size_t *cd_nelmts, unsigned int cd_values[],
size_t namelen, char name[])
would be extended to have a new OUT parameter, which tells whether this filter is configured for reading, writing, both, or neither.
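One way to represent the READ, WRITE, BOTH, or NONE result is as bit flags, so that BOTH is simply READ combined with WRITE and a caller can test each direction independently. The sketch below is illustrative only: filter_avail_t and the helper functions are hypothetical names, and the proposed inquiry call itself is not modeled.

```c
/* Hypothetical result type for the proposed inquiry.
 * Bit flags, so that BOTH == READ | WRITE. */
typedef enum {
    FILTER_AVAIL_NONE  = 0,
    FILTER_AVAIL_READ  = 1,  /* decoder present */
    FILTER_AVAIL_WRITE = 2,  /* encoder present */
    FILTER_AVAIL_BOTH  = FILTER_AVAIL_READ | FILTER_AVAIL_WRITE
} filter_avail_t;

/* Nonzero if data using the filter can be decoded. */
int filter_can_read(filter_avail_t avail)
{
    return (avail & FILTER_AVAIL_READ) != 0;
}

/* Nonzero if data can be encoded with the filter. */
int filter_can_write(filter_avail_t avail)
{
    return (avail & FILTER_AVAIL_WRITE) != 0;
}

/* How a tool might turn the answer into feedback for the user. */
const char *szip_status_message(filter_avail_t avail)
{
    switch (avail) {
    case FILTER_AVAIL_BOTH:
        return "SZIP: reading and writing are supported";
    case FILTER_AVAIL_READ:
        return "SZIP: read-only; this dataset cannot be compressed "
               "with this version of the library";
    case FILTER_AVAIL_WRITE:
        return "SZIP: encoder only";
    default:
        return "SZIP: not available";
    }
}
```

A tool would call the inquiry once at startup and use filter_can_write to decide whether to offer SZIP as an output option, rather than discovering the restriction from a failed write.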
5. User Visible Changes (HDF4)
There are user visible cases where the HDF4 library should recognize the read-only case.
5.1. Create Dataset with SZIP
When SZIP is configured read-only, a request to create an object with SZIP
encoding should fail.
An SDS (or GR image) is created with SDcreate (GRcreate), then compression is requested with SDsetcompress (GRsetcompress).
In this case, the SDsetcompress (GRsetcompress) should fail. The dataset can be created, but it will not be compressed.
5.2. Write Data to an SZIP Compressed Dataset
In this scenario, a dataset (GR image) is created with one version of the
library, and compressed with SZIP. The file is opened using a different
version of the library, with SZIP encoding disabled. The program writes
data to the SDS (GR), with SDwrite or SDwritechunk (GRwrite, GRwritechunk).
In this case, the write should fail, and return an appropriate error.
5.3. Discover Whether Encoding is Enabled
As discussed above, there needs to be a method to discover whether SZIP encoding
is enabled. This can be used by tools to behave gracefully when SZIP
is read-only, e.g., to inform the user that this dataset cannot be compressed
with this version of the library.
This information can be added as a new value to the comp_info_t union, which is returned by SDgetcompress (GRgetcompress).
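As a rough illustration of this idea, the sketch below adds an encoder flag to a simplified compression-info union. It is an assumption-laden model, not the HDF4 definition: model_comp_info_t, model_szip_info_t, their field names, and model_szip_writable are all hypothetical, and the real comp_info_t has more members than shown here.

```c
/* Simplified, hypothetical stand-in for the SZIP member of the
 * compression-info union. Field names are illustrative only. */
typedef struct {
    int pixels_per_block;
    int options_mask;
    int encoder_enabled;  /* proposed addition: nonzero if writes are possible */
} model_szip_info_t;

/* Hypothetical stand-in for the union returned by the
 * get-compression call; only two members are modeled. */
typedef union {
    struct { int level; } deflate;
    model_szip_info_t     szip;
} model_comp_info_t;

/* Hypothetical helper a tool could call after it has queried the
 * compression info and confirmed that the method is SZIP. */
int model_szip_writable(const model_comp_info_t *info)
{
    return info->szip.encoder_enabled != 0;
}
```

Because the value lives inside the SZIP member of the union, a caller must first confirm that the compression type is SZIP before reading it.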
6. Optional library features that might be done in the future
For the HDF5 library, we might add a new data transfer property to override
the failure on write when encoding is disabled. I.e., when requested, the
library could write uncompressed chunks into the dataset. This feature
should not be done now, but could be added in the future, if needed.
The HDF4 API could be extended to add an inquiry to determine if the compression
method is available, e.g.,
HCcomp_available( comp_code_t )
This should not be done now.
7. Changes to Tools
Once the library changes are available, several standard utilities and tools
should be modified to provide clear information to the user when the SZIP
encoding is disabled. Essentially, any tool that may create or write
data using SZIP needs to be modified to check for the availability and give
a reasonable result or message when SZIP is read only.
These tools include: hdfview (Java), h5repack, h4toh5, h5toh4, etc.
8. Documentation and Examples
It will be important to clearly document this behavior and provide examples
for how to detect and handle the case when SZIP encoding is not available.