Raw data encryption in HDF5
Proposed work for Boeing Company
Elena Pourmal, Mike Folk, Quincey Koziol, James Laird, Robert E. McGrath
HDF5 is a file format and I/O library for storing, archiving, managing and exchanging scientific and other data. HDF5 files are portable: they can be written and read on almost all currently available platforms. HDF5 files are also “self-describing”: they contain metadata about the stored objects that the HDF5 library interprets. These features, along with the simple and flexible HDF5 data model, make it easy to share data stored in HDF5 among users, for example between HDF5 data producers and HDF5 data consumers and their applications.
Data production may be a very expensive process, so creating a “customized” data product for each data consumer is often not an acceptable solution. Customization may instead be achieved by controlling access to the data. For example, only part of the data written by a data producer to an HDF5 file may be available to one data consumer, while all data in the same file is available to another.
In the current design and implementation of the HDF5 file format and I/O library, data and metadata in HDF5 files can be easily accessed and interpreted by any HDF5 application. There is no mechanism to control access to the objects stored in a file. Implementing access control for HDF5 is an extremely difficult and challenging task; NCSA proposed a feasibility study in [1].
This proposal focuses on using the “data filtering” mechanism in HDF5 together with the AES [2] and Twofish [3] data encryption technologies to control access to some types of raw data stored in an HDF5 file.
The HDF5 library provides a “data filtering” mechanism that transforms raw data during I/O operations. For more information, see the “Filters in HDF5” document at http://hdf.ncsa.uiuc.edu/HDF5/doc/Filters.html.
Currently the “data filtering” mechanism is used to implement data compression, data shuffling and checksums for raw data stored in HDF5 chunked datasets. The HDF5 library provides predefined deflate and szip filters for the Zlib and NASA Szip compression methods, a data shuffling filter that can improve the compression ratio, and a Fletcher32 checksum filter. HDF5 version 1.8 and higher will include a predefined “formula” data transformation filter, for example for performing a linear data transformation during I/O operations. Predefined filters are available to all HDF5 users.
Application developers may use the “data filtering” mechanism to add new filters to the HDF5 I/O pipeline. For example, see http://hdf.ncsa.uiuc.edu/HDF5/papers/papers/bzip2/ for how to add Bzip2 compression to HDF5.
User-defined filters are not distributed with the NCSA HDF5 library. They are made available at run time using the “filter register” mechanism described in http://hdf.ncsa.uiuc.edu/HDF5/doc/Filters.html. If data is written to a file with a user-defined filter applied, it can be read back only if the same filter is available to the application that reads the data.
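The run-time registration behavior described above can be modeled with a minimal sketch. This is not the HDF5 API: `REGISTRY`, `register_filter`, `write_chunk`, `read_chunk` and the filter id 300 are hypothetical names chosen for illustration, and zlib compression stands in for an arbitrary user-defined filter.

```python
import zlib

# Hypothetical stand-in for the library's table of registered filters:
# filter id -> (encode, decode) callables.
REGISTRY = {}

def register_filter(filter_id, encode, decode):
    """Make a filter available to this application at run time."""
    REGISTRY[filter_id] = (encode, decode)

def write_chunk(data, pipeline):
    """Apply each filter in order and record the filter ids with the chunk."""
    for filter_id in pipeline:
        encode, _ = REGISTRY[filter_id]
        data = encode(data)
    return {"filters": list(pipeline), "bytes": data}

def read_chunk(chunk):
    """Undo the filters in reverse order; fail if one is not registered."""
    data = chunk["bytes"]
    for filter_id in reversed(chunk["filters"]):
        if filter_id not in REGISTRY:
            raise LookupError(f"filter {filter_id} is not registered")
        _, decode = REGISTRY[filter_id]
        data = decode(data)
    return data

# The writer registers a made-up user-defined filter (id 300 is arbitrary).
register_filter(300, zlib.compress, zlib.decompress)
stored = write_chunk(b"raw data " * 8, [300])
assert read_chunk(stored) == b"raw data " * 8

# A reader that never registered the filter cannot read the chunk back.
REGISTRY.clear()
try:
    read_chunk(stored)
except LookupError as err:
    print(err)
```

The sketch stores the filter ids next to the chunk so that the reader knows which filters to undo; an encryption filter would plug into the same slot as the compression stand-in.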
The “data filtering” mechanism may be used to encrypt raw data during I/O operations by providing “data encryption” filters. We propose to study how to apply the AES and Twofish encryption technologies, described in the next section, to raw data stored in HDF5 chunked datasets.
Advanced Encryption Standard (AES) and Twofish are symmetric-key encryption technologies. Symmetric-key encryption requires both the sender (data producer) and the receiver (data consumer) to know the same secret key, which is used both to encrypt and to decrypt the message (data) (see Figure 1).
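The symmetric-key property can be illustrated with a short sketch. The cipher below is a toy built from the Python standard library (a SHA-256-derived keystream XORed with the data); it is neither AES nor Twofish and is not suitable for real use. The function name `keystream_xor` and the key and message values are assumptions made for the example.

```python
import hashlib

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with a SHA-256-derived keystream.

    NOT AES or Twofish and not secure; it only shows that one shared
    key drives both encryption and decryption.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        piece = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(piece, block))
    return bytes(out)

shared_key = b"producer-and-consumer-secret"   # illustrative value
message = b"temperature readings, run 42"

ciphertext = keystream_xor(shared_key, message)     # data producer side
recovered = keystream_xor(shared_key, ciphertext)   # data consumer side

assert ciphertext != message
assert recovered == message
# A different key does not recover the original message.
assert keystream_xor(b"some other key", ciphertext) != message
```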
AES (also known as Rijndael) is a block cipher adopted as an encryption standard by the U.S. government.
Twofish encryption technology [3] is similar to Rijndael and was one of the five finalists in the NIST Advanced Encryption Standard process. Finalists were chosen based on the security, speed and flexibility of the encryption methods. Benchmarking results can be found in [4]. One of the conclusions of that study was that, while it takes more time to generate long keys for Twofish, encryption and decryption are faster than with AES.
Both AES and Twofish are available in the GNU open source libgcrypt library [5].
Filters have never been used before in HDF5 for raw data encryption.
We propose to implement two HDF5 user-defined (external) encryption filters, AES and Twofish, to investigate the “data filtering” mechanism for raw data encryption in HDF5. In particular, we would like to answer the following questions (see also [1]):
· Does this approach provide a partial solution for raw data access control?
· Does it allow interoperability across platforms?
· How is HDF5 I/O performance affected?
· How easy is it to integrate standard encryption technologies with HDF5?
· How are keys passed between data producer and data consumer?
The deliverables of the proposed work are:
· Source code for the HDF5 AES and Twofish filters; the filters will use the encryption algorithms implemented in the GNU libgcrypt library
· Source code examples for the data producer’s and the data consumer’s applications
· Technical report on the performed work, including
o Implementation description of the AES and Twofish filters
o Performance examples for the two filters
o Recommendations for future work based on the performed study
A high-level view of the proposed approach is described below and shown in Figures 2, 3 and 4.
The NCSA team will create two external filters that implement the AES and Twofish encryption methods using the GNU open source libgcrypt library. User applications (the data producer’s and the data consumer’s) will register those filters at run time with the HDF5 library. The filters have to be added to the I/O pipeline by modifying the creation properties of the datasets in which encrypted data will be stored; this needs to be done in the data producer’s application only. An HDF5 file may contain encrypted datasets alongside datasets with no encryption applied (see Figure 2).
The data producer will use the libgcrypt encryption library to generate encryption keys, which the AES and Twofish filters will in turn use to encrypt data during write operations. Those keys will be shared with the data consumer, whose application needs them in order to read encrypted data in the HDF5 file (see Figure 3).
Figure 3: The data encryption key should be passed to the data consumer along with the HDF5 file.
The data consumer’s application will need to register the NCSA-provided data encryption filter(s) and use an encryption key from the data producer to read encrypted data in the HDF5 file. If the encryption key is not available, the read operation will fail, as illustrated in Figure 4.
Figure 4: Only the AES-encrypted dataset A and the non-encrypted dataset C are available to the data consumer.
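The scenario of Figure 4 can be sketched as a small simulation. Everything here is an assumption made for illustration: the file is modeled as a Python dictionary, dataset B is assumed to be Twofish-encrypted, `read_dataset` is a hypothetical helper, and a repeating-key XOR stands in for the real AES and Twofish filters.

```python
from itertools import cycle

def xor_with_key(key: bytes, data: bytes) -> bytes:
    """Placeholder cipher (repeating-key XOR); stands in for AES/Twofish."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

# Data producer: dataset name -> (cipher label or None, stored bytes).
producer_keys = {"AES": b"aes-demo-key", "Twofish": b"twofish-demo-key"}
hdf5_file = {
    "A": ("AES", xor_with_key(producer_keys["AES"], b"dataset A values")),
    "B": ("Twofish", xor_with_key(producer_keys["Twofish"], b"dataset B values")),
    "C": (None, b"dataset C values"),
}

def read_dataset(file, name, keys):
    """Return the raw data, or raise if the needed key was not provided."""
    cipher, stored = file[name]
    if cipher is None:
        return stored
    if cipher not in keys:
        raise PermissionError(f"no {cipher} key available for dataset {name}")
    return xor_with_key(keys[cipher], stored)

# The data consumer received only the AES key along with the file.
consumer_keys = {"AES": producer_keys["AES"]}
for name in ("A", "B", "C"):
    try:
        print(name, read_dataset(hdf5_file, name, consumer_keys))
    except PermissionError as err:
        print(name, "read failed:", err)
```

In this model the producer decides which consumers can read which datasets purely by choosing which keys to hand over; the file itself is identical for every consumer.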
There are several known limitations of the proposed HDF5 encryption filters:
· HDF5 encryption filters can be applied only to raw data stored in HDF5 chunked datasets. They are not applicable to the contiguous, compact and external types of raw data storage.
· Even though in theory HDF5 encryption filters can be applied to raw data of any HDF5 datatype, it is generally not advisable to apply them to the variable-length and object reference datatypes because of the way HDF5 implements those types (see the footnote at the end of this document).
· HDF5 encryption filters cannot be applied to an HDF5 object’s attribute unless the attribute’s data is stored in a dataset and is referred to through an attribute of “object reference” type.
· It is unknown how suitable the two chosen encryption methods are for encrypting “non-text” data; this study may give a negative answer to this question.
· Only one implementation (GNU libgcrypt) of AES and Twofish will be investigated.
· Encrypting the data does not protect it from destruction or tampering, and it does not prevent the encrypted bytes themselves from being read.
[1] Robert E. McGrath et al., “Access Control for High Performance Data Management with HDF5”, proposal to the NCSA NCASSR project http://hdf.ncsa.uiuc.edu/apps/boeing/documents/AccessControlHDF5.pdf
[2] Introductory article to Advanced Encryption Standard (AES) http://www.answers.com/topic/advanced-encryption-standard
[3] Twofish: A new Block Cipher http://www.schneier.com/twofish.html
[4] Bruce Schneier and Doug Whiting, “A Performance comparison of five AES finalists” http://www.schneier.com/paper-aes-comparison.pdf
[5] GNU Open source encryption library libgcrypt ftp://ftp.gnupg.org/gcrypt/libgcrypt/
Footnote: Variable-length data is stored in a heap, with the dataset itself pointing to the heap. An HDF5 encryption filter would encrypt only the pointers, not the raw data. In the case of object reference data, an HDF5 encryption filter would encrypt only the pointers to the objects stored in the HDF5 file; the objects themselves would not be encrypted.
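The point about variable-length data can be illustrated with a simplified model. The layout below is not the actual HDF5 heap format: the `heap` byte array, the (offset, length) pointer records and the single-byte XOR `toy_filter` are all assumptions chosen to show why a filter that sees only the chunk leaves the heap contents untouched.

```python
import struct

heap = bytearray()   # stand-in for the heap that holds the actual values

def store_in_heap(value: bytes) -> bytes:
    """Append a value to the heap; return an (offset, length) pointer record."""
    offset = len(heap)
    heap.extend(value)
    return struct.pack("<II", offset, len(value))

def toy_filter(chunk: bytes) -> bytes:
    """Placeholder for an encryption filter; it sees only the chunk bytes."""
    return bytes(b ^ 0x5A for b in chunk)

# The dataset's chunk contains pointer records, not the values themselves.
values = [b"secret alpha", b"secret beta"]
chunk = b"".join(store_in_heap(v) for v in values)

filtered_chunk = toy_filter(chunk)   # what the filter pipeline would write

assert filtered_chunk != chunk           # the pointers are transformed
assert b"secret alpha" in bytes(heap)    # the heap values remain readable
```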