Nested Datatypes support in PyTables

Design Document

Author: Ivan Vilata
Author: Francesc Altet
Company: Cárabos Coop. V.
Date: 2005-04-21

Abstract

This document proposes a redesign of the PyTables software in order to support nested datatypes for Table objects.

First, it explains how to declare these nested datatypes by using generalizations of the existing declarative objects in PyTables. It then discusses how several existing classes in PyTables must be modified and enhanced to reach the same goal. Finally, it proposes how the RecArray and Record classes (see numarray), the foundation for I/O in Table objects, must be subclassed in order to support nested datatypes as well.

Although it is not critical for this report, it is understood that these nested datatypes will be saved on disk by using nested compound datatypes in the underlying HDF5 library. Likewise, native HDF5 files containing datasets that use nested compound datatypes and follow the HDF5_HL table specification will be supported by PyTables once the implementation phase of this proposal is finished.

Declaration of Nested Datatypes during Table creation

The user will be able to declare nested datatypes in Table objects by using generalizations of the existing declarative methods in PyTables. Such generalizations are described next.

Nested Subclasses of IsDescription

IsDescription is a metaclass designed to be used as an easy, yet meaningful way to describe the properties of Table objects through the use of classes that inherit properties from it.

The generalization required to support nested datatypes in this case should allow the declaration of nested subclasses of IsDescription. This should look like:

class NestedType(IsDescription):
    id = Int64Col()
    pos = Float32Col(shape=(2,))
    class info(IsDescription):
        name = StringCol(length=2)
        value = Complex64Col()
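
As a rough illustration of how such a nested declaration could be inspected, the sketch below walks a description class and turns it into a nested dictionary. It does not use PyTables: `Col` and the `IsDescription` defined here are simplified stand-ins, and `description_to_dict` is a hypothetical helper name, not part of this proposal.

```python
class Col:
    """Hypothetical stand-in for a PyTables column descriptor."""

    def __init__(self, type_name, shape=()):
        self.type_name = type_name
        self.shape = shape


class IsDescription:
    """Simplified stand-in for the PyTables base class (assumption)."""


def description_to_dict(cls):
    """Convert a (possibly nested) description class into a nested dict."""
    result = {}
    for name, value in vars(cls).items():
        if name.startswith('__'):
            continue  # skip class machinery such as __module__ and __doc__
        if isinstance(value, type) and issubclass(value, IsDescription):
            # A nested class describes a nested field: recurse into it.
            result[name] = description_to_dict(value)
        elif isinstance(value, Col):
            result[name] = value
    return result


class NestedType(IsDescription):
    id = Col('Int64')
    pos = Col('Float32', shape=(2,))

    class info(IsDescription):
        name = Col('String')
        value = Col('Complex64')


nested = description_to_dict(NestedType)
```

Under these assumptions, `nested['info']` is itself a dictionary holding the `name` and `value` columns, which is the same shape as the nested dictionary form of declaration.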

Nested Dictionary

Another way of describing the types in a Table is through a dictionary. The Table constructor will be enhanced so that it can accept nested dictionaries as a type descriptor. Such a generalized dictionary should look like:

{'id': Int64Col(),
 'pos': Float64Col(shape=(2,)),
 'info': {'name': StringCol(length=2),
          'value': Complex64Col()},
}
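
To make the nesting concrete, here is a minimal sketch of how a nested dictionary descriptor might be flattened into one entry per leaf column. The column objects are replaced by plain strings and tuples so the example runs without PyTables; the helper name `flatten_description` and the `/` path separator are assumptions made for illustration only.

```python
def flatten_description(desc, prefix=''):
    """Yield (path, column) pairs for every leaf column, depth first."""
    for name, value in desc.items():
        path = prefix + name
        if isinstance(value, dict):
            # A dictionary value is a nested field: descend into it.
            yield from flatten_description(value, prefix=path + '/')
        else:
            yield path, value


# Plain values stand in for the Col objects of the example above.
description = {
    'id': 'Int64',
    'pos': ('Float64', (2,)),
    'info': {'name': 'a2', 'value': 'Complex64'},
}

flat = dict(flatten_description(description))
```

With this input, the leaf columns of `info` end up under the keys `info/name` and `info/value`, while `id` and `pos` keep their own names.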

NestedRecArray

Finally, the Table constructor also accepts a RecArray object that will be used as a descriptor of the column types. As RecArray currently supports only flat datatypes, the Table constructor will be enhanced to accept NestedRecArray objects as well as RecArray ones. Using the same example as above, creating such a NestedRecArray should look like:

array(databuffer,
      names=['id', 'pos', ('info', ['name', 'value'])],
      formats=['Int64', '(2,)Float64', ['a2', 'Complex64']])

(See Subclassing RecArray and Record for more information on the aforementioned classes.)
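
Since the names and formats arguments must mirror each other's nesting, a constructor would presumably need to check that they agree. The following sketch pairs the two structures and rejects mismatches; `pair_fields` is a hypothetical helper, and the error conditions it checks are assumptions rather than behavior specified by this proposal.

```python
def pair_fields(names, formats):
    """Combine parallel nested names/formats into (name, format) pairs."""
    if len(names) != len(formats):
        raise ValueError('names and formats differ in length')
    fields = []
    for name, fmt in zip(names, formats):
        if isinstance(name, tuple):
            # A (group, subnames) tuple marks a nested field.
            group, subnames = name
            if not isinstance(fmt, list):
                raise ValueError('field %r needs a nested format list' % group)
            fields.append((group, pair_fields(subnames, fmt)))
        elif isinstance(fmt, list):
            raise ValueError('flat field %r given a nested format' % name)
        else:
            fields.append((name, fmt))
    return fields


names = ['id', 'pos', ('info', ['name', 'value'])]
formats = ['Int64', '(2,)Float64', ['a2', 'Complex64']]
fields = pair_fields(names, formats)
```

The result keeps the nesting explicit: the `info` entry carries its own list of (name, format) pairs instead of a single format string.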

Modifications needed for Table accessors

To fully support nested datatypes, some methods of the Table class, as well as of the Cols class, must be modified. The changes in the behavior of the existing methods of these classes are documented here.

Subclassing RecArray and Record

In-memory table operations in PyTables make extensive use of the RecArray class in the numarray.records module. However, this class does not support nested fields (i.e. non-homogeneous fields with sub-fields).

The NestedRecArray class shall extend the behavior of RecArray to support this kind of construct while remaining as compatible as possible with the original class. A companion NestedRecord class will extend Record in the same way. Original methods which currently return NumArray, Record or RecArray objects will now be able to return NestedRecArray and NestedRecord objects.
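
The idea of a record whose fields may themselves be records can be sketched in plain Python. The class below is purely illustrative: `NestedRecordSketch`, its constructor, and its `field` method are hypothetical names chosen for this example and should not be read as the API of the nestedrecords module.

```python
class NestedRecordSketch:
    """Hypothetical row view over column-wise nested data (not the real API)."""

    def __init__(self, columns, row):
        self._columns = columns  # name -> list of values, or a nested dict
        self._row = row

    def field(self, name):
        """Return the value of a flat field, or a sub-record for a nested one."""
        column = self._columns[name]
        if isinstance(column, dict):
            # Nested field: return another record view on the same row.
            return NestedRecordSketch(column, self._row)
        return column[self._row]


columns = {
    'id': [1, 2],
    'info': {'name': ['aa', 'bb'], 'value': [1 + 2j, 3 + 4j]},
}
record = NestedRecordSketch(columns, 1)
```

Here `record.field('id')` yields a plain value, whereas `record.field('info')` yields another record-like object whose own fields can be read in turn, which mirrors the way the extended methods are expected to return nested objects where flat ones were returned before.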

A complete explanation of the API intended to be provided by the nestedrecords module can be found here.