Proposed new datatype:  H5T_VAR_STRING


Robert E. McGrath
March 23, 2004


Problem

Using strings in HDF5 from C is confusing.  No predefined HDF5 datatype corresponds directly to the "native" way that C
implements arrays of strings, i.e.,

   char *mystrs[100] = {"a string", "another one", "", "and so on", ...};

The correct data type is something like:

tid1 = H5Tcopy (H5T_C_S1);
ret = H5Tset_size (tid1,H5T_VARIABLE);

This declaration has little relation to the C data structure, and does not transparently convey the intent of the program.
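
To make the mismatch concrete, here is a minimal sketch in plain C (no HDF5 calls) of the in-memory layout a variable-length string type has to describe: each array element is a pointer, and the strings it points to have different lengths.  The helper names mystrs_count and total_string_bytes are hypothetical, introduced only for illustration, and the elided entries of the array above are left out.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative data: the strings from the declaration above, without the
 * elided entries.  Each element is a pointer, not a fixed-size buffer. */
static const char *mystrs[] = {"a string", "another one", "", "and so on"};

/* Number of strings in the array (hypothetical helper). */
static size_t mystrs_count(void)
{
    return sizeof mystrs / sizeof mystrs[0];
}

/* Total bytes of character data, counting each terminating NUL
 * (hypothetical helper).  The array itself stores none of these bytes;
 * it only stores the pointers. */
static size_t total_string_bytes(const char *const *strs, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += strlen(strs[i]) + 1;
    return total;
}
```

The size of one element is sizeof(char *) no matter how long the string is, which is why a fixed-size character type such as H5T_NATIVE_CHAR cannot describe this array.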

Also, there is a tendency for programmers to naively do the wrong thing, e.g.,

  tidbad = H5Tcopy(H5T_NATIVE_CHAR);
  H5Tset_size(tidbad, H5T_VARIABLE);  /* wrong! */

or

   if (H5Tequal(tidbad, H5T_NATIVE_CHAR)) {
      /* try to do zero terminated, VL strings.... */

   }

and similar disasters.

Solution

Define a new standard type, H5T_VAR_STRING or H5T_CHAR_STRING, as an alias for the correct declaration above (a copy of H5T_C_S1 with its size set to H5T_VARIABLE).

tid1 = H5Tcopy(H5T_VAR_STRING);

Advantages:
  1. Makes the program's intent clear in a simple way
  2. Helps users avoid the mistake of declaring the type to be 'CHAR'

Impact on other uses is minimal. The data is used the same way (for better or worse).  E.g.,


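if (H5Tis_variable_str(tid1) > 0) {

    /* ... */

    char *rdata[SPACE1_DIM1];   /* one pointer per string; the library allocates the strings */

    ret = H5Dread(dataset, tid1, H5S_ALL, H5S_ALL, xfer_pid, rdata);

    for (i = 0; i < SPACE1_DIM1; i++) {
        printf("%d: len: %d, str is: %s\n", i, (int)strlen(rdata[i]), rdata[i]);
    }

    /* free the string buffers that H5Dread allocated */
    ret = H5Dvlen_reclaim(tid1, sid1, xfer_pid, rdata);
}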

Compatibility:

This should be fully compatible with existing files and codes. It is just an alias for the right thing.  The old declaration is still correct.

Open issues:

It is not clear what the defined behavior should be for two functions:
H5Tequal(tid1, H5T_VAR_STRING) ? -- it would be nice if this worked intuitively, i.e., returned the same result as H5Tis_variable_str(tid1)
H5Tget_native_type(tid1, H5T_DIR_DEFAULT) ? -- should it return H5T_VAR_STRING?