Notes from DM meeting, 2-23-00

Notes by: Robert E. McGrath (2/24/00)

The handout for this meeting was: "Version 1" of the DTD.

The discussion centered on how to represent the structure of the HDF-5 file in XML. In the last two weeks we've looked at examples that show two basic approaches: objects plus links, and objects in a tree with links only as needed.

We were using "Version 0" and "Version 1" to refer generically to these approaches, realizing that there are doubtless variations and mixes of these approaches.

We had a long discussion of these approaches. It is clear that each has merits and disadvantages.

Bob's Summary Comparing "Version 0" and "Version 1"

"Version 0" (Objects + Links)
  Advantages:
  • Follows HDF-5 model
  • Fully general
  • Probably is close to the data structures used by applications such as browsers
  Disadvantages:
  • Representation of simple files is complex
  • Structure in XML isn't related to structure of file
  • General purpose XML tools won't have a clue what it means

"Version 1" (Objects in tree, Links only as needed)
  Advantages:
  • Simple files are simple XML
  • In simple cases, the XML completely represents the structure of the HDF-5. Things like parent pointers are not needed because the information is in the XML hierarchy
  • General purpose tools will approximately understand a simple file
  Disadvantages:
  • XML doesn't really represent the multiple linking very well
  • Objects with additional links are asymmetrical in the XML, the same object appears in > 1 guise
  • Algorithmic manipulation, e.g., moving an object from one group to another, can require substantial rewriting of the XML (and a small change to the file can produce a large change to the XML)
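
To make the comparison concrete, here is a rough Python sketch that builds the same tiny file (a root group holding one dataset) both ways. All element and attribute names (HDF5-File, Group, Dataset, Link, id, from, to) are made up for illustration and are not taken from either draft of the DTD.

```python
import xml.etree.ElementTree as ET

# NOTE: every element and attribute name here is hypothetical,
# chosen only to illustrate the two approaches.

def version0():
    """Objects + Links: a flat list of objects; structure lives in Link elements."""
    root = ET.Element("HDF5-File")
    ET.SubElement(root, "Group", {"id": "g_root", "name": "/"})
    ET.SubElement(root, "Dataset", {"id": "d1", "name": "temperature"})
    ET.SubElement(root, "Link", {"from": "g_root", "to": "d1"})
    return root

def version1():
    """Objects in tree: group membership is expressed by XML nesting."""
    root = ET.Element("HDF5-File")
    group = ET.SubElement(root, "Group", {"name": "/"})
    ET.SubElement(group, "Dataset", {"name": "temperature"})
    return root

def depth(elem):
    """Nesting depth of an element tree (a lone element has depth 1)."""
    return 1 + max((depth(child) for child in elem), default=0)

v0, v1 = version0(), version1()
print(ET.tostring(v0, encoding="unicode"))
print(ET.tostring(v1, encoding="unicode"))
```

In the flat form the XML nesting says nothing about the file (everything sits directly under the top element), while in the tree form the nesting is the structure.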

This is obviously a big issue.

Decision:
No vote was taken.

Do we want to form a ballot and have a vote? (Who should be polled?)

For this week, we're keeping V1, with modifications. V1 is preferred for now because it makes the simple case simple.

Some details:

There were many detailed points mentioned. Here are some short notes.

It might be better not to use a 'Link', but instead have a variant of the Dataset, etc., that contains only a link. (This has equivalent information, but it makes clearer the concept that this is really 'the same dataset' in its second appearance.) To be investigated.

With Version 1, back pointers are not needed in links, since each link is always nested in exactly one group. These back pointers should be deleted, since they are a source of errors. Other back pointers should be eliminated where not needed.
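
A small sketch of why the back pointer is redundant in the nested form: a consumer can recover each object's parent from the nesting alone. The element names are again hypothetical, and the example uses Python's standard xml.etree.ElementTree only as a stand-in for whatever tool reads the XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical Version 1 style fragment: membership is shown by nesting only.
doc = ET.fromstring("""
<Group name="/">
  <Group name="g1">
    <Dataset name="d1"/>
  </Group>
  <Dataset name="d2"/>
</Group>
""")

# ElementTree elements store no parent pointer, so derive one from the nesting.
parent_of = {child: parent for parent in doc.iter() for child in parent}

d1 = doc.find("Group/Dataset")
d2 = doc.find("Dataset")
print(parent_of[d1].get("name"), parent_of[d2].get("name"))
```

Since the parent can always be computed this way, a stored back pointer can only agree with the nesting or contradict it.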

Links should have a 'target type' attribute. (This is moot if we eliminate the 'Link' as discussed above.)

There was discussion of 'forward' pointers, i.e., pointers 'forward' in the XML file. This isn't a problem for XML, and is not thought to be a problem for generating XML.
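
As a rough illustration of why forward pointers are harmless to a consumer, the sketch below resolves a link that appears before its target by first indexing every object that carries an id. The names Link, target and id are assumptions for the example, not part of the DTD.

```python
import xml.etree.ElementTree as ET

# The Link comes first in document order; its target is defined later.
doc = ET.fromstring("""
<Group name="/">
  <Link name="alias" target="d1"/>
  <Group name="g1">
    <Dataset id="d1" name="temperature"/>
  </Group>
</Group>
""")

# Pass 1: index every element that has an id.
by_id = {e.get("id"): e for e in doc.iter() if e.get("id") is not None}

# Pass 2: resolve each link against the index, regardless of document order.
resolved = {link.get("name"): by_id[link.get("target")]
            for link in doc.iter("Link")}

print(resolved["alias"].get("name"))
```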

Some open issues were briefly raised:

Does the order of the XML objects matter? E.g., does XML need to try to preserve the order of objects in the HDF-5 file? The consensus seemed to be that it is not critical that the XML preserve the order within a Group. Are there other cases where the order in XML matters?

What to do about 'missing' datatype objects?

What to do about references in general?

How should we handle shared Datatypes? This is analogous to the case of Group membership. Is it important for the XML to faithfully represent this sharing, or would it be OK to duplicate the type in each dataset/attribute that uses it? Etc.
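
To show what is at stake in the shared Datatype question, here is a hedged sketch of the "duplicate it" option: a reference to a shared type is replaced by an inline copy in each dataset. NamedDatatype, DatatypeRef, Datatype and Integer are invented names used only for this example.

```python
import copy
import xml.etree.ElementTree as ET

# Two datasets share one named datatype by reference (hypothetical markup).
doc = ET.fromstring("""
<Group name="/">
  <NamedDatatype id="t1"><Integer bits="32"/></NamedDatatype>
  <Dataset name="a"><DatatypeRef ref="t1"/></Dataset>
  <Dataset name="b"><DatatypeRef ref="t1"/></Dataset>
</Group>
""")

def inline_shared_types(root):
    """Replace each DatatypeRef with an inline copy of the type it points to."""
    types = {t.get("id"): t for t in root.iter("NamedDatatype")}
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "DatatypeRef":
                inline = ET.Element("Datatype")
                # Deep-copy so each dataset gets its own independent definition.
                inline.extend(copy.deepcopy(list(types[child.get("ref")])))
                inline.tail = child.tail
                parent[i] = inline
    return root

expanded = inline_shared_types(doc)
print(ET.tostring(expanded, encoding="unicode"))
```

After the expansion each dataset is self-describing, but nothing in the XML records that the two types were originally one object, which is exactly the information the question asks about.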

Quincey raised a number of detailed questions and suggestions about Version 1. His changes will appear soon.

Philosophical principles and questions:

1. Redundancy Principle:

Use redundancy (only?) if there is a compelling reason.

2. Is the XML required to have sufficient information to be able to reconstruct any possible H5 file?

We surely want XML to be able to express almost everything about an HDF5 file. But perhaps not everything. E.g., what about funny cases like 'missing' datatypes? These can be represented in XML some way, but need not necessarily capture the weirdness. This is a decision we can make.

A related question is how much detail the DTD will require in order to be a 'valid' XML representation. We have the option to define a minimal set of elements that must be present, along with many more that may be used if more detail is needed.

An example of this is a question like "should the BootBlock be optional or mandatory in the DTD?" For many purposes, the BootBlock is not needed by the consumer of the XML, so it could be 'optional' in the XML.

It's probably time for some 'use cases'--see below.

To do:

1. Representation of Data

A major issue TBD is how to represent data in the XML file. We have an example (XSIL) of how to handle simple cases, and Bob will try to provide one or more examples from Astronomy.

The goal here is to rapidly reach consensus on a specification for simple cases (preferably related to the current state of practice). More complex cases will be handled as best we can now, or deferred if we believe they are too difficult.

2. Use cases

Much of our discussion has been based on unstated assumptions about the purpose and use of the XML representations of HDF-5 files. Since we intend to be general, we want to support many uses. We are not seeking to identify a single set of requirements; rather, we are trying to understand cases where different uses may have different requirements--and hence might drive our design in different directions.

I propose that it is time to do some 'use cases'. I'll try to provide a sample next week.

3. XML Technology Queries

I have some questions about how XML works that need to be pursued in the next few weeks.

a) Style Sheets, XSL, etc.: algorithmically transforming XML

XML style sheets are said to be a powerful tool for providing standard ways to transform from one representation to another. I think we should be able to transform a "Version 1" representation into a "Version 0" representation and vice versa using XSL or something. If so, then we can provide a single DTD with multiple transformations for different purposes.
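
Whether XSL is the right tool is still to be checked. As a rough indication that the "Version 1" to "Version 0" direction is mechanical, here is a Python sketch (not XSL) that flattens a nested tree into a flat object list plus explicit links. The names HDF5-File, Link, id, from and to are hypothetical, the generated ids are arbitrary, and the sketch assumes a pure tree where every child element is an object; the extra Link elements that "Version 1" uses for multiply linked objects are not handled.

```python
import itertools
import xml.etree.ElementTree as ET

def tree_to_links(v1_root):
    """Flatten nested objects into a flat list of objects plus Link elements."""
    out = ET.Element("HDF5-File")
    ids = itertools.count(1)

    def visit(elem, parent_id):
        obj_id = f"obj{next(ids)}"
        # Copy the object's own attributes and give it a generated id.
        ET.SubElement(out, elem.tag, dict(elem.attrib, id=obj_id))
        if parent_id is not None:
            # Nesting in the input becomes an explicit link in the output.
            ET.SubElement(out, "Link", {"from": parent_id, "to": obj_id})
        for child in elem:
            visit(child, obj_id)

    visit(v1_root, None)
    return out

v1 = ET.fromstring(
    '<Group name="/"><Group name="g1"><Dataset name="d1"/></Group>'
    '<Dataset name="d2"/></Group>'
)
v0 = tree_to_links(v1)
print(ET.tostring(v0, encoding="unicode"))
```

The reverse direction needs a choice that this one does not: when an object has several incoming links, something has to decide which one becomes the nesting and which become explicit links.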

b) Inclusion of DTD's in DTD's, etc.

How does XML do 'inclusion'? This may be useful if we wish to 'import' other standards, e.g., for representing data.