Friday, May 17, 2013

Python and XML design philosophy question

My current design quandary revolves around Python interfaces to
XML files.  I use the ElementTree interface, which represents an
XML file as a tree of Element nodes.  This interface is
excellent: it reads or writes trees with a single method call,
and the lxml implementation has provisions for validating a
document against its schema.
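
To show what "a single method call" means here, a minimal sketch
follows.  It assumes the standard library's xml.etree.ElementTree
module (lxml's etree offers the same two calls), and the sample
document and its element names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element and attribute names are
# illustrative only, not the real schedule schema.
SAMPLE = "<dept name='CS'><course number='101'/></dept>"

# Reading: one call turns a file (or file-like object) into a tree.
tree = ET.parse(io.StringIO(SAMPLE))
root = tree.getroot()

# Writing: one call serializes the whole tree back out.
out = io.BytesIO()
tree.write(out)
xml_bytes = out.getvalue()
```

With a real file you would pass a file name to both ET.parse() and
tree.write() instead of the in-memory buffers used here.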

The old way

Usually, when I have needed to read an XML file, I have built a
Python class that encapsulates the XML structure and makes its
content available through attributes and method calls.

Case in point: I recently set up an app that scrapes the official
Banner class schedules for each semester and puts them up at a
given URL as an XML file.  In the Python interface to that file,
the top-level class, ClassSchedule, has a .timeStamp attribute
that says when the file was scraped, and a .lookupSemester()
method that selects a specific semester's schedules.  Helper
classes represent the other
entities: SemesterSchedule, DeptSchedule, CourseSchedule, and
SectionSchedule.

Now, if the data in such a Python object needs to be written back
out using the same XML schema, my usual approach has been to add
a .write() method that rebuilds the XML as an ElementTree and
then reserializes it back out as XML.
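
Here is a rough sketch of that old style, not the actual
ClassSchedule code.  The element names (classSchedule, semester,
dept), the attribute names, and the .read() class method are all
assumptions made up for the example; only .timeStamp,
.lookupSemester(), and .write() come from the description above.

```python
import io
import xml.etree.ElementTree as ET

class ClassSchedule:
    """Old style: copy the XML content into ordinary Python data."""

    def __init__(self, timeStamp, semesters):
        self.timeStamp = timeStamp
        # Hypothetical internal form: semester name -> list of dept names.
        self._semesters = dict(semesters)

    @classmethod
    def read(cls, source):
        # Convert XML -> ElementTree -> plain lists and dictionaries.
        root = ET.parse(source).getroot()
        semesters = {
            sem.get("name"): [d.get("name") for d in sem.findall("dept")]
            for sem in root.findall("semester")
        }
        return cls(root.get("timeStamp"), semesters)

    def lookupSemester(self, name):
        return self._semesters[name]

    def write(self, dest):
        # Rebuild the whole tree from the Python data, then serialize it.
        root = ET.Element("classSchedule", timeStamp=self.timeStamp)
        for name, depts in self._semesters.items():
            sem = ET.SubElement(root, "semester", name=name)
            for dept in depts:
                ET.SubElement(sem, "dept", name=dept)
        ET.ElementTree(root).write(dest)
```

Note that the schema is spelled out twice, once in .read() and once
in .write(), and the two have to be kept in agreement by hand.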

The new way

For a while now I've been toying with a different approach to
writing these XML interface classes.

The old way means writing code to convert the XML into an
ElementTree and the ElementTree into lists, dictionaries, and so
forth, and then writing more code to convert the lists and
dictionaries back to an ElementTree and serialize that back to
XML.  So I thought, why not use the ElementTree itself to hold
all the data?

In this view, my classes would basically be thin wrappers around
Element nodes.  My classes would still encapsulate the details
of the XML schema and provide access through attributes and
methods---so the external interface would not change---but
they would also provide methods that do things like search
the tree and read/write XML and so forth.
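
A minimal sketch of such a thin wrapper might look like the
following.  The Course class, its .title attribute, its
.findSection() method, and the XML names are hypothetical, chosen
only to illustrate the idea; the standard library's
xml.etree.ElementTree is assumed.

```python
import xml.etree.ElementTree as ET

class Course:
    """Thin wrapper: the Element node holds the data, not the wrapper."""

    def __init__(self, node):
        self.node = node

    @property
    def title(self):
        # Read straight from the tree; nothing is copied into the wrapper.
        return self.node.get("title")

    @title.setter
    def title(self, value):
        # Writes go straight into the tree as well.
        self.node.set("title", value)

    def findSection(self, sectionId):
        # Search the tree on demand; "section" and "id" are assumed names.
        for child in self.node.findall("section"):
            if child.get("id") == sectionId:
                return child
        return None

course = Course(ET.fromstring('<course title="Intro"><section id="001"/></course>'))
course.title = "Intro to CS"
```

The caller still sees an ordinary attribute, but every read and
write lands in the Element node, so there is nothing to convert
back when it is time to write the file.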

My first attempt uses this approach only for reading.  You can
read the external spec and the internals, but it's not necessary
that you read these to comment intelligently on my design
question, which follows.

My current problem

My next interface is a read-write interface.  It can read the
XML file, let you change the data, and then write it back out.

For a very simple case, consider a three-level structure: the XML
represents a department and has course children, and each course
has section children.  So we make three wrapper classes called
Dept, Course, and Section, and each of these classes has a .node
attribute which is the corresponding Element node that contains
the actual data for that entity.

One purpose of this approach is to make the conversion of the
ElementTree back into XML trivial: it is literally one method
call: tree.write(outFileName).

Suppose the caller of the Python interface wants to add a new
Section child to an existing Course instance.  In the underlying
ElementTree, this means that I'll create a new Section node and
add it as the next child of the Course node.  That's very
straightforward.
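
At the tree level, that step can be sketched like this, assuming
the standard library's xml.etree.ElementTree and made-up element
and attribute names (course, section, id):

```python
import xml.etree.ElementTree as ET

# Hypothetical course node with one existing section.
course_node = ET.fromstring('<course number="101"><section id="001"/></course>')

# SubElement creates the new node and appends it as the last child.
new_section = ET.SubElement(course_node, "section", id="002")
```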

Now let's consider the implementation of the Course.genSections()
method: given a Course instance, this generates all of its child
Section instances.

It's easy for the Course instance to go to its .node and
find all that node's child Element instances.  But how do I
go backwards: given the Element node for a section, find
the Section instance that wraps that Element node, which is
what I want to give back to the caller?

The brute-force way is to break encapsulation of the Element
instance and cram a .myThing attribute into it that points back
at the corresponding wrapper instance.  But as we highly evolved
software masters know, every time you break encapsulation, an
angel cries, and Satan drowns an innocent kitten.  So clearly
that is not even worth considering.

My current solution is not that ugly, but it's pretty ugly.
Each Course instance maintains an internal list, ._sectionList,
that contains the child Section wrapper instances.

This list obviously has to have a 1-to-1 correspondence to
the actual nodes in the tree.  Therefore every time I change
the set of child nodes in the tree, I have to make the
corresponding change in the wrapper class's list of wrapped
child nodes.  This seems to me like not only a lot of extra
code but a lot of opportunity for things to get out of sync
and then Bad Things Happen.
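
To make the bookkeeping concrete, here is a stripped-down sketch of
the parallel-list arrangement, not my actual code.  The
.addSection() and .removeSection() methods and the XML names are
assumptions for the example; .node, ._sectionList, and
.genSections() are as described above.

```python
import xml.etree.ElementTree as ET

class Section:
    def __init__(self, node):
        self.node = node

class Course:
    def __init__(self, node):
        self.node = node
        # Parallel list of wrappers: one per <section> child, same order.
        self._sectionList = [Section(child)
                             for child in node.findall("section")]

    def addSection(self, sectionId):
        child = ET.SubElement(self.node, "section", id=sectionId)
        section = Section(child)
        self._sectionList.append(section)   # must mirror the tree change
        return section

    def removeSection(self, section):
        self.node.remove(section.node)
        self._sectionList.remove(section)   # forget this and they diverge

    def genSections(self):
        # Hand back the wrappers, never the raw Element nodes.
        for section in self._sectionList:
            yield section
```

Every method that touches the child nodes needs its own matching
line for ._sectionList, and any code that edits .node directly
bypasses the list entirely.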

Can anyone suggest a better approach?  Think of the poor
innocent kittens!