Thursday, February 11, 2016

Is literate programming harmful?

All of my serious work of the last few years has used “lightweight literate programming”: a program's source code is embedded in a document that describes the internals. See my lightweight literate programming page for an explanation and many examples.

So I feel obliged to respond to a post by my good friend and colleague Daniel Lyons entitled “Literate programming considered harmful”.

The code is important—it’s what makes it go, and we spend all day in there reading it and writing it. We have to be able to understand it to extend it or debug it. But if I’m not able to communicate clearly to a human, it won’t stop the program from entering production—but the “incidental” detail of it being wrong will.

Agreed. It's clearly true that the functioning of the program is the first priority. And the second sentence above emphasizes the important truth that documented programs are easier to extend or debug.

I put a lot of stock in Brooks’s quote, “Build the first one to throw away, because you will.”...This leads to the second material limitation of literate programming, which is that if you were doing literate first, you have either just written a book about the wrong approach to the problem, which incidentally is also the throwaway program, or you have expended twice the resources to produce a book when what was desired was a program.

Maybe I've led a sheltered life, but I'd guess that in fewer than a third of my projects I threw away the first one and started over. Yes, it's often great to use what you have learned in the first revision. But the last major project I built for the NM Tech Computer Center went into service as soon as it was finished, and according to my friend Dylan, who is the primary user, it has been trouble-free. Of course there were lots of changes during the design, but there was less wasted effort because he caught the problems in the specification and not during testing. There are two components to this system: a GUI, “cmsadds: A GUI for CMS course creation,” and a database access layer, “cmsimport2: Courseware Banner integration tools.”

There's another substantive argument here: that literate programming costs “twice the resources” of some unnamed other approach. To what are we comparing? What is the minimal level of documentation for a serious production system? I've been a software professional for fifty years now, and I've spent a goodly slice of that time discussing this very question with my peers. Contra my former coworker Joel Eidsath, I don't really think that zero is the correct level. I think the general consensus among seasoned professionals is that a decently documented program requires a substantial effort to provide comments or other documentation. This is certainly not free.

A third option, which I have seen in practice, is that you have produced a book of negligible value, because although the book-production toolchain was employed and literate code was written, almost no effort went into forming the book-as-literature—that effort went directly into the code anyway.

Writing skills vary among software professionals. I don't know how you judge the literary merit of literate code. My technical writing guru Dr. Jon Price taught me that the first question is, who is your audience, and what assumptions can you safely make about their capabilities? And the second is, what are you trying to accomplish?

Lyons addresses this: “The market for programs that cannot be executed (or are not primarily to be executed) is precisely the book market.” I can't speak for other practitioners, but my primary audience is me!

When I start a nontrivial project nowadays, the first thing I do is to create a directory and place in it a DocBook-XML document and a rudimentary Makefile. The first section I write is the “Requirements,” which states why the project is being undertaken and what requirements it hopes to fill. Then I flesh out a complete specification of the external surface of the product.

I hardly think this is wasted effort. In my practice, the specification serves two purposes. It communicates the proposed solution to the users so they can evaluate the design before the developer has spent a lot of time building the wrong widget. Regardless of the audience, however, writing the specification is my favorite way to think seriously about how the widget is going to work: not just the feature set, but all the potential unfortunate interactions between features. So I argue that writing a tight specification should not be counted as extra effort.

This leaves us to consider whether the documentation of the actual code is extra work over and above the writing of what would generally be regarded as minimal comments or other internal documentation. Clearly the usability of the toolset is a critical factor in this evaluation. I've been using DocBook for twenty years. I have used emacs with nxml-mode for DocBook creation and editing for almost that long. There is definitely a big learning curve to master those tools. But when I started using them for literate programming, the additional learning curve was inconsequential.

One of the biggest advantages DocBook has over other documentation tools is that it has the full CALS table model, so you can build complex tables with row and column spanning. Also, DocBook makes it easy to embed images with multiple formats so the Web version can show a .jpg but the PDF rendering can use a fully vectorized format like .pdf or .svg. ASCII art is clunky and not, to me, a lot less time-consuming than quality figure creation with Inkscape.

I am thrilled by Lyons's vision of the quality literate framework of the future, which he feels is warranted for code that merits careful study:

I picture something like a fractal document. At first, you get a one-sentence summary of the system. You can then zoom in and get one paragraph, then one page. Each section, you can expand, at first from a very high-level English description, to pseudocode elaborated with technical language, finally to the actual source code.

Exactly: someone new to the code wants to see the big picture first, then work their way down a sort of pyramid until they get to the nuts and bolts when necessary. This is exactly how I structure my literate document. The tip of the pyramid is the requirements. Below that is the externals. The next layer is all the high-level design work above the layer of individual modules:

  • Entity-relationship models for databases, XML document schemata, or internal program data structures.
  • Discussion of any algorithms or data structures that are not generally known standard practice.
  • Database schemata, UML diagrams, or other high-level design artifacts.

I argue that a properly documented program of any size needs these things anyway. The literate model gives you a place to put them.

So what remains is the narrative portion, which in my work generally takes about two-thirds or more of the page count. As I write the narrative for a function or method, I am not just describing the code, I am designing the code. As Flannery O'Connor once said, “I write to discover what I know.” If I'm having trouble describing what a module does, it tells me that I haven't really thought it through. This is one of the reasons that I so frequently emphasize writing skills to student programmers: for me writing is central to the craft, not peripheral.

One of the standard complaints about documentation is that it gets out of sync with the actual code. I find that having the narrative adjacent to the code makes it much easier to fix them both when something changes. Another payoff of the literate approach is that it gives you a place to document paths not taken: we tried this and it didn't work, and here's why; or, we didn't try this, and here's why we didn't.

Yes, I admit that it's somewhat more work to write the narrative sections. One of the chores for each document is to come up with a system of unique identifiers for each section, so that you can cross-refer some other point in the document using its section identifier. One of my standards is that if a section of code calls another function in the same system, the narrative for that code section includes a hyperlink to the definition of the called function. This addresses another of Lyons's points:

I don’t know if you could create the same experience in a linear manner. I suppose what you would do is have an introduction which unfolds each layer of the high-level description up to the pseudocode. Then, each major subsystem becomes its own chapter, and you repeat the progression. But the linearity defeats the premise that you could jump around.

The primary difference between my approach and literate programming as envisioned by Dr. Don Knuth in his WEB system (no relation to the World Wide Web; he did this work in the 1980s) is that he wanted to order the presentation of the bits of code to optimize the pedagogical value. My system presents the source code in its original order, and uses hyperlinks so you can examine related functions with a mouse click.
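A minimal sketch of the idea, not the actual toolchain described here: the Python below pulls code out of a DocBook document in its original order and checks that every hyperlink points at a section identifier that exists. The `programlisting` element and the `id`, `xml:id`, and `linkend` attributes are standard DocBook; the `role="outFile:..."` convention for naming the output file, the sample document, and the function names `tangle` and `dangling_links` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # DocBook 5 spelling of "id"

# Hypothetical two-section literate document, for illustration only.
SAMPLE = """\
<article>
  <section id="main">
    <title>main(): entry point</title>
    <para>Calls <link linkend="greet">greet()</link> once.</para>
    <programlisting role="outFile:hello.py">def main():
    print(greet("world"))
</programlisting>
  </section>
  <section id="greet">
    <title>greet(): build the greeting</title>
    <programlisting role="outFile:hello.py">def greet(name):
    return "Hello, " + name
</programlisting>
  </section>
</article>
"""


def local_name(tag):
    """Strip any XML namespace so DocBook 4 and 5 element names compare equal."""
    return tag.rsplit("}", 1)[-1]


def tangle(root, prefix="outFile:"):
    """Collect code chunks per output file, in document order (no reordering).

    Assumes (hypothetically) that each chunk is a <programlisting> whose
    role attribute is prefix + filename.
    """
    files = {}
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments and processing instructions
        role = elem.get("role", "")
        if local_name(elem.tag) == "programlisting" and role.startswith(prefix):
            name = role[len(prefix):]
            files.setdefault(name, []).append("".join(elem.itertext()))
    return {name: "".join(chunks) for name, chunks in files.items()}


def dangling_links(root):
    """Return the sorted linkend targets that match no id in the document."""
    ids = set()
    targets = set()
    for elem in root.iter():
        for attr in ("id", XML_ID):
            if elem.get(attr):
                ids.add(elem.get(attr))
        if elem.get("linkend"):
            targets.add(elem.get("linkend"))
    return sorted(targets - ids)


if __name__ == "__main__":
    doc = ET.fromstring(SAMPLE)
    for filename, source in tangle(doc).items():
        print("---", filename)
        print(source, end="")
    print("dangling links:", dangling_links(doc))
```

Because the chunks are concatenated in the order they appear, the narrative and the generated source file read in the same sequence. The link check is the kind of chore that a per-section identifier scheme makes cheap to automate.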

So what is the payoff for me? Here's an example. I have been doing data entry for the Institute for Bird Populations since 1988. My data entry system is on its seventh complete rewrite. Sometimes several years go by between change requests from the client. The specification is 34 pages, and the internals about 200 pages. A lot of the code is now obsolete. Yet when I get a change request, I can generally refamiliarize myself with the project structure and jet down to the point of the change and charge the client less than an hour's labor.

In summary, I think Lyons makes a number of good points. Certainly I hope that others may use my literate code as examples (whether good or bad ones is not for me to say!), but my personal justification for what I consider a relatively modest expenditure of additional effort is that it makes it easier to work on my own code when I've been away from it for a while.

1 comment:

Stephen Smoogen said...

I have seen many arguments against literate programming over the years which seem to boil down to the following

1. I have no idea what I wrote so how the henry can I document it.

2. My code is so clear that it is literate by itself.

3. This takes up too much time for something I am just going to throw away.

The first one is probably the most honest and I wish I knew more programmers who would admit to it. Most of the time they go for 2 or 3, which rarely work out in reality. Most of the time the programmer believes that every other programmer thinks like them and so the code can be handed off to any peer without a problem. This is usually followed by "Who wrote this crap?!?!" when either that programmer is looking at the code a year later or looking at any other programmer's code.

Which leads to the problem with 3, which is that people rarely get the chance to throw stuff away. Somewhere, somehow, what you wrote is going to be in production and someone else is going to have to maintain it. Having some documentation with the code is what makes it usable in the long term, and what keeps you from finding someone at your door at 2am with a brick because that SELECT * FROM * you were going to optimize someday has been in production for 2 weeks and they finally found it.

My main reason for trying to be literate (which I fail usually) is that it slows me down to look at the code and go.. well this is what I think it should do.. why isn't the test case I wrote as an example doing what I want. My secondary reason for having literature in the first version is to avoid the problem that Brooks covers after first systems.. that is 2nd system overdesign. If I know what I did in the first one and know where it didn't work as well, I (or the guy who came in to fix what got pushed into production) will be able to avoid most of the 2nd system problem.