Goodbye Scribe, Welcome Exscribe!: fare

Nov 05, 2005 23:22

I am pleased to announce the availability of Exscribe, a document authoring tool programmed and programmable in Common Lisp. It currently only targets the web, but it's extensible, so who knows?
The Programmability of Document Authoring Tools

Exscribe is directly inspired by Scribe (note that this Scribe is Manuel Serrano's Scheme tool from the 1990's, not the antique predecessor of LaTeX from the 1980's). Actually, I use Exscribe as a drop-in replacement for Scribe: within the strictures of my programming style and with a small abstraction layer, my Scribe files run with both Scribe and Exscribe, with similar output.

I used to use Scribe for all the pages I was writing: I greatly enjoy being able to compute content on-the-fly within documents. Most authoring systems are not programmable, and force me to tediously type, check and manage the consistency of a lot of redundancy in my documents. My! I remember the bad old days of doing footnotes manually in HTML. With a programmable infrastructure, I can start from something very crude (HTML indeed), and build whichever features I need on top of it, whenever I need them; I casually add new shortcuts to factor away the tedious bookkeeping tasks so common with non-programmable tools.

Now, there already exist a few other document authoring systems that are programmable; but the way they are programmed is always a gross kluge, in a mess of a language that was never designed, never selected, and lacks the basic features required of any serious programming language. Their programming capability stems from an overgrown collection of badly interacting scripting features, special-purpose macro languages, and exotic specialer-purpose virtual machines. The most widespread programmable authoring tool, LaTeX, comes to mind; it is a collection of gross hacks, layered on top of each other, in a way hopelessly beyond any kind of salvation. Scribe, being programmable using a dialect of a real, decent programming language, namely Lisp, was a beacon of light in the dark chaos of authoring tools.

However, a few years ago, support for Scribe was discontinued in favor of the slightly but oh so annoyingly incompatible Skribe; eventually, even my binaries ceased to work on Debian: old installations are too old, new installations are too new, and I can't recompile the pristine original Scribe sources with a newly compiled Bigloo. Moreover, while I enjoyed Scribe a lot and thought its basic architecture was great, there were many quirks I didn't like about it, and the natural divergence of practical concerns between its author and me was such that these quirks couldn't really go away. Thus, instead of learning Skribe and migrating my documents to it, at great cost for no benefit, I opted to implement my own variant of Scribe.
Where a little bit of Lisp does a lot!

The result, after a lot of procrastination, is Exscribe. In many ways, it's a quick and dirty hack, thrown together in a few tens of hours of intensive work. But I really like its architecture, and though it doesn't do much yet, it does just what I need for the time being, and it does it exactly the way I want. When it fails, I have no one to blame but myself.

And I'm amazed at how much can be done with so little code: it was a breeze to add footnotes, a table of contents, and a simple bibliography to Exscribe; these features, while not fancy by any stretch of the imagination, are sufficient to completely leverage my existing code base, resolve all the quirks I used to have with Scribe and would no doubt have had with Skribe, and can easily be extended in whatever fancy ways are desired in the future.

But of course, the single most important factor to the extraordinary productivity of Lisp is that I could build Exscribe by just extending Lisp, building upon the existing infrastructure, instead of having to reinvent yet another special-purpose document-oriented programming language and to code an ad-hoc bug-ridden implementation for it. In short, using Lisp allowed me to avoid Greenspunning.

Exscribe can be used with a fancy bracketed syntax called Scribble, which is mostly compatible with that of Scribe, but cleaner: it won't randomly treat semicolons as beginning comments inside brackets, it will treat backslashes as a character escape, and it will treat a colon specially, at the beginning of a bracket only. Exscribe can also be used as a Lisp Markup Language, to create your documents with simple Lisp forms.

Exscribe currently targets only HTML, so for the time being it is only a web authoring tool. But a CL-typesetting backend should not be too hard; it's a matter of the idea gathering enough interest that somebody will take a few hours to do it. It's in my distant plans, but I don't really need it right now, and I have many more urgent projects.

In its current state, Exscribe is definitely not for end-users: error checking is minimal, the crude data representation (copied over from araneida's html hack) opens the door to a whole category of stupid mistakes, and documentation is almost non-existent. However, it is already fit for use by developers, and I'm open to collaboration to develop it.

Exscribe is asdf-installable, and depends on other asdf-installable packages: meta, fare-utils, fare-matcher and scribble. I maintain all but the first one, and I fixed many bugs in them while developing Exscribe. In case you don't use asdf-install, the repository is here.
Internal architecture

Internally, the design of Exscribe is rather simple. In a first pass, the user's program is executed, and creates a representation of the document. In a second pass, the representation is walked so as to allow for such features as footnote and bibliography. After the second pass, indices are computed and the representation is side-effected again to reflect these computations. The representation is then ready for output, which happens in a third pass.
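The pass structure described above can be illustrated with a minimal sketch. It is not Exscribe's actual code: the names `note`, `footnote`, `number-notes`, `emit` and `build-document` are hypothetical, the document is just a nested list of strings and note objects, and index computation is reduced to numbering footnotes.

```lisp
;; Hypothetical sketch of a three-pass design; not Exscribe's real API.

(defstruct note
  text            ; footnote text supplied by the author
  (number nil))   ; filled in by the second pass

(defun footnote (text)
  "Pass 1 helper: the user's program calls this while building the document."
  (make-note :text text))

(defun number-notes (doc)
  "Pass 2: walk DOC, number each note in document order (a side effect),
and return the notes as a list."
  (let ((notes '()))
    (labels ((walk (node)
               (cond ((note-p node)
                      (push node notes)
                      (setf (note-number node) (length notes)))
                     ((consp node)
                      (mapc #'walk node)))))
      (walk doc))
    (nreverse notes)))

(defun emit (node out)
  "Pass 3: write NODE to the stream OUT."
  (cond ((stringp node) (write-string node out))
        ((note-p node) (format out "[~d]" (note-number node)))
        ((consp node) (dolist (child node) (emit child out)))))

(defun build-document ()
  ;; Pass 1: executing the user's program creates the representation.
  (let* ((doc (list "Lisp" (footnote "A programmable programming language.")
                    " is fun" (footnote "Really.") "."))
         ;; Pass 2: walk it and record the numbering in the notes themselves.
         (notes (number-notes doc)))
    ;; Pass 3: output the body, then the collected footnotes.
    (with-output-to-string (out)
      (emit doc out)
      (dolist (n notes)
        (format out "~%~d. ~a" (note-number n) (note-text n))))))
```

Calling `(build-document)` returns the body text with `[1]` and `[2]` markers, followed by the two numbered footnotes on their own lines.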

Side-effects are cool. In the absence of much cleverer tools than are available, they simplify work tremendously. Sometimes, it's about inserting a cell that will hold the result of an as yet unspecified future computation. A side-effect free call-by-future could do this, but it's a feature few languages have. Or an extra level of indirection could be used, which is costly and requires an additional pass. Sometimes, you need to introspect the representation and add text at an arbitrary place. That's what I do with footnotes: I add a small indicator at the beginning of the first paragraph within the footnote text. Without side-effects, a global new pass would be necessary for every such modification. (Note that systems such as Scribe that compile the representation too early do not make it possible to do this kind of modification at all; the kludges I had to do to achieve nice multiparagraph footnotes were something I strongly disliked about Scribe.) In a better language, the compiler could perhaps merge all these passes and combine them efficiently into one. But making pure computations as efficient and easy to use as side-effects is something that no language implementation can do in the general case, and no language specification is able to guarantee. And so I stick to my side-effects. Adding nodes like that to a dynamic representation is how I always thought a compiler should be working. Damn static languages; they don't allow the schema of the representation to change with every new pass. And every side effect is conceptually a new pass.
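Both uses of side-effects mentioned above fit in a few lines of Lisp. The following is only a sketch under assumptions: `make-toc-placeholder`, `fill-toc`, `first-paragraph` and `add-marker` are hypothetical names, and paragraphs are assumed to be lists of the form `(:p ...)`.

```lisp
;; Hypothetical helpers; not Exscribe's actual functions.

(defun make-toc-placeholder ()
  "A fresh cell to put in the document now and fill in later."
  (list :ul))

(defun fill-toc (placeholder titles)
  "Destructively store one (:li title) entry per title in PLACEHOLDER."
  (setf (cdr placeholder)
        (mapcar (lambda (title) (list :li title)) titles))
  placeholder)

(defun first-paragraph (node)
  "Return the first (:p ...) form found depth-first in NODE, or NIL."
  (cond ((atom node) nil)
        ((eq (car node) :p) node)
        (t (some #'first-paragraph node))))

(defun add-marker (note-body number)
  "Destructively insert a footnote marker at the start of the first
paragraph of NOTE-BODY. NOTE-BODY must be freshly consed, not a literal."
  (let ((p (first-paragraph note-body)))
    (when p
      (push (format nil "[~d] " number) (cdr p)))
    note-body))
```

For example, after `(add-marker body 3)` on a `body` built as `(list (list :p "First paragraph.") (list :p "Second paragraph."))`, the first paragraph has become `(:p "[3] " "First paragraph.")` without any copy of the tree being made, and every structure that already pointed at that paragraph sees the change.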

So as to reuse the code from CTO which I derived from Araneida, I adopted its internal representation of HTML, which looks like this: ((:p :align 'justify) "foo" (:b "bold text") "bar"). Lists of stuff, my extension to araneida, look like this: (list "foo" "bar"). This representation, in which cons cells have two meanings depending on their structural location, made things more fragile than they should be regarding the first, user-visible, pass of data generation; it also made Exscribe slightly incompatible with idioms I used with Scribe. I had to fix my Scribe documents and introduce an abstraction layer so that my programs could run under both Scribe and Exscribe -- at least until I'm satisfied enough with Exscribe to throw away Scribe. In the future, I may change my representation slightly, using vectors for tags, which will remove the above-mentioned ambiguity.
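To make the representation above concrete, here is a minimal sketch of a walker that turns it into HTML. It is not the code from Araneida or Exscribe: `tag-form-p`, `attribute-string` and `render` are hypothetical names, and the sketch does no HTML escaping.

```lisp
;; Hypothetical renderer for the list representation; illustrative only.

(defun tag-form-p (x)
  "True if X looks like (:tag ...) or ((:tag attr value ...) ...)."
  (and (consp x)
       (or (keywordp (car x))
           (and (consp (car x)) (keywordp (caar x))))))

(defun attribute-string (value)
  "Symbols are printed in lower case; anything else is printed as is."
  (if (symbolp value)
      (string-downcase value)
      (princ-to-string value)))

(defun render (node &optional (out *standard-output*))
  "Write NODE as HTML to the stream OUT."
  (cond ((null node))
        ((stringp node) (write-string node out))
        ((tag-form-p node)
         (let* ((head (car node))
                (tag (string-downcase (if (consp head) (car head) head)))
                (attrs (if (consp head) (cdr head) '())))
           (format out "<~a" tag)
           (loop for (name value) on attrs by #'cddr
                 do (format out " ~a=\"~a\""
                            (string-downcase name)
                            (attribute-string value)))
           (write-char #\> out)
           (dolist (child (cdr node)) (render child out))
           (format out "</~a>" tag)))
        ;; Any other list is a plain "list of stuff".
        ((consp node) (dolist (item node) (render item out)))
        (t (princ node out))))
```

With this sketch, `(render '((:p :align justify) "foo" (:b "bold text") "bar"))` writes `<p align="justify">foo<b>bold text</b>bar</p>`. It also shows the fragility discussed above: a plain list whose first element happens to be a tag form, such as `(list (list :b "x") "y")`, is indistinguishable from a tag with attributes and gets misread as one.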
Performance issues

As for performance, I have been unable to get the fast native code Scribe running for many months, and I only have the unstable and slow JVM version of Scribe. Hence, the performance comparison will be unfair to Scribe; but the hassle it would take to get Scribe to run is not to be neglected, either; it might be cheaper to learn Skribe instead. Thanks to CL-Launch, it is very easy to try my code with many CL implementations. Thus, I tried several implementations on two machines, namely my home server and my friend's previously-mentioned LinuxPPC jukebox. Depending on availability on the two platforms, I tried sbcl, cmucl, clisp, openmcl, and I also compared them with an old Scribe 1.1a running on the Sun JVM. Note that I tried gcl but I couldn't get it to compile Scribble. I took a full-featured hundred-KiB essay as a benchmark, because it was long enough to exercise the implementation and actually used all the more complex features of Exscribe. I was quite surprised by the results.

At first, I was quite surprised to see SBCL faring so badly: I would have expected it to be faster than Scribe on a JVM. Then, I was surprised again to see that clisp was twice as fast as sbcl, and faster than openmcl, too. My, where are these optimizing native-code compilers wasting their time? Then, I tried cmucl and found it was slightly faster than clisp; cmucl and sbcl use essentially the same compiler, and the main difference that I know of is that sbcl now uses Unicode, so sbcl is possibly wasting time doing needless wide-character tricks; but then again, interpreted clisp uses Unicode too and is still faster; maybe I should try an sbcl compiled without Unicode to see the difference. Finally, I noticed that startup took a significant amount of time. So I also benchmarked it by compiling /dev/null, and I found that on nearly all implementations, it was taking about half of the computation time, which I think is scary. It would take a profiler to fully understand the problems at stake.

Here are the respective timings of

    time LISP=foocl exscribe -I ~/fare/lisp/exscribe -I ~/fare/www -o ~/html/liberty/microsoft_monopoly.html liberty/microsoft_monopoly

and

    time LISP=foocl exscribe -I ~/fare/lisp/exscribe -I ~/fare/www -o - /dev/null

I tried several times and considered the shortest run. (The very first run with any implementation was significantly longer anyway because it then compiled all those Lisp source files.)

Implementation       runtime on a 350MHz PII   runtime on a 240MHz PPC603ev
sbcl                 9.85 s / 5.26 s           32.5 s / 16.7 s
scribe (j2se 1.4)    9.36 s / 5.50 s           N/A
openmcl              N/A                       26.5 s / 6.62 s
clisp                4.15 s / 1.96 s           10.6 s / 5.03 s
cmucl                3.46 s / 1.71 s           N/A
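The share of each run that goes to startup can be checked directly from the timings quoted above. The snippet below is just that arithmetic; `*timings*`, `startup-percent` and `startup-shares` are names made up for the illustration, and the numbers are copied from the timings (full run, then the /dev/null run).

```lisp
;; Each entry: (implementation machine full-run-seconds startup-only-seconds)
(defparameter *timings*
  '((sbcl    pii 9.85 5.26)
    (scribe  pii 9.36 5.50)
    (clisp   pii 4.15 1.96)
    (cmucl   pii 3.46 1.71)
    (sbcl    ppc 32.5 16.7)
    (openmcl ppc 26.5 6.62)
    (clisp   ppc 10.6 5.03)))

(defun startup-percent (full startup)
  "Startup time as a percentage of the full run, rounded to an integer."
  (round (* 100 startup) full))

(defun startup-shares ()
  "Return a list of (implementation machine percent) entries."
  (loop for (impl machine full startup) in *timings*
        collect (list impl machine (startup-percent full startup))))
```

By this arithmetic the startup share falls between 47% and 59% for every row except openmcl on the PPC, where it is about 25%.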

To improve performance, I could try to eliminate startup time by compiling as many files as possible from a single session. And if I had a higher-level language than Lisp I could try some deforestation techniques to avoid all the unnecessary consing. But first I'd have to implement that higher-level language (in Lisp, of course). Or maybe I should just move over to Slate.

PS: The retrospectively obvious explanation for the performance pattern was given to me by Pascal Costanza. I verified it to be true with (time ...): what Exscribe spends over 98% of its computation time on (excluding the admittedly slow FASL loading) is compiling the Exscribe code. Indeed, the whole point of Exscribe is that your documents are Common Lisp programs, evaluated by your Common Lisp implementation. And here, the optimizing compilers will spend a lot of time trying to optimize code that is only going to be run once, whereas interpreters do the right thing of not wasting time optimizing it. That's why clisp is so good, and also why cmucl is good: cmucl includes an interpreter to be used in those situations, whereas sbcl threw this interpreter away and its eval always calls the optimizing compiler. Reading the source code takes from 5% to 15% of the evaluation time (more on clisp, less on sbcl), so it might be worth optimizing Scribble, but not until the main evaluation issue is solved.
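The effect can be observed on any form with a small measurement harness. This is a sketch, not the measurement actually performed for this post: `elapsed-seconds`, `compile-vs-run` and `eval-run-once` are hypothetical names. The SBCL branch assumes a release that provides `sb-ext:*evaluator-mode*`, which as far as I know arrived in SBCL versions later than the one benchmarked here.

```lisp
(defun elapsed-seconds (thunk)
  "Call THUNK and return the wall-clock seconds it took."
  (let ((start (get-internal-real-time)))
    (funcall thunk)
    (/ (- (get-internal-real-time) start)
       (float internal-time-units-per-second))))

(defun compile-vs-run (form)
  "Return two values: seconds spent compiling FORM wrapped in a lambda,
and seconds spent running the resulting function once."
  (let* ((fn nil)
         (compile-seconds
           (elapsed-seconds
            (lambda () (setf fn (compile nil `(lambda () ,form))))))
         (run-seconds (elapsed-seconds fn)))
    (values compile-seconds run-seconds)))

(defun eval-run-once (form)
  "Evaluate FORM, asking for the interpreter where one is known to exist."
  ;; Assumption: this SBCL has SB-EXT:*EVALUATOR-MODE*.
  #+sbcl (let ((sb-ext:*evaluator-mode* :interpret))
           (eval form))
  #-sbcl (eval form))
```

For code that runs only once, a compile time much larger than the run time is the signature of the problem described above.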

smop, lisp, tao of programming, code, exscribe, en