Today, I decided to learn XSLT

Feb 16, 2008 17:58

I have to shred XML movie data from IMDb into a relational structure, for a project at work. I whipped up something using Perl's XML::Simple, because it's a simple problem, but then I figured it would be nicer if I could use standards to translate from the XML to the insert statements, especially if I could use a stream-based parser to keep memory requirements lower...as you might imagine, IMDb has a lot of movie data. So, I decided to look into XSLT, which I hear is the de facto XML transformation standard, and really awesome, if you can wrap your head around it.
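For the curious, here is a minimal sketch of the stream-based shredding idea in Python rather than Perl or XSLT. The element names (movies, movie, title, year), the movie table, and the sample data are all hypothetical, since the real IMDb layout isn't shown here; treat it as an illustration of the approach, not the code I actually used.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the real export's element names are assumptions.
SAMPLE = b"""<movies>
  <movie><title>Alien</title><year>1979</year></movie>
  <movie><title>Brazil</title><year>1985</year></movie>
</movies>"""


def shred(source):
    """Yield one (sql, params) pair per <movie> as the parser streams past it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "movie":
            title = elem.findtext("title")
            year = int(elem.findtext("year"))
            # Parameterized insert against a hypothetical "movie" table.
            yield ("INSERT INTO movie (title, year) VALUES (?, ?)", (title, year))
            # Discard this movie's children and text once it has been handled.
            elem.clear()


rows = list(shred(io.BytesIO(SAMPLE)))
for sql, params in rows:
    print(sql, params)
```

The point of iterparse is that each movie is handled as soon as its end tag arrives, so the whole document never has to sit in memory as a fully populated tree.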

Having been told, by a number of people, that it's actually a fairly difficult idea to wrap your head around, I set aside a large chunk of time to go and learn it. (I'm using the rest of that time to write this rant.) It took me 20 minutes to realize it was just a gimped version of LISP macros, and I'm embarrassed it took me that long. There's nothing conceptually innovative there; it's just a case of looking up the syntax when you need it.

I hate XML. For years and years, I saw the hype, and everyone was "learning" XML. I saw XML listed under "programming" sections in bookstores, and even on resumes. It's just a file format, people! Ever look at HTML? Now imagine that you can specify anything you want for element names between the angle brackets. Throw in a few optional headers at the top, and you've got XML. Want to specify which element names are allowed, inside of which other elements? Make a DTD, describing what elements can contain what other elements. This is not rocket science, and it's not even innovative -- LISP had the same type of hierarchical data structures, complete with a similar syntax, in 1959.

My main beef with it, I think, is that it's so godawful hard to read. Why oh why did anyone think that <name>content</name> was a good set of delimiters? Wouldn't it be clearer -- and more consistent with the underlying structure -- to use simple parentheses, like (name (content)), or even (name content)? That would be much easier to read. Less redundant. Oh noes, we have to count parentheses, instead of searching for a specific end tag! Err...except for the times when we have to count the tags too, because they're nested. Okay. It's a shame there's no hierarchical data syntax that uses that. Oh wait. Never mind. LISP data syntax. In 1959.
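To make the parenthesized form concrete, here is a rough Python sketch that rewrites a small XML fragment as (name content). The to_sexpr helper and the sample document are made up for illustration, and the sketch deliberately ignores attributes and any text that follows a child element.

```python
import xml.etree.ElementTree as ET


def to_sexpr(elem):
    """Render an element as (name children...), with text as a quoted atom."""
    parts = [elem.tag]
    text = (elem.text or "").strip()
    if text:
        parts.append('"' + text + '"')
    # Recurse into child elements in document order.
    parts.extend(to_sexpr(child) for child in elem)
    return "(" + " ".join(parts) + ")"


xml_doc = "<movie><title>Alien</title><year>1979</year></movie>"
sexpr = to_sexpr(ET.fromstring(xml_doc))
print(sexpr)  # (movie (title "Alien") (year "1979"))
```

Same tree, same nesting, and every element name is typed once instead of twice.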

And now, there's XSLT. Well, since 1999 or so. We can embed control flow into our data! Now that control flow is in the same syntax as our data, imagine the possibilities for templating: we can intersperse data and code! Surely, this is innovative. Oh, wait. No. LISP made that innovative leap in 1959, with its partial-execution macro system. (Granted, in this instance, XSL may be easier to read than the LISP macro syntax.)

I admit that it's easier to specify a tree structure with an XML DTD than it is in LISP, or actually anything else I can think of. You can do it, though. Since 1959. Because data and code are the exact same thing in LISP (wow, what an innovation!), you can just "evaluate" the data as code: If it parses, it's legit.

I'm mentioning LISP a lot because it was first. All of these things have been around for ages, and exist in a number of other languages. Perl, Ruby, Python, ML and lots of other languages have hierarchical data syntax. SAX parsing? Every compiler known to mankind uses a similar technology, since nearly the dawn of compilers. XPath? You have to index the heck out of your XML to make that fast, then you use -- surprise! -- relational databases to do it. The worst of both worlds: Hard to parse by humans, and hard to parse by machines!

XPath, XSLT, SAX ... they're all just libraries for manipulating an arbitrarily decided "standard" syntax. There are better tools for getting each of those jobs done. But XML is (now) universally supported, so I suppose I'm stuck with it. That's really the only reason to use it, in my opinion. It just happens to be a very compelling reason.

So, yeah. XML is another stupid file format, amid a plethora of equally useful formats. The only thing making it special is organizational backing. Go ahead and use it, but stop thinking it's innately special somehow. Please? It's getting really old.

programming, rants, tech
