Feb 18, 2009 02:49
Aaaaand the hypersomnia has flipped back to insomnia again.
But so that this isn't just some kind of Twit-style expression of mental malaise (LOL MISERABLE PEOPLE ON INTERNETS), I'm going to pollute your friends page with hypertext esoterica.
You're welcome.
So, generic links. Lovely idea. Rather than link endpoints which specify some literal documents ("link from ProgLanguages.anchor1 to Perl"), generic endpoints are matched by content ("link from anywhere the word 'Perl' appears to Perl"). Simple. You keep an index, as you would for full-text searching, from words to documents with those words in, and when you get one of these links you just look up the right entry and you're done.
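The whole mechanism amounts to something like this (throwaway Python, with names I've just made up; nothing here is the real system's code):

    from collections import defaultdict

    # word -> set of documents containing that word; exactly the full-text-search index.
    word_index = defaultdict(set)

    def index_document(doc_id, text):
        # Record every distinct word in the document against its id.
        for word in set(text.lower().split()):
            word_index[word].add(doc_id)

    def documents_containing(word):
        # Resolving "anywhere the word 'Perl' appears" is a single lookup.
        return word_index.get(word.lower(), set())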
Right?
Except that I'm implementing an oh-so-clever little model which aims to encompass everything at once. I allow, theoretically, endpoints to be full queries on document properties, of which "contains word" is just one. That's not so bad: you just make your "contains word" property match based on the word index. Evaluating "CONTAINS('Perl')" is the same straightforward lookup.
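In the same toy Python as above (a sketch of the idea, not the real property API), CONTAINS() becomes one query-property object among many, and evaluating it never has to touch an actual document:

    class Contains:
        # One endpoint query property; "contains word" is just the easy one.
        def __init__(self, word, word_index):
            self.word = word.lower()
            self.word_index = word_index    # word -> set of doc ids

        def matching_documents(self):
            # Evaluating CONTAINS('Perl') is still just the index lookup.
            return self.word_index.get(self.word, set())

        def matches(self, doc_id):
            return doc_id in self.matching_documents()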
Right?
I also have first-class links. That means that the links aren't conveniently tied down at one end, but float around freely like any other document. So when you want to bake all this down to HTML (or to a similar view for editing), the first thing you have to do is find the appropriate links. For literal endpoints, that's a pretty straightforward query: you find all the links where the source (or target, if you're looking backwards) is the document being viewed. Databases and triplestores love that kind of thing.
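In sketch form (from here on I'll write a generic endpoint as a plain dict and a literal one as a document id; none of this is the real storage format):

    # A link is just another record, floating free of the documents it connects.
    links = [
        {"id": "link1", "source": {"contains": "perl"}, "target": "doc:Perl"},
        {"id": "link2", "source": "doc:ProgLanguages", "target": "doc:Perl"},
    ]

    def links_touching(doc_id, end="source"):
        # The literal case: one indexed equality test, which databases
        # and triplestores are happy to answer.
        return [link for link in links if link[end] == doc_id]

    # links_touching("doc:ProgLanguages") finds link2; link1 never shows up this way.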
How do you find the generic links involved in the node being viewed? You need to keep another index, which maps from words to links which use a CONTAINS() matching on that word. And to keep this index updated, your generic linking code now has to be involved in hooking everywhere that links can come from.
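Which means the generic-link code ends up owning a second index and a pair of hooks like these, and needing everything that creates or destroys links to call them (again, a sketch with invented names):

    from collections import defaultdict

    # word -> ids of links with a CONTAINS() endpoint on that word.
    generic_link_index = defaultdict(set)

    def on_link_added(link):
        for end in ("source", "target"):
            endpoint = link[end]
            if isinstance(endpoint, dict) and "contains" in endpoint:
                generic_link_index[endpoint["contains"]].add(link["id"])

    def on_link_removed(link):
        for end in ("source", "target"):
            endpoint = link[end]
            if isinstance(endpoint, dict) and "contains" in endpoint:
                generic_link_index[endpoint["contains"]].discard(link["id"])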
(Did I mention that this implementation is building upon a system explicitly designed to be incredibly modular and flexible, such that each link type [and pretty much every other aspect] is supposed to be a self-contained little module which you can load and unload at will? Yeah. The Generic LinkMatcher's index will go askew if you do that. Don't.)
Anyway. We've gone through each word in the document being viewed, and we've then found all the links which involve documents containing those words. (We could just iterate through all links, and in fact I can't see a way to avoid that when dealing with those "full queries on document properties" endpoints in some mythical future, but then the scalability demons eat my eyeballs.) We need to test that these query endpoints match the document we care about, but for now we know that their complexity is limited to the CONTAINS() test we just passed, so it's a case of making sure that an endpoint in the right direction fits. (This link involves the word 'Perl', but is it linking from the word 'Perl', or to [documents using] the word 'Perl'?)
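Put together, the lookup-and-direction-check step is roughly this (still the toy representation from above, not the real matcher):

    def candidate_links(doc_words, generic_link_index, links_by_id, end="source"):
        # For each word in the viewed document, pull candidate links out of the
        # word -> links index, then keep only those whose CONTAINS() endpoint
        # sits at the end we care about (from the word, versus to documents
        # containing it).
        found = []
        for word in set(doc_words):
            for link_id in generic_link_index.get(word, ()):
                link = links_by_id[link_id]
                endpoint = link[end]
                if isinstance(endpoint, dict) and endpoint.get("contains") == word:
                    found.append(link)
        return found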
So now we have some links. We need to find out where they point. For literal link endpoints, that's easy: their targets are already explicitly specified. For generic link endpoints, we need to do the word index lookup.
Don't forget that a "generic link" may have literal target endpoints. That is, after all, what a link from the word 'Perl' to the document Perl is. (An all-generic link would be one which links the word 'Perl' to all documents with the word 'gribble' in. Finding a use for this is left as an exercise for the reader^W^W thesis writeups.) Regardless of which kind of match found you the link, you'll need to ask all the link type handlers to expand out its targets. (Or sources, if working backwards.) Well, the core code currently just plucks the targets straight out of the link, so we need to update that. But that's just a silly brainfart, and nothing bad can possibly happen refactoring that out to the literal link type handler.
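The shape I mean is roughly this: the literal handler just hands back the named document, the generic handler goes back to the word index (a sketch of the idea, not the actual handler interface):

    def expand_endpoint(endpoint, word_index):
        if isinstance(endpoint, str):
            # Literal handler: the target is already named.
            return {endpoint}
        if isinstance(endpoint, dict) and "contains" in endpoint:
            # Generic handler: back to the word index.
            return word_index.get(endpoint["contains"], set())
        raise ValueError("no handler for endpoint %r" % (endpoint,))

    # expand_endpoint("doc:Perl", word_index)              -> {"doc:Perl"}
    # expand_endpoint({"contains": "gribble"}, word_index) -> every doc using 'gribble'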
Right? Nearly done?
Well, don't forget that you probably want generic endpoints to be a bit magical. If someone links from the word 'Perl', you really want to make the link anchor that word, not the whole document containing it. So you need to manipulate the document to replace the word with a magical anchor using that word.
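In crude string-level terms, which is emphatically not how it should actually be done, the rewrite is something like:

    import re

    def anchor_word(html, word, href):
        # Wrap the first occurrence of the word in an anchor along the link.
        pattern = re.compile(r"\b%s\b" % re.escape(word))
        return pattern.sub('<a href="%s">%s</a>' % (href, word), html, count=1)

    # anchor_word("I quite like Perl", "Perl", "/doc/Perl")
    #   -> 'I quite like <a href="/doc/Perl">Perl</a>'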
Thankfully I've already got working code in place to handle that kind of thing. Properly. With DOM nodes.
Oh, and don't forget that you need to handle versions alongside all this. If your indices point at just the "current" version, you'll need to pull up the previous one each time you save a replacement, to make sure that references from edited-out words get removed. If they point at specific versions, your indices are going to keep growing (despite probably only scaling O(n) across multiple values), and while links from an old version of a page will keep working, links to a word will now, without extra effort, link to every version of every page that contains that word.
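The first flavour ends up meaning something like this on every save (sketch only; the names are mine):

    def reindex_on_save(doc_id, old_text, new_text, word_index):
        # Needs the *previous* version fetched on every save, just to know
        # which words fell out of the document.
        old_words = set(old_text.lower().split())
        new_words = set(new_text.lower().split())
        for word in old_words - new_words:
            word_index.get(word, set()).discard(doc_id)
        for word in new_words - old_words:
            word_index.setdefault(word, set()).add(doc_id)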
By this point, generic linking has ended up touching the storage (extra indices, and possibly extra fetches to update them against old versions), the core of the system (which is busy trying to coalesce first-class links down into HTML-style embedded ones [and back, which is fun], and now needs to know about these magical anchors), and the link-matching layer where it belongs. Encapsulation weeps.
All this so that you can make it so that clicking the word 'Perl' takes you to the Perl document.
And I bet you've forgotten something important from a few paragraphs up by the point that you've read this far. I know I have.
Tags: malfunctions, phd