Pub/Sub - it's the exciting new S&M lifestyle for non-domme authors, journalists and bloggers^Wjournalers!
Or, you know, a form of message-passing technology. Whichever floats your boat.
I shall be talking about the latter. Much as I'd like to be talking about hyperloquacious wordsmiths slinging out gerunds whilst being paddled by Mistress Spanksalot.
So Pub/Sub systems then. Pub/Sub stands for Publish and Subscribe. Essentially I (let's call me Adam) send out a message with a subject, say "Hello World", and you (let's call you Bob) can run a client that receives that message and acts on it however you want. I, the sender, don't need to know about you. And you and I don't need to know about Brian, who's also listening. Brian likes to listen. Brian gets lonely sometimes since his wife left him for a Ruby programmer. Goddamn Ruby programmers, turning up, being all hip and funky and Japanese looking and not speaking very good UTF8 ....
*cough*
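Just to make the shape of the thing concrete, here's a toy, in-process sketch in Python. A real Pub/Sub system is a networked broker daemon and none of these names come from any actual product - it's purely to show who knows about whom (i.e. nobody):

from collections import defaultdict

class ToyBroker:
    # In-process stand-in for a real, networked Pub/Sub broker
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, subject, callback):
        self.subscribers[subject].append(callback)

    def publish(self, subject, body):
        # The publisher never knows who, if anyone, is listening
        for callback in self.subscribers[subject]:
            callback(body)

broker = ToyBroker()
broker.subscribe("Hello World", lambda body: print("Bob got:", body))
broker.subscribe("Hello World", lambda body: print("Brian got:", body))
broker.publish("Hello World", "greetings from Adam")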
Anyway, so why is this useful? Well, first, let's look at that subject again. In actuality the subject would be the thing that helps you choose what to listen to. For simplicity's sake we'll say that this can be done in two ways - with hierarchical subjects or message selectors.
Hierarchical subjects allow you to use wildcards to access messages. So, for example, when someone accesses LiveJournal we could publish the log event as a message and give it the subject
/sixapart/systems/logging/livejournal/web/access
We could write something that catches all LJ web events by subbing to
/sixapart/systems/logging/livejournal/web/*
or something that catches pageviews for all SixApart properties
/sixapart/systems/logging/*/web/access
or all events for LJ, full stop
/sixapart/*/livejournal/*
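In case you're wondering how a broker might actually match those, here's a back-of-the-envelope Python version. I'm assuming a * can stand in for one or more path segments, which is what that last example implies - real brokers each have their own, differing, wildcard rules, so treat this as a sketch:

import re

def subject_matches(pattern, subject):
    # Assumption: '*' swallows one or more path segments, as
    # '/sixapart/*/livejournal/*' above implies
    regex = re.escape(pattern).replace(r"\*", r"[^/]+(?:/[^/]+)*")
    return re.fullmatch(regex, subject) is not None

event = "/sixapart/systems/logging/livejournal/web/access"
for pattern in ("/sixapart/systems/logging/livejournal/web/*",
                "/sixapart/systems/logging/*/web/access",
                "/sixapart/*/livejournal/*"):
    print(pattern, subject_matches(pattern, event))  # all True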
Message selectors allow you to set properties on a message as key/value pairs and then you subscribe using something that looks a lot like SQL. So we can set the properties
year=2006
month=12
day=6
hour=9
minute=4
and then get all events for today
WHERE year=2006 AND month=12 AND day=6
or all events that happen on Christmas
WHERE month=12 AND day=25
or every event that happens in the morning
WHERE hour<12
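Hand-wavingly, in code: real systems parse a little SQL-ish language for this (JMS selectors, for instance, are based on a subset of SQL-92), but a plain predicate over the property dict shows the idea well enough:

message = {"year": 2006, "month": 12, "day": 6, "hour": 9, "minute": 4}

# One predicate per subscription - stand-ins for the WHERE clauses above
selectors = {
    "today":     lambda p: p["year"] == 2006 and p["month"] == 12 and p["day"] == 6,
    "christmas": lambda p: p["month"] == 12 and p["day"] == 25,
    "morning":   lambda p: p["hour"] < 12,
}

for name, matches in selectors.items():
    if matches(message):
        print(name, "subscriber receives this message")
# today and morning fire; christmas doesn't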
The point with both those systems is that I, in the guise of Adam, prince of Eternia and creator of Events, and Artur and Andrea and Allison can sit on the network and generate messages. I look after posts and comments being created, Artur looks after network accesses to any of the systems, Andrea looks at ad hits and Allison makes a note whenever anybody takes a Breakfast Burrito from the kitchen on Friday morning. I don't know about them, they don't know about me and none of us know about Bob, Brian, Beth and Britney, who are sitting there listening for these events in order to calculate our ad revenue, project random posts on the wall, keep track of DDoS attacks or make sure that they get some tasty Mexican breakfast treats before they all run out.
And because no-one knows about anybody else the system is insanely scalable. We can just keep adding more boxes. Hurrah.
But why am I wittering on about it?
Well, normal search engines have no real concept of temporality. Oh, they know that your site is different from yesterday and can make inferences from that, but that's about it because, when you're searching, you generally want to know about the latest revision of the page.
Posts and comments are inherently temporal though. And that allows us to make some nice optimisations - for example we can reasonably guess that anything that happened in the last month is going to be more popular than stuff that happened 5 years ago.
Google do their scaling using (amongst other things) two clever bits of code - GFS and MapReduce (interested parties might want to also look at an open-source implementation called Hadoop) - but because of their lack of temporality they have slightly different access characteristics from ours.
But if we could build some sort of time-aware GFS/MapReduce such that current hot spots, such as the last few days, could have more hardware dedicated to them while older stuff, like any month in 2001, could share hardware, then that would be pretty cool. Obviously it would have to be scalable, which means different time spans would have to be unaware of each other.
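In Pub/Sub terms that might look something like this. The subject scheme is entirely made up (one index partition per month), but it shows how a hot month just gets more subscribers while a range query fans out over every month it touches:

from datetime import date

def partition_subject(d):
    # Hypothetical scheme: one index partition per month. A hot
    # month runs lots of nodes subscribed to the same subject;
    # a cold month in 2001 can share a box with its neighbours
    return "/search/index/%04d/%02d" % (d.year, d.month)

def partitions_for_range(start, end):
    # A range query fans out to every month it touches and the
    # caller merges the results
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield "/search/index/%04d/%02d" % (y, m)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

print(partition_subject(date(2006, 12, 6)))   # /search/index/2006/12
print(list(partitions_for_range(date(2006, 11, 1), date(2006, 12, 6))))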
Indexing is no problem - just spew out messages with the right date attached and let the nodes pick them up - but how do you query across a time range? More importantly, since you might have multiple nodes for, say, yesterday, how do you avoid duplicates and perform load balancing? How do you cope with failover and redundancy? And what happens if a node misses a post or comment? And, dammit, who took the last Breakfast Burrito?!