The search for enlightenment

Jan 19, 2007 11:55

Since I was outed in the State of the Goat post, I figured it was probably a good time to let you know what's going on with search and LJ.

In short ...
  • We're building a search system for LJ: journals, entries, communities, people, or everything on the site.
  • Our new search will respect your privacy settings.
  • We tried talking to partners like Google and Yahoo about using their search stuff, but that didn't work out.
  • LJ's new search is going to be built on 100% Free-as-in-Speech Software.
  • It'll take a few months to finish everything up and to index the billions of entries already in LJ, but we'll start testing things before then.
  • We wanted to let you know about it before it launches because our New Year's resolution was to have fewer surprises. Except for Ninja attacks, which must by their very nature remain a surprise.
If you want more details then read below the cut.



As part of our New Year's resolutions we at LiveJournal have vowed to stop ringing the buzzers of our neighbours and then running away before they open the door. Oh, and because it's one of the things users request most often, we're going to be more proactive in telling you what we're working on for the future. Which is probably slightly more relevant to you. This will have to suffice until Brad can finish working on his telepathy device, so that finally we can all live in some sort of huge borg-like hivemind and you'll know exactly what we're up to, which will, conveniently, be exactly what you want. Hurrah!

So, Search then. First of all, you're probably wondering why LJ doesn't have search at the moment. There's a simple reason for that. We're lazy. No, wait, that's the reason why nobody's cleaned up the Nacho Hat we had for Cinco de Mayo last year that's still sitting in the kitchen and has kind of gone all fuzzy and grey.

The real reason is that LJ is big. Really big. Kind of like the Nacho Hat, come to think of it.

Seriously though, put it this way - a while back Yahoo! and Google were crowing about the fact that they had around 19 billion items in their index. And these are big companies with billions in the bank and large search teams.

Then there's us. In contrast we have a mere 3 billion posts and comments. Oh, and another 16 added every second. We don't, on the other hand, have a vast, inexhaustible budget (since Brad took the 7 or so billion he got from selling his soul to Six Apart and built a house out of bundles of 100-dollar bills, from which he plots new ways to take over the world. And works on his telepathy device) or, indeed, a large team of people. We have me, working on indexing posts and comments, and Brad and Mischa, beavering away rewriting the directory and user searches to be shiny and spiffy.

However, despite these limitations, for the last couple of months we've been building a search engine. It uses one third of San Francisco's water supply to cool and it has a dedicated Nuclear Reactor just to provide the power to index the word "depressed". The phrase "my parents don't understand me" takes up so much physical disk space that we had to hire the hangars at Moffett Field. For those wanting more technical details, it's written using 100% Free Software although, in a bit of a departure from our usual Perlish nature, it uses Lucene from the Apache Software Foundation for various technical reasons which we may go into later. For those wanting really technical details, it works sort of like Google's GFS and MapReduce. Except changed to be more optimised for blog posts and comment threads.
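For a rough feel of the MapReduce idea mentioned above, here is a toy sketch in Python of building an inverted index (a word-to-posts lookup table) in a map step and a reduce step. It is purely illustrative and is not LJ's actual code, which is built on Lucene; the function names and the sample posts are made up for the example.

```python
from collections import defaultdict
import re


def map_post(post_id, text):
    """Map step: emit one (word, post_id) pair per distinct word in a post."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [(word, post_id) for word in words]


def reduce_pairs(pairs):
    """Reduce step: group post ids by word, giving an inverted index."""
    index = defaultdict(set)
    for word, post_id in pairs:
        index[word].add(post_id)
    return index


def build_index(posts):
    """Run the map step over every post, then reduce all emitted pairs."""
    pairs = []
    for post_id, text in posts.items():
        pairs.extend(map_post(post_id, text))
    return reduce_pairs(pairs)


# Hypothetical sample data, not real journal entries.
posts = {
    1: "My parents don't understand me",
    2: "So depressed today",
    3: "My cat is depressed",
}
index = build_index(posts)
print(sorted(index["depressed"]))  # [2, 3]
```

In a real distributed system the map calls would run on many machines at once and the pairs would be shuffled to reducers by word; here everything runs in one process to keep the idea visible.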

We haven't finalised the features yet - to be honest I'll be glad just to get the data into the index. Current, conservative guesstimates have it taking 3 months just to back index what we already have. What it will do is respect your privacy and pay attention to the waxes and wanes of your friends list. It will let you search date ranges and restrict your search to individual journals, communities, people or entries and their comments. Apart from that we'll just have to see where we go from there. Don't worry - we got ideas (and to us that's dear)[*].
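As a minimal sketch of what "respecting privacy" plus date and journal restrictions could look like, here is a small Python example. Everything in it is an assumption made for illustration: the `Entry` fields, the three security labels, and the `friends_of` mapping are hypothetical and say nothing about how the real system stores or checks these things.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    journal: str
    posted: date
    security: str  # hypothetical labels: "public", "friends", "private"
    text: str


def can_view(entry, viewer, friends_of):
    """Return True if the viewer is allowed to see the entry."""
    if entry.security == "public" or viewer == entry.journal:
        return True
    if entry.security == "friends":
        # The friends list is consulted at query time, so changes apply at once.
        return viewer in friends_of.get(entry.journal, set())
    return False


def search(entries, word, viewer, friends_of, journal=None, start=None, end=None):
    """Find entries containing a word, limited by journal, dates and privacy."""
    results = []
    for e in entries:
        if journal is not None and e.journal != journal:
            continue
        if start is not None and e.posted < start:
            continue
        if end is not None and e.posted > end:
            continue
        if word.lower() not in e.text.lower().split():
            continue
        if can_view(e, viewer, friends_of):
            results.append(e)
    return results


# Hypothetical sample data.
friends_of = {"alice": {"bob"}}
entries = [
    Entry("alice", date(2007, 1, 10), "public", "nacho hat"),
    Entry("alice", date(2007, 1, 15), "friends", "secret nacho recipe"),
    Entry("alice", date(2007, 1, 18), "private", "nacho diary"),
]
print(len(search(entries, "nacho", "bob", friends_of)))    # 2
print(len(search(entries, "nacho", "carol", friends_of)))  # 1
```

Checking the friends list when the query runs, rather than baking it into the index, is one way to keep results in step with a friends list that changes; it trades some query speed for never showing an entry to someone who has since been removed.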

You may be wondering why we didn't outsource to Google or Yahoo! or someone else. Or use an existing search engine package. And the answer is - we tried. It didn't work out. Trust us, it would have been a lot easier and we would have done it if we could, but indexing something like LiveJournal and getting the most out of the metadata (explicit or implicit) requires a more custom approach.

We can't give you an exact time frame for when it's all going to be unveiled - we hope soon. We're currently doing internal testing and then we'll probably roll it out to permanent and paid accounts first to see if it falls over horribly under real load. Then we'll roll it out everywhere. Hopefully that will mean we find all the bugs first and you, my dear LJers, get pure 100% distilled awesomeness.

So that's it - feel free to comment and ask questions and we'll try and answer them as best we can without either incriminating ourselves or making promises we can't keep.

- Simon (aka deflatermouse)

[*] Spot the musical reference, pop fans.