Occasionally I see people complaining that they can't search for "foo bar" in a search engine and get back only pages that actually contain that exact phrase.
Want to know why? Ok, to offer that they'd need an inverted index containing every n-gram of words on every web page they index.
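To make that concrete, here's a toy sketch of what such an index would look like: a dictionary mapping every contiguous word n-gram to the ids of the pages it appears on, using whitespace tokenisation and a made-up three-page corpus (none of this is how a real engine would actually do it).

    from collections import defaultdict

    def all_ngrams(words):
        # every contiguous run of words: 1-grams, 2-grams, up to the whole page
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                yield tuple(words[i:j])

    def build_index(pages):
        index = defaultdict(set)                 # n-gram -> set of page ids
        for page_id, text in pages.items():
            for gram in all_ngrams(text.lower().split()):
                index[gram].add(page_id)
        return index

    pages = {1: "foo bar baz", 2: "bar foo", 3: "foo bar"}
    index = build_index(pages)
    print(index[("foo", "bar")])                 # {1, 3}, the pages with the exact phrase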
Yahoo! and Google have been having a pissing fight about whose index of pages is bigger, but let's be conservative and go with Yahoo!'s 20 billion objects, of which they claim 19.2 billion are web pages. Let's pretend they're lying and that it's only 10 billion pages.
Now, let's be conservative and pretend that, on average, a web page has 150 words on it. Counting every possible ordering of those words as an n-gram, that makes, err, 150 factorial n-grams: 5.71 times 10 to the power of 262 or, in scientific notation, 5.71E262.
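You can sanity-check that figure in a line of Python:

    import math

    # 150 words on the page, every possible ordering counted as an n-gram
    print(f"{math.factorial(150):.2e}")   # 5.71e+262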
The average character length of an English word is 5 and the average length (i.e. number of words) of an n-gram will be 75, so, assuming 1 byte per character, the average size of the index for a page will be
5.71E262 x 5 x 75 x 1
which equals 2.14E265 bytes. To get the size of the index for every page (in Terabytes) we have to do
(1E10 x 2.14E265)/(1024 x 1024 x 1024 x 1024)
which equals
1.95E263
or, ironically, about 2E263 Terabytes, which is comfortably more than a googol (1E100) squared of them, though still nowhere near a googolplex.
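Here are those sums in Python, in case you want to fiddle with the assumptions (the word counts, byte sizes and page count are all the hand-wavy figures from above):

    import math

    ngrams_per_page = math.factorial(150)     # the 5.71E262 from above
    bytes_per_ngram = 75 * 5 * 1              # 75 words x 5 chars x 1 byte each
    pages           = 10e9                    # the 10 billion pages we're granting Yahoo!
    TB              = 1024 ** 4

    per_page_bytes = ngrams_per_page * bytes_per_ngram
    total_tb       = per_page_bytes * pages / TB
    print(f"{per_page_bytes:.2e} bytes of index per page")   # ~2.14e+265
    print(f"{total_tb:.2e} TB for the whole index")          # ~1.95e+263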
A 1TB LaCie Big Disk measures about 6.7 x 1.7 x 10.6 inches, or roughly 120.7 cubic inches. Given that the
volume of the earth in cubic inches is
4/3 x PI x 3963^3 x 5280^3 x 12^3
or 6.62E25 cubic inches, then to have an inverted index for every page we'd need a room the size of 3.55E239 Earths.
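Same deal for the Earth comparison; the disk dimensions and the 3963-mile radius are the rough figures used above:

    import math

    disk_cubic_inches  = 6.7 * 1.7 * 10.6                        # one 1TB LaCie Big Disk
    earth_cubic_inches = 4 / 3 * math.pi * (3963 * 5280 * 12) ** 3

    index_tb     = 1.95e263                                      # from the sum above
    index_volume = index_tb * disk_cubic_inches
    print(f"{earth_cubic_inches:.2e} cubic inches per Earth")    # ~6.6e+25
    print(f"{index_volume / earth_cubic_inches:.2e} Earths")     # ~3.55e+239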
Ok, let's try and be a little more sensible and go with only n-grams of three words or less, of which there should be roughly 50 x 75 x 150 (that is, 150^3/6, or 562,500) on each page. The new average n-gram length would then be about 2.
562500 x 5 x 2 x 1
Which means a measly 5.63E6 bytes of index per page and about 5.12E4 Terabytes for the whole index.
Of course this will still take up 6.18E6 cubic inches or, to put it another way, about 22,000 imperial (proper) gallons, which is a small fraction of the roughly 52 million apparently used in each transit of the Panama canal.
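And the sums for the restricted index, for anyone who wants to poke at them (277.42 cubic inches per imperial gallon is the only figure not already mentioned above):

    ngrams_per_page = 50 * 75 * 150                   # n-grams of three words or less
    bytes_per_page  = ngrams_per_page * 5 * 2 * 1     # avg 2 words x 5 chars x 1 byte
    pages           = 10e9
    TB              = 1024 ** 4
    cubic_inches_per_tb = 6.7 * 1.7 * 10.6            # the LaCie Big Disk again
    imperial_gallon     = 277.42                      # cubic inches in a (proper) gallon

    total_tb = bytes_per_page * pages / TB
    volume   = total_tb * cubic_inches_per_tb
    print(f"{bytes_per_page:.2e} bytes of index per page")     # ~5.63e+06
    print(f"{total_tb:.2e} TB for the whole index")            # ~5.12e+04
    print(f"{volume / imperial_gallon:.0f} imperial gallons")  # ~22,000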
You could probably fake it using
Latent Semantic Indexing to get a list of candidate pages and then weed out the ones which a simple grep shows don't actually contain the phrase. It wouldn't be perfect but it would give the illusion of working.
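Something like this, if you squint, using scikit-learn's TruncatedSVD as a stand-in for a proper LSI setup; the four-page corpus and the candidate cut-off are invented purely for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    pages = [
        "foo bar baz quux",
        "bar foo and other words",
        "something about foo bar again",
        "nothing relevant at all",
    ]
    phrase = "foo bar"

    vectoriser = TfidfVectorizer()
    tfidf      = vectoriser.fit_transform(pages)
    lsi        = TruncatedSVD(n_components=2, random_state=0)
    page_vecs  = lsi.fit_transform(tfidf)
    query_vec  = lsi.transform(vectoriser.transform([phrase]))

    # pull back the most similar pages as candidates...
    scores     = cosine_similarity(query_vec, page_vecs)[0]
    candidates = [p for _, p in sorted(zip(scores, pages), reverse=True)[:3]]

    # ...then the "grep": drop the ones without the literal phrase
    print([p for p in candidates if phrase in p])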
Slightly more doable might be to split the search phrase into n-grams of length 3, find all the pages containing all of those n-grams, and then do the grep. Maybe.
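A rough sketch of that last idea too, again with a made-up corpus: index word 3-grams, intersect the posting lists for the query's 3-grams, then grep whatever survives (with a dumb fallback for phrases shorter than three words):

    from collections import defaultdict

    def word_ngrams(words, n):
        # contiguous runs of exactly n words
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    def build_trigram_index(pages):
        index = defaultdict(set)                      # 3-gram -> set of page ids
        for page_id, text in pages.items():
            for gram in word_ngrams(text.lower().split(), 3):
                index[gram].add(page_id)
        return index

    def phrase_search(index, pages, phrase):
        grams = word_ngrams(phrase.lower().split(), 3)
        if not grams:                                 # phrase is only one or two words long
            return [i for i, t in pages.items() if phrase.lower() in t.lower()]
        candidates = set.intersection(*(index[g] for g in grams))
        return [i for i in candidates if phrase.lower() in pages[i].lower()]

    pages = {1: "the quick brown fox jumps", 2: "brown fox jumps the quick", 3: "quick brown fox"}
    index = build_trigram_index(pages)
    print(phrase_search(index, pages, "quick brown fox jumps"))   # [1]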