Generating thousands of static index files

Jan 16, 2006 12:00


It's been a constant bugbear that Mariachi was so slow with large numbers of files because of its refusal to split stuff up into months.

There are two reasons it's so slow. The first is that every time you add a new index page, every existing index page has to be regenerated to accommodate the new pager.

My solution is for each index page to 'know' its number and then include, via an IFRAME, a pager.html, passing that number along.

pager.html will do something clever, either using Javascript or CSS selector pseudo classes, to make the link to the current page look different.
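Roughly what I have in mind (the file names and the "page" query parameter here are made up for illustration, not anything Mariachi actually does yet):

    <!-- in index page 12 (index-12.html, say) -->
    <iframe src="pager.html?page=12" width="100%" height="30"></iframe>

    <!-- pager.html: the one file regenerated when a page is added -->
    <script type="text/javascript">
      // pull the current page number out of our own query string
      var match = window.location.search.match(/[?&]page=(\d+)/);
      var current = match ? parseInt(match[1], 10) : 1;
      var total = 200; // baked in whenever pager.html is regenerated
      for (var i = 1; i <= total; i++) {
        if (i === current) {
          // the current page is unlinked so it looks different
          document.write('<strong>' + i + '</strong> ');
        } else {
          document.write('<a href="index-' + i + '.html">' + i + '</a> ');
        }
      }
    </script>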

The best-case scenario is that the item gets added at the end and only one index page needs to be regenerated.

The worst-case scenario is, of course, an item getting added right at the start (not that common), in which case we'd have to regenerate every page, but hopefully that won't be more than a few hundred pages, which isn't so bad.

The second reason it's so slow is that in this worst-case scenario you have to regenerate every mail page as well (to update its pointer back to its index page), which is incredibly slow since you have to parse and then format potentially thousands of emails.

The solution to this is to have each email page include, in an IFRAME, $item_id.index.html, which would contain its index page number. That way you wouldn't have to rewrite every item page (which would be expensive), just every $item_id.index.html, which would be much, much cheaper.
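Again, sketching with made-up names (a concrete message id standing in for $item_id):

    <!-- in the page for message 12345 -->
    <iframe src="12345.index.html" width="100%" height="20"></iframe>

    <!-- 12345.index.html is the only file rewritten when the
         message moves to a different index page -->
    This message is on <a href="index-7.html">index page 7</a>.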

Some other suggestions from Tom Insam and Ian Malpass, amongst others, include building some flex space into the index pages (so they can have between 100 and 120 items per page, for example), which should dramatically reduce the frequency of full rebuilds. In fact, you could even have some annealing where the overspill is spread between $n subsequent (and, if possible, prior) pages, which would probably prevent you from ever having to do a full rebuild.
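A back-of-the-envelope version of that insertion logic (illustrative Javascript, not real Mariachi code; the bounds and page structure are invented):

    // pages hold between MIN and MAX items, so an insert only
    // cascades when a page is already at its ceiling
    var MIN = 100, MAX = 120;

    function insertItem(pages, pageNum, item) {
      var page = pages[pageNum];
      page.items.push(item); // position within the page elided
      if (page.items.length <= MAX) {
        return [pageNum]; // only this page needs regenerating
      }
      // annealing: push the overspill onto the next page, which may
      // itself cascade, but only while consecutive pages are full
      var touched = [pageNum];
      var overspill = page.items.pop();
      if (pageNum + 1 < pages.length) {
        touched = touched.concat(insertItem(pages, pageNum + 1, overspill));
      } else {
        pages.push({ items: [overspill] }); // start a fresh last page
        touched.push(pages.length - 1);
      }
      return touched; // the pages that need rewriting
    }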

Tom came up with two solutions which can be deployed in parallel. The first: use JSON to store the mapping of message-id to index page, the number of indexes, and the next and prev links, and let each page load up the required data as necessary. Scaling issues could be dealt with by hashing the JSON files based on the first $n characters of the message-id. The potential problem with this is that the Javascript files could easily get cached.
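Something like this, say (the shard width, file layout and function names are all invented for the sake of the sketch):

    // shard the lookup table on the first two characters of the
    // message-id so no single JSON file gets too big
    function shardFor(messageId) {
      return 'lookup/' + messageId.substring(0, 2) + '.json';
    }

    // each shard maps message-id to its index page plus next/prev:
    // { "abc123@example.com": { "index": 7, "prev": "...", "next": "..." } }
    function lookUp(messageId, callback) {
      var req = new XMLHttpRequest();
      req.open('GET', shardFor(messageId), true);
      req.onreadystatechange = function () {
        if (req.readyState === 4 && req.status === 200) {
          var table = JSON.parse(req.responseText);
          callback(table[messageId]);
        }
      };
      req.send(null);
    }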

If that wasn't available then one could fall back to a CGI script which 301s to the correct index to provide all the links. This is, apparently, what LiveJournal does. I'm less enamoured with this idea since the whole point of having static pages is to make it easy to create an archive; a working CGI environment is just an order of magnitude more complicated than I really want.

Ian felt that the pager could be generated on the fly as long as each index knew its number (which it will) and the total number of mails (which is eminently doable). And, because he hates IFRAMEs, he suggested that instead of embedding IFRAMEs in the email pages, one could generate scripts and have the pages include those instead.
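That might look something like this (all the names are invented; 12345.index.js would be the cheap per-message file that gets rewritten instead of the whole page):

    <!-- in the page for message 12345 -->
    <script type="text/javascript" src="12345.index.js"></script>
    <!-- 12345.index.js contains something like: var INDEX_PAGE = 7; -->
    <script type="text/javascript" src="totals.js"></script>
    <!-- totals.js contains something like: var TOTAL_PAGES = 200; -->
    <script type="text/javascript">
      // build the pager on the fly from those two numbers
      for (var i = 1; i <= TOTAL_PAGES; i++) {
        document.write(i === INDEX_PAGE
          ? '<strong>' + i + '</strong> '
          : '<a href="index-' + i + '.html">' + i + '</a> ');
      }
    </script>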

I quite like this idea although, again, there's the issue of caching.

I'm less enamoured with his idea of having a $message-id.tmpl file for each email which contains the parsed and formatted email plus some placeholders like $INDEX, $NEXT and $PREV. When you wanted to regenerate the file you'd just grep for those tokens and replace them with the correct numbers.
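In other words, something like this (illustrative only; the real thing would be a search-and-replace pass over the .tmpl file on disk):

    // substitute the placeholder tokens with the current values
    function fillTemplate(tmpl, index, prev, next) {
      return tmpl
        .replace(/\$INDEX/g, 'index-' + index + '.html')
        .replace(/\$PREV/g, prev + '.html')
        .replace(/\$NEXT/g, next + '.html');
    }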

The big problem, of course, is that this requires parsing HTML (or just hoping that your marker sequence doesn't appear anywhere else), which is, in and of itself, a slow and tedious process.

I really need to sit down and start experimenting with this stuff.

