I happened across
"A multimodular algorithm for computing Bernoulli numbers" which implements a cute hack. The Bernoulli number B(k) mod p can be efficiently calculated for prime p > 5. You can calculate a bunch of B(k) mod p_i in parallel, then use the Chinese Remainder Theorem to stitch those results back into the desired number. This algorithm's now part of the Sage toolkit.
Calculating B(20000000)
took 79 days in 2012, using this implementation. I wonder what other functions could be decomposed this way? Once you start paying attention to bits (and not just words), operating "mod p" can save you a lot of time spent multiplying, both theoretically and practically.