One of my friends pointed me to Tom White's "Disks have become tapes" post over the weekend... very cool idea. I remember coming across Hadoop several years ago in one of my searches for clustering technologies. I must admit, I don't think the simplicity (and, more importantly, the effectiveness) of MapReduce really clicked in my head back then. It certainly has now!
Reading Tom's "Learning MapReduce" post, it struck me that there is a parallel with mobile software agents: obviously not from the intelligent, adaptive, social agent side, but from the simple idea of bringing the code to the data. Aside from the idealistic dream of emerging artificial intelligence, that was one of the practical goals behind the Agent.pm experiment. And that's exactly one of the Hadoop design assumptions:
“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
Of course, you could argue that database stored procedures do just that, but you'd be missing the point. There's always a tradeoff, as explained in the MySQL manual:
Stored routines can provide improved performance because less information needs to be sent between the server and the client. The tradeoff is that this does increase the load on the database server because more of the work is done on the server side and less is done on the client (application) side. Consider this if many client machines (such as Web servers) are serviced by only one or a few database servers.
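So with stored procedures you're just shifting the problem from the network to the database server. What I like about MapReduce & Hadoop is that they take a new approach: the work is spread across the many machines that already hold the data, rather than piled onto one server. It's really interesting to see the idea of mobile code being re-applied here.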
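To make the "bring the code to the data" idea a bit more concrete, here is a toy sketch in Python. It is not the Hadoop API, and it skips the shuffle/grouping step of real MapReduce. The `NODES` dictionary, `map_words`, `reduce_counts` and `run_job` are all hypothetical names made up for illustration, with each "node" simply being a list of lines held in memory. The point is only that the mapper runs against each node's local chunk and just the small per-node summaries are combined.

```python
from collections import Counter
from functools import reduce

# Hypothetical "nodes": each one holds its own chunk of the data locally.
NODES = {
    "node-a": ["the quick brown fox", "the lazy dog"],
    "node-b": ["the fox jumps"],
    "node-c": ["lazy lazy dog"],
}


def map_words(lines):
    """Runs "on the node": count the words in the locally stored lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(left, right):
    """Merge two partial results; only these small summaries would travel."""
    return left + right


def run_job(nodes, mapper, reducer):
    """Apply the mapper to each node's chunk, then fold the partial results."""
    # In a real cluster the mapper would be shipped to each machine;
    # here that is simulated by calling it once per chunk.
    partials = [mapper(chunk) for chunk in nodes.values()]
    return reduce(reducer, partials, Counter())


if __name__ == "__main__":
    totals = run_job(NODES, map_words, reduce_counts)
    print(totals.most_common(3))
```

Adding another "node" to the dictionary adds more data and more mapper calls, but the reduce step still only ever sees one small `Counter` per node, which is the property the Hadoop assumption is after.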
While I don't think MapReduce will replace databases & file storage systems (and they don't claim to), I do think technologies like Hadoop are worth investigating, and potentially adding to the scalability-on-commodity hardware arsenal, alongside the likes of MogileFS, memcached, drbd, heartbeat and LVS.