Fingerprinting files.

Jul 27, 2006 17:15


You know those music services where you can hold your phone up to a speaker in a pub for some seconds, and then get a text message later telling you what the song is? They’re dead clever. I don’t know properly how they work, but I know the principle.

It’s the same principle that CD players use to help with skipping if your little brother has been less than careful about putting the discs back in the boxes. The police can even use this technology to scan through vast collections of pictures for “known” contraband.

It’s used daily in nearly all password systems, and in the little electronic keys which can be used to remotely unlock vehicles - nearly everywhere in fact.
Introduction to hash functions

So what is this magical technology that is so widely used? It’s called a hash function and works a bit like a fingerprinting system. It’s a fingerprint for numbers, so it works on anything that can be stored on a computer. Files, passwords, anything.

Just like human fingerprints each one is almost unique; not completely, but in most circumstances good enough. For any particular file, be it your undergraduate dissertation or a Metallica album (hi Lars!), that one file will always have a specific fingerprint. If you change the file (by changing a typo in the document or similar) then the fingerprint changes too.

No matter how big or small the original file is the fingerprint is always the same size - about 33 characters for one particular method. So I could take the fingerprint of a full-screen movie (a few gigabytes in size) and it would produce an identifying output the length of about three telephone numbers.

I’ll show you. The fingerprint of my copy of Ride The Lighting (on the album of the same name by Metallica; hi Lars!) is d7b2bc13f4055566428e2a9150b4e783. That file is over 6 megabytes in size. The fingerprint of a single full stop, just one byte, is 8cf8463b34caa8ac871a52d5dd7ad1ef. So it doesn’t matter how big or small the input is: the result is something small enough to use in real life - you could pop that into an email, or read it over the phone without too much difficulty.

(Do excuse me one moment, I’m going to put some Metallica on the stereo.)
Real world uses of hash functions

I mentioned in my introduction just a few of the vast number of applications for hash functions. The fingerprints they produce, commonly known as hashes, pop up in all sorts of places.

Everyone’s CD player uses a specific type of hash function to help correct mistakes. The CD is divided into segments, each with a chunk of music and then a corresponding hash for that chunk. These hashes have been specifically designed so that small mistakes which may develop on the disk (like reading errors from scratches) can be automatically corrected. In this case imagine that the digital fingerprint reveals a bit more about the structure of the music so that small mistakes can be spotted. Obviously there is a limit to how effective this is - enough scratching will ruin a CD no matter how clever the mathematics involved in the error checking.

Very similar error-checking goes on in modems and network cards - in fact, most circumstances that involve communication. For each segment of data transmitted some of it is ‘real’ data and some of it is a form of backup using hash functions. Ever notice how text messages take much longer to send in areas of low signal? Much of the data being transmitted by your phone will be garbled. With good signal this is not a problem since small errors can be corrected in situ. If there are too many errors the base station can still tell, because the fingerprint for each segment of data sent doesn’t match up. Then the phone resends the garbled stuff: this is what takes so much time.

These are just one use of hashes, known as checksums. They can be used for limited error detection and correction. There are other, less mundane, fields where hash functions are employed. I’ll mention passwords because it’s easy and everyone likes cloak-and-dagger discussions. (Now would probably be a good time to listen to the James Bond theme tune, if you have it.)
You only live twice, Mr Bond

Without secure hash functions, passwords would be a really tricky scenario. Let me explain. Whenever you log into a system - your computer or email, for example - the computer which challenges you has to know your password. But if the computer knows the password then someone can just read it from the disk in the same way the login program does. Well, we could encrypt the password on disk, right? Except we’d then have to store, somewhere, the correct way to decode the passwords.

Whatever we do there will always be a weak point at the end of the line. It’s like putting something valuable in a safe, then putting the combination for that safe in another safe, then the key for that safe in another one, and so on. You must reach a point where there are no more safes.

The best solution is not to store the password at all. And this is what hash functions allow us to achieve. We store the hash of the password on disk, instead. When you try to log in the computer runs your attempted password through the hash function and compares the new hash with the stored one. If they are identical then it is assumed that what you wrote was the real password. No-one can read the password from disk, since the disk only stores the resulting fingerprint. Running this through the hash function again will produce a totally unrelated hash which won’t let the intruder in. There is a chance, a really small one, that two passwords which have identical hashes are not the same; but this chance is so small it is worth the risk of using this method. The alternative of having passwords stored “out in the open” is much riskier.

There is no single hash function that does everything. Instead, there are many different types designed for different purposes. If you’re sending passwords across a network you may be using two or three different types of hash function at the same time. This post has been a bit longer than I intended it to be, but that is because there is so much to say about hash functions. In truth, I haven’t even finished talking about all the areas where they can be used. There is much more to say about security and passwords, and it’s a subject I will probably come back to in future.

mathematics, guide, computer science

Previous post Next post
Up