A nifty data structure; or, how I reinvented the Bloom filter, poorly.

Jul 06, 2009 07:33

A while back, I wrote a program that was trying to solve a design verification problem. It was trying to find a compact set of tests that moved the design through as much coverage space as possible. It used a searching algorithm to try to find low-cost ways to reach each state. Most importantly, it built tests incrementally, taking existing "proto-tests" and extending them in various ways.

To keep the searcher from going in circles, I needed to filter out redundant states. Consider the following two code snippets:
// Snippet 1
x = 1;
x = x + 3;

// Snippet 2
x = 2;
x = x + 2;

Both of these snippets do different things, but they end up at the same final state. So while it might be important to run both of these tests, it is not necessarily useful to extend both. I do need to test that I can add 3 to 1 and that I can add 2 to 2, and that both leave x set to 4. I only need to extend one of them, though, while building up my database.

For my particular problem, I had an absolutely gigantic state space to deal with: several billion states. That's large enough that I couldn't put together a full bitmap that could answer the question "Is this state redundant?" precisely. That's when inspiration struck.

I didn't need a precise answer to "is this redundant?" For what I was doing, a "pretty good" answer was good enough. Instead of a single full bitmap, I could instead use a hash function on the state information to compress the large state number (a number between 0 and about 25 billion) into something more manageable. But what about hash collisions?

To mitigate those, I reasoned, it should be possible to hash the state number with multiple, distinct hashes, and do parallel lookups on multiple bitmaps. When setting a bit (i.e., marking a state as 'visited'), I set the bit in all the bitmaps. When testing whether a state had been visited, all of the bitmaps had to say "yes" before I'd treat it as a yes. In my final code I used four bitmaps. For light to moderate "loading levels", I figured this would reduce the likelihood of a false positive (a never-visited state colliding in all four bitmaps at once) down into the noise.
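Something like the following C sketch captures the scheme. This is a reconstruction rather than my original code: the multiplicative mixing constants and the bitmap size here are illustrative placeholders, and any family of four reasonably independent hash functions would do.

#include <stdint.h>
#include <stdlib.h>

#define NUM_BITMAPS     4
#define BITS_PER_BITMAP (1u << 24)   /* 16M bits each; sized to taste */

typedef struct {
    uint8_t *bitmaps[NUM_BITMAPS];
} state_filter;

/* Four distinct hashes of the state number, one per bitmap. */
static uint32_t hash_state(uint64_t state, int which)
{
    static const uint64_t mults[NUM_BITMAPS] = {
        0x9E3779B97F4A7C15ull, 0xC2B2AE3D27D4EB4Full,
        0x165667B19E3779F9ull, 0xD6E8FEB86659FD93ull
    };
    uint64_t h = (state + 1 + (uint64_t)which) * mults[which];
    h ^= h >> 32;
    return (uint32_t)(h % BITS_PER_BITMAP);
}

state_filter *filter_create(void)
{
    state_filter *f = malloc(sizeof *f);
    for (int i = 0; i < NUM_BITMAPS; i++)
        f->bitmaps[i] = calloc(BITS_PER_BITMAP / 8, 1);
    return f;
}

/* Mark a state as visited: set its bit in ALL the bitmaps. */
void filter_add(state_filter *f, uint64_t state)
{
    for (int i = 0; i < NUM_BITMAPS; i++) {
        uint32_t bit = hash_state(state, i);
        f->bitmaps[i][bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* A state counts as visited only if EVERY bitmap says yes. */
int filter_test(const state_filter *f, uint64_t state)
{
    for (int i = 0; i < NUM_BITMAPS; i++) {
        uint32_t bit = hash_state(state, i);
        if (!(f->bitmaps[i][bit / 8] & (1u << (bit % 8))))
            return 0;   /* definitely not visited */
    }
    return 1;           /* probably visited (could be a false positive) */
}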

For example, let's say 1% of the bits are set in my bitmap. Testing an arbitrary index against that bitmap has a 1 in 100 chance of hitting a set bit. Now let's say I have four such bitmaps, each based on a different hash of the index. Testing an arbitrary index against all four will give a hit on all four approximately 1 in 100^4 times. That's a 1,000,000x improvement. Or, at least, that was my reasoning.
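Put more generally: if a fraction p of the bits are set and the hashes are independent, a never-visited state slips past all k bitmaps with probability about

    p^k

With p = 0.01 and k = 4, that's 0.01^4 = 10^-8, versus 10^-2 for a single bitmap. (This assumes each bitmap carries the same 1% load and the four hashes spread states evenly; correlated hashes would eat into the improvement.)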

It turns out that I was just a hair off from something called a Bloom filter. The Bloom filter works on pretty much the same principle. It goes a step further and says "Why bother having separate bitmaps? Just throw everything in one big bitmap." That simplifies things a bit from what I was doing, and it should perform slightly better for the same amount of storage.
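In Bloom filter form, the sketch above collapses into one shared bitmap. Continuing the same reconstruction (this reuses NUM_BITMAPS and BITS_PER_BITMAP from before, and the caller allocates the single bitmap with calloc(BLOOM_BITS / 8, 1)):

/* Bloom-filter variant: the same four hashes, but all indexing ONE
   bitmap that is four times larger (same total storage as before). */
#define BLOOM_BITS ((uint64_t)NUM_BITMAPS * BITS_PER_BITMAP)

/* Re-mix so each hash ranges over the whole, larger bitmap. */
static uint64_t bloom_hash(uint64_t state, int which)
{
    uint64_t h = (state + 1 + (uint64_t)which) * 0x9E3779B97F4A7C15ull;
    h ^= h >> 29;
    h *= 0xBF58476D1CE4E5B9ull + 2 * (uint64_t)which;  /* stays odd per hash */
    h ^= h >> 32;
    return h % BLOOM_BITS;
}

void bloom_add(uint8_t *bitmap, uint64_t state)
{
    for (int i = 0; i < NUM_BITMAPS; i++) {
        uint64_t bit = bloom_hash(state, i);
        bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

int bloom_test(const uint8_t *bitmap, uint64_t state)
{
    for (int i = 0; i < NUM_BITMAPS; i++) {
        uint64_t bit = bloom_hash(state, i);
        if (!(bitmap[bit / 8] & (1u << (bit % 8))))
            return 0;   /* definitely new */
    }
    return 1;           /* probably seen before */
}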

At any rate, it's a nifty data structure, and I thought I'd share it.

programming, things that do not suck
