Data vs Information: xanify

Aug 03, 2016 19:27

So because I am a nerd, today on my walk home I was ruminating about the difference between data, information, and recommendations.

Probably easiest to illustrate with an example. Let's take the problem of, say, restaurants. You want to go to one. How do you pick?

Data is "here is a list of restaurants, and their locations, and their menus."
Information is "here are the 10 restaurants closest to [some location] that serve [some cuisine]"
A recommendation, on the other hand, is "go to this restaurant".

There are grey areas between these categories, but they're not as grey as one might think. For example, you could sort that list of restaurants by alphabetical order, or by rating (if your data included a rating scale) -- but that's still data, because you haven't truly added anything new. If on the other hand, you sort that list of restaurants by proximity to [some location], then that's information, because you have added something else into the data (the location you want to measure against) and then used that to provide insight. Similarly, filtering the list of restaurants to only display those that serve [some cuisine] is also information.
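To make the three categories concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the restaurant names, the flat (x, y) grid in kilometres, and the helper names `nearest` and `recommend` are all assumptions, and straight-line distance stands in for whatever "closest" really means to you.

```python
from math import hypot

# Hypothetical sample data: name, cuisine, and (x, y) position in km on a flat grid.
RESTAURANTS = [
    {"name": "Aldo's", "cuisine": "italian", "pos": (0.5, 1.0)},
    {"name": "Basilico", "cuisine": "italian", "pos": (3.0, 4.0)},
    {"name": "Chopstix", "cuisine": "chinese", "pos": (0.2, 0.1)},
    {"name": "Dolce", "cuisine": "italian", "pos": (0.3, 0.4)},
]


def nearest(restaurants, here, cuisine, limit=10):
    """Information: filter by cuisine, then sort by straight-line distance from `here`."""
    matches = [r for r in restaurants if r["cuisine"] == cuisine]
    matches.sort(key=lambda r: hypot(r["pos"][0] - here[0], r["pos"][1] - here[1]))
    return matches[:limit]


def recommend(restaurants, here, cuisine):
    """Recommendation: collapse the information down to a single pick (or None)."""
    ranked = nearest(restaurants, here, cuisine, limit=1)
    return ranked[0]["name"] if ranked else None


# Sorting alphabetically adds nothing from outside the list, so this is still data.
by_name = sorted(RESTAURANTS, key=lambda r: r["name"])

# Bringing in a location and a cuisine turns the list into information.
info = nearest(RESTAURANTS, (0.0, 0.0), "italian")

# Picking one answer is a recommendation.
pick = recommend(RESTAURANTS, (0.0, 0.0), "italian")
```

The only difference between `by_name` and `info` is that `info` needs two inputs that aren't in the list itself, which is roughly the line being drawn above.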

Sometimes the demarcation between data and information is domain specific. Take retail sales, for example. Strictly speaking, by this definition, the only true data is "here is a list of every sale we made, and the dollar value, and the date/time it went through" -- but most salespeople don't think this is very useful. So retail sales tends to split data into raw data (here is a list of every sale we made + dollar value + date/time) and aggregate data (between 1:00pm and 1:15pm, we sold X widgets for a total of $Y). It wouldn't make sense to do this with restaurants, but it wouldn't make sense to not do this with retail sales.

(Depending on the scale of the retail store and/or the cussedness of its IT department, they might store both the raw data and the aggregated data, or just the aggregated data, or have bigger or smaller aggregate chunks.)
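As a rough sketch of the raw-vs-aggregate split, assuming raw data is just a list of (timestamp, dollar value) pairs: the sample sales and the names `bucket_start` and `aggregate` are invented for illustration, and this counts sales per chunk rather than widgets, since the raw data described above doesn't include quantities.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw data: one (timestamp, dollar value) pair per sale.
RAW_SALES = [
    (datetime(2016, 8, 3, 13, 2), 10.00),
    (datetime(2016, 8, 3, 13, 14), 5.50),
    (datetime(2016, 8, 3, 13, 15), 20.00),
    (datetime(2016, 8, 3, 13, 44), 4.25),
]


def bucket_start(ts, minutes=15):
    """Round a timestamp down to the start of its chunk (minutes must divide 60)."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def aggregate(sales, minutes=15):
    """Aggregate data: {chunk start: (number of sales, total dollars)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for ts, amount in sales:
        entry = totals[bucket_start(ts, minutes)]
        entry[0] += 1
        entry[1] += amount
    return {start: (count, total) for start, (count, total) in sorted(totals.items())}


AGGREGATED = aggregate(RAW_SALES)
# The 13:00 chunk holds 2 sales totalling $15.50; 13:15 and 13:30 hold one sale each.
```

Changing `minutes` gives bigger or smaller chunks, and once you throw away `RAW_SALES` you can't get the individual sales back from `AGGREGATED`.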

And information, in the retail sales world, would be "here is a bar chart of your sales dollars and volumes per hour today," or "here is a graph of hourly sales volumes vs when floor staff are scheduled," or if it's a chain, "here are all stores ranked by sales volume per square foot," -- in other words, it tells you something. It adds value/insight/meaning to the data, in order to answer a question ("what hour of the day is the busiest?") or aid in a decision ("which stores should I invest more resources in?").

(You might notice that the bar chart of sales dollars and volumes could, strictly speaking, be classified as highly aggregated data -- this is another grey area that isn't really so grey, because the method of presentation adds insight that can be immediately used, and that makes it information. Similarly, in the restaurants example, if your list of restaurants included customer ratings, then displaying the list sorted by aggregate customer rating* would be information too because that's what people are interested in. Displaying the restaurant list by alphabetical order, on the other hand, usually isn't so meaningful.)

There is a lot of science behind turning data into information, and a bit of art as well -- there are an infinite number of ways one can manipulate data, but not all of them are meaningful. (Also, you can totally create misleading information on top of technically correct data: a ranked list of all stores by sales dollars alone might lead you to think those are your top-performing stores, but doesn't take into account expenditures, etc.) Also, your information is only as good as your data -- how do you know the data you are using is complete and accurate? Maybe there are restaurants that opened recently that aren't on your list; maybe you've got the opening hours wrong; maybe some have closed down. Maybe one cash register isn't hooked up to your data-gathering server. Data quality is an entire field by itself.

There's also a cost in turning data into information, or even turning raw data into aggregate data. The easiest example of this is retail sales, again -- if aggregating your raw data into 15-minute chunks takes more than 15 minutes, you are going to have a bad time. Or maybe crunching through your restaurants list to find the best-rated one takes an hour (I dunno, you're using a 20-year-old computer, maybe) and you're starving. The cost is why, sometimes, making a decision by randomly selecting something from your raw data is a legit strategy. (The other reasons to do random selection are if your data is so bad that no meaningful information can be derived, like if your list of restaurants has 5 names on it and nothing else, or if your data is very uniform, like if you're choosing between 10 grey widgets that cost $10 each, where it really doesn't matter which one you pick.)

... why am I thinking about this, you might ask? Because most people don't realise most (or any) of this!

Sometimes it's just an inherent problem with language. A question like, "Where do you think we should go for dinner?" really has three different possible answers -- a recommendation ("Let's go to that restaurant on the corner"), information ("If you like Italian food, my three favourite places are..."), or data ("Well, I've been to these places ..."). And it is never clear from the question which one the asker wants and/or is more comfortable with. Most of the time people want a recommendation and information ("I like these five restaurants, and we should go to this one because it's got good desserts"). Sometimes people actually do want raw data -- this happens generally when (I believe) they don't trust your data and want to check it themselves, or they don't trust your information, or they just plain like data. And sometimes what the asker wants is not the same as what they're comfortable with. I know people who are not comfortable with decision making, so they really should only be given recommendations, but they also want to feel like they're making the decision so they want information, but then they annoy the person they're asking by asking more permutations of the question and never actually choosing a place to eat.

And sometimes people don't want an answer at all -- this tends to happen more with domains that are highly fraught (like money) than with things that don't really matter. This is ... honestly I get it, but it is a deeply dangerous mentality to have. It's one thing to not seek out data/information because you've decided it's irrelevant (I do not know anything about, like, curtains, but I also don't care about curtains) -- it's another to deliberately avoid information that is important**.

*Aggregate customer rating (as opposed to an arbitrary single number that you-the-list-author came up with) is usually some weighted combination of average rating and number of reviews and the helpfulness/quality of said reviews -- like, is the 4.5-star restaurant with 500 ratings better, or the 5-star with 3 ratings? This is a conundrum with no clear answer.
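One common way to attack the 4.5-star-vs-5-star question is a Bayesian average, which pulls every restaurant's rating toward a prior and lets the pull weaken as reviews pile up. This is only one option among many and it ignores review helpfulness entirely; the prior mean of 3.5 stars, the prior weight of 10 reviews, and the function name `bayesian_average` are all assumptions chosen for illustration.

```python
def bayesian_average(avg_rating, num_ratings, prior_mean=3.5, prior_weight=10):
    """Shrink an average rating toward `prior_mean`.

    `prior_weight` acts like that many imaginary reviews at `prior_mean`,
    so restaurants with few real reviews stay close to the prior.
    Both defaults are arbitrary assumptions, not established values.
    """
    total_stars = avg_rating * num_ratings
    return (prior_weight * prior_mean + total_stars) / (prior_weight + num_ratings)


well_reviewed = bayesian_average(4.5, 500)  # about 4.48
barely_reviewed = bayesian_average(5.0, 3)  # about 3.85
```

With these particular assumptions the 4.5-star place with 500 ratings comes out ahead, but a different prior weight changes how quickly a newcomer can catch up, which is sort of the whole conundrum.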

**What makes information important? That is probably another huge rambling post by itself.

brb nerding forever