Data vs. Algorithms

Feb 25, 2013 17:53


Originally published at Konstantin's Private Blog.

Last time I mentioned that applying structural econometric models to datasets with millions of records is prohibitively time-consuming in most cases. This brings me to the "chicken-and-egg" problem of empirical work: which is more important, data or algorithms (models)?

Though not everyone agrees, in my experience simpler models fit to lots of data usually trump more sophisticated models fit to smaller datasets. There is an interesting note on this question by Google researchers with the self-explanatory title "The Unreasonable Effectiveness of Data". There is also an hour-long video by Peter Norvig on the same topic.

This has major implications for the modeling workflow. First, simpler models are easier to inspect for errors. Do not get me wrong: error-hunting itself can go astray in many ways. My favorite example is detecting outliers in explanatory variables by inspecting model residuals. A single large outlier can exert so much leverage on the regression line that its own residual ends up being quite small. In that situation, trimming observations with large residuals removes the wrong points and can hurt the model even more. But in general, anomalies are easier to detect when the model being fit is simple and transparent.
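A minimal numpy-only sketch of the leverage effect, using made-up data (the dataset, coefficients, and the outlier's position are all illustrative assumptions, not anything from a real application):

```python
import numpy as np

rng = np.random.default_rng(0)

# Well-behaved data: y ≈ 2x + noise, with x confined to [0, 1]
x = rng.uniform(0, 1, 50)
y = 2 * x + rng.normal(0, 0.2, 50)

# One outlier in the explanatory variable: extreme x, and a y value
# far from what the true line (y = 2x, i.e. y = 20 here) would predict
x = np.append(x, 10.0)
y = np.append(y, 10.0)

# Ordinary least squares fit with an intercept
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# The high-leverage point drags the fitted line toward itself,
# so its own residual is small compared to the honest observations'
print("fitted slope:", beta[1])          # pulled well below the true 2
print("outlier residual:", abs(resid[-1]))
print("max clean residual:", np.abs(resid[:-1]).max())
```

Trimming by residual size here would discard legitimate observations while keeping the very point that distorted the fit, which is exactly the failure mode described above.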

Second, the sad truth about empirical work is that clever use of explanatory variables and their transformations almost always yields a significant payoff. Machine learning practitioners call this "feature selection and engineering", but the main idea is the same. There are no reliable rules for which data representation will work best in a given application; the process involves a lot of trial and error. A simpler model lets me iterate over data representations more quickly and discard non-working ideas faster.
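The trial-and-error loop can be as plain as scoring several transformations of the same raw variable with a cheap one-feature fit. This is a hypothetical sketch (the data-generating process and the candidate transformations are my own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data where the response depends on log(x), not x itself
x = rng.uniform(1, 100, 200)
y = 3 * np.log(x) + rng.normal(0, 0.5, 200)

def r_squared(feature, y):
    """R^2 of a one-feature OLS fit: a quick score for a representation."""
    X = np.column_stack([np.ones_like(feature), feature])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Try several candidate representations of the same raw variable
candidates = {"raw": x, "log": np.log(x), "sqrt": np.sqrt(x)}
scores = {name: r_squared(f, y) for name, f in candidates.items()}
print(scores)
```

With a simple, fast-fitting model, each candidate costs seconds to evaluate, so dead ends are cheap to discard; with a heavyweight model, every such experiment becomes an expensive commitment.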

Finally, simpler models are easier to maintain. Unlike models in academic papers, which rarely, if ever, get reused, the models I build in practice serve a specific ongoing purpose. Suppose I build a convoluted model that predicts customer behavior with impressive accuracy. Such a model would almost invariably include many tuning parameters that need careful calibration. Should I switch jobs at some point, whoever gets tasked with maintaining the model is going to have a hard time delivering results, and most likely the model would cease being useful soon after my departure. That is not a particularly appealing scenario from an employer's point of view.