two cultures my mule

Mar 27, 2024 19:40

So i found a spare hour to slowly read the famous "Statistical Modeling: The Two Cultures", a 23-year-old cornerstone of a certain uncertainty mongering. Highly recommended reading. Looking over it, my immediate association is a russian proverb, for which i do not know a proper english counterpart: "Force a blockhead to bow to gods, and he will crack his skull open."
The main theme is: the classical XX century fox statisticians are addicted to simplistic data models (many disadvantages), and there is a new rising tide of algorithmic models (many advantages), which are structure-agnostic, young, dynamic, practice-oriented, down-to-earth, [add your own, using the ever-blooming imagination of a hair-greying boomer in a desperate attempt to defy aging].

It would be instructive to recall how the domain of what is today cursed as "data science" looked back then. The machine learning community was still struggling to prove its virtue, seeing itself as a kind of half-breed between classical statistics and computer science/applied maths. It was already clear that they were the champions of accurate prediction, but the actual merit of their successes was not yet fully grasped (is it today, for all the consequences, good and bad?).

Well, fast-forward 20 years, and here we stand - deep models all the way down, and nobody cares about their maths or statistics (i know, there is some serious research there; i've read and listened to some of it - it makes me cry, still waiting for someone to dehydrate my tears).

It is also funny to see how some things change, and some never change. The author complains that classical data models cannot cope with the challenges of the emerging big-data problems, while the old-timers respond by complicating their models, and further notes: "Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over". Wouldn't that give you a cute smile today? And so would the passing remark that SVMs outperform neural nets (ah...). Yet, he is pretty perceptive about his field: Vapnik's then brand-new theory is mentioned, the typical troubles are referred to (model instability - a.k.a. sensitivity to randomness in the data, too many variables to account for/select from, influence and inference lost). Most of those troubles still haunt the field today, even to a higher degree. And how contemporary is the following citation, from the mouth of the man who invented Random Forests: "There is an effective method for forming ensembles known as “boosting,” but there isn’t any finite sample size theory that tells us why it works so well"?
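That "model instability" is easy to make concrete with a toy sketch (mine, not Breiman's): take two nearly collinear predictors, greedily select the single "best" one by correlation with the response, and watch the choice flip across bootstrap resamples of the very same dataset.

```python
import random

random.seed(0)

# Two nearly collinear predictors, and a response driven by their sum.
n = 40
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.05) for a in x1]   # x2 is almost a copy of x1
y = [a + b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

def corr(u, v):
    """Pearson correlation, hand-rolled to keep the sketch dependency-free."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def best_single_predictor(rows):
    """Greedy selection: keep whichever predictor correlates more with y."""
    xs1 = [r[0] for r in rows]
    xs2 = [r[1] for r in rows]
    ys = [r[2] for r in rows]
    return "x1" if abs(corr(xs1, ys)) >= abs(corr(xs2, ys)) else "x2"

data = list(zip(x1, x2, y))
picks = []
for _ in range(20):
    resample = [random.choice(data) for _ in range(n)]  # bootstrap resample
    picks.append(best_single_predictor(resample))

print("x1 chosen:", picks.count("x1"), "| x2 chosen:", picks.count("x2"))
```

The selected "model" jitters between x1 and x2 from resample to resample, even though the two fitted models predict almost identically - which is exactly why instability hurts inference much more than it hurts prediction.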

It is of course mind-blowing how he overlooks the third way, the one all other natural sciences go by: understand the process and encode it in the model; then let statistics help. Well, it's a sort of "feature, not bug" - statisticians/data analysts are proud of their agnosticism; if you know your data-generator well, you are not a statistician anymore, you're a mere chemical engineer or whatever. This stupid pride-barrier is what forever slows down scientific progress most of all, as if these hedges were god-given. Well, surprise, they were erected by your forefathers, and if you're as creative and revolutionary as you self-advertise, maybe take a peek over them. The best of us sometimes do. (Much of this criticism is already present in the first comment attached to the article; it sounds a bit old-fashioned and outdated, but in a good old english style; it is actually impressively fundamental: "Better a rough answer to the right question than an exact answer to the wrong question", "Presumably bootstrap and cross-validation ideas may give here a quite misleading illusion of stability".)

To sum up: the prophecy has over-fulfilled itself. The scales tipped over, and today algorithms rule and data models shy away. Some would say "this did us much good", others: "much good this did us". I would like to conclude on a heavier note, as usual, pointing out an involuntary metaphor the author produced as a by-product. One of the concepts (a multitude of equally valid models) is nicknamed "Rashomon", after "a wonderful Japanese movie in which four people, from different vantage points, witness ..." etc. The fact that the text is oblivious of Akutagawa's literary source ('In a Grove') and of the title hijacking made by Kurosawa provides a sad allegory of the don't-look-back and never-mind-the-details state of the field at present.