Software: Stata

Apr 22, 2013 07:57



In the previous post, I discussed SAS in detail. Now I will turn to my favorite package, Stata. Five years ago, nobody would have seriously considered it an option for dealing with large datasets. Today, it is probably one of the most effective tools of which I am aware.

Let’s get the bad news out of the way first: Stata is, by design, memory-bound. This means that if your machine has 16 GB of RAM, you realistically cannot work with datasets larger than about 15 GB. (Stata itself needs very little RAM, but your operating system needs some as well.) To make matters worse, Stata does not support distributed computation. It was conceived and developed in the 80s and 90s, when “big data” challenges could be addressed by getting a more powerful host. Third, Stata can only operate on a single dataset at any given moment. Think of a single Excel sheet whose size is limited only by the available RAM. Finally, not unlike SAS, Stata has its own unique programming language, which is quite quirky and can be tricky to master.
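For concreteness, here is a rough sketch of how I keep an eye on the memory constraint in practice; the 14g cap is just an illustrative number for a 16 GB machine, and set max_memory requires Stata 12 or later:

    * Report how much RAM Stata and the data currently in memory are using
    memory

    * Cap the amount of RAM Stata may claim, leaving headroom for the OS
    * (illustrative value for a 16 GB machine; requires Stata 12 or later)
    set max_memory 14g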

Given these formidable drawbacks, why would anyone even bother considering Stata? Because it is otherwise made of pure awesomeness, that’s why. (Ok, I may be exaggerating a bit here, but bear with me.) Most skilled empiricists know the sad truth of working with real data: about eighty percent of the time on every project is spent manipulating the data and wrestling it into a format suitable for analysis. I have yet to find a more efficient tool than Stata for these tasks. Assuming the relevant data fits into RAM, once it is loaded, Stata is blazingly fast at cutting and slicing it. Whenever I open a new dataset, it typically takes me less than five minutes to identify any potential problems that crept in during the data construction phase. This makes Stata ideal for prototyping new solutions: at the early stages of any project, it is important to fail quickly. Whatever takes hours to do in SAS usually takes minutes in Stata.
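To give a flavor of what those first five minutes look like, here is a sketch of a typical sanity check; the file name firm_panel.dta and the identifiers firm_id and year are placeholders, not a real dataset:

    * Load the dataset and get a quick overview of its structure
    use firm_panel.dta, clear
    describe
    codebook, compact

    * Make sure the identifiers behave as expected
    duplicates report firm_id year
    assert !missing(firm_id)

    * Eyeball the distributions for obviously broken values
    summarize, detail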

I already mentioned that Stata’s internal language is quirky. If you stick with it long enough to get over these quirks, however, you will come to appreciate its flexibility and pithiness. In particular, it offers an uncanny level of versatility when it comes to writing loops. There is effectively no difference between looping over variable values, variable names, strings, or numbers, which makes for exceedingly compact code. Whenever I use R, SAS, Matlab, or virtually anything other than Stata, I resent not having this kind of versatility at my disposal.
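A rough illustration of what I mean, with made-up variable and value names; notice how little the syntax changes as the thing being looped over changes:

    * Loop over variable names
    foreach v of varlist price weight length {
        summarize `v'
    }

    * Loop over arbitrary strings
    foreach grp in treatment control {
        display "processing group: `grp'"
    }

    * Loop over a range of numbers
    forvalues yr = 2000/2010 {
        count if year == `yr'
    }

    * Loop over the distinct values of a variable
    levelsof industry, local(codes)
    foreach c of local codes {
        summarize sales if industry == `c'
    }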

In addition, Stata was written by economists and, primarily, for economists. A number of estimation methods and routines specific to economics have been implemented only in Stata, most of them related to panel data analysis. People have conflicting opinions on whether it is best to “code up” all the methods you use from scratch. My take on this is simple: speed matters, so whatever enables rapid exploration is good. Once you find something that you think is working, re-implementing the method from scratch is a perfectly reasonable way to add extra robustness.
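As one example of what comes built in, a standard fixed-effects panel regression is two lines; the variable names here (firm_id, year, log_sales, and so on) are placeholders:

    * Declare the panel structure: cross-sectional unit and time variable
    xtset firm_id year

    * Within (fixed-effects) estimator with standard errors clustered by firm
    xtreg log_sales log_capital log_labor, fe vce(cluster firm_id)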

I will conclude by noting that a perpetual license for a very powerful implementation of Stata costs under ten thousand dollars, and one can get a less powerful version for about a thousand. This is my default tool for exploratory data analysis, and I personally find it indispensable.