Physical activity data is WEIRD!

Oct 30, 2012 15:56

When you do an undergrad degree in stats, the progression goes something like this:
  1. Learn about the normal distribution and simple linear regression. Assume everything is independent.
  2. Learn about some more distributions. Learn about ANOVA.
  3. Learn multivariate regression. Maybe something about random effects. Maybe something about nonlinearity. Maybe something about mixture models and the EM algorithm.
  4. Learn how to fit multivariate regressions to those distributions you learned all about in 2. Learn to deal with dependence among variables. Learn about robust procedures. Generally, learn how to deal with violations of the simple models you learned about in 1.

In my biostatistics work, I fit a lot of GLMs, usually logistic regressions or count models. These days people sometimes ask me to fit a mixed model for repeated observations, and maybe to include a correlation structure. That had all been working pretty well up until last week, so I hadn't needed anything beyond step 4 above.

It turns out that the data I'm dealing with (total physical activity in minutes per week) is something like: zero-inflated overdispersed count data, longitudinally observed. Or if you prefer, a mixture of zeroes and overdispersed count data, longitudinally observed. There are a lot of sedentary people out there, and of the people who do exercise, most of them don't exercise very much. The number of people exercising drops exponentially with the number of minutes.
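To make that shape concrete, here is a minimal simulation sketch. It assumes a zero-inflated geometric model, which is my stand-in for "a mixture of zeroes and overdispersed counts" and not something fitted to the real data. The function name `simulate_minutes`, the 40% sedentary share and the 120-minute mean are all made-up illustration values, and the longitudinal structure is ignored.

```python
import math
import random

def simulate_minutes(n, p_sedentary=0.4, mean_active=120.0, seed=1):
    """Draw n weekly activity totals from a zero-inflated geometric model.

    All parameter values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    # A geometric distribution on {0, 1, 2, ...} with success probability q
    # has mean (1 - q) / q, so this q gives the active group mean_active.
    q = 1.0 / (1.0 + mean_active)
    minutes = []
    for _ in range(n):
        if rng.random() < p_sedentary:
            minutes.append(0)  # structural zero: a sedentary person
        else:
            u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
            # Inverse-CDF draw: counts fall off exponentially with minutes.
            minutes.append(int(math.log(u) / math.log(1.0 - q)))
    return minutes

minutes = simulate_minutes(10_000)
n = len(minutes)
mean = sum(minutes) / n
variance = sum((m - mean) ** 2 for m in minutes) / (n - 1)
share_zero = sum(m == 0 for m in minutes) / n

print(f"share of zeros: {share_zero:.2f}")
print(f"mean: {mean:.1f}, variance: {variance:.1f}")
```

The variance comes out far larger than the mean, which is the overdispersion, and the spike at zero sits on top of that.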

Fitting standard linear models to this sort of data will tell you something, but the standard errors are going to be massively inflated, because of the high proportion of observations at 0 and the long fat tail of higher values, none of which sit anywhere near the mean. You might be tempted to get yourself out of trouble by log transforming the response (log(y + 1) in practice, since log(0) is undefined), which deals with the fat tail and the asymmetry but doesn't get rid of your mass of observations at 0. You've still got two groups of observations to worry about, and you have to deal with that somehow.
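A quick way to see the problem with the log transform is to try it on a toy sample. This is only a sketch: the `minutes` values below are invented, not real data, and `skewness` is a small helper written for this example.

```python
import math

def skewness(xs):
    """Moment-based sample skewness: third central moment over sd cubed."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Invented weekly totals: a block of sedentary people plus a long right tail.
minutes = [0, 0, 0, 0, 0, 0, 10, 20, 30, 45, 60, 90, 150, 300, 600]

# log1p(y) = log(1 + y), so every zero maps to exactly zero.
logged = [math.log1p(m) for m in minutes]

zeros_before = sum(m == 0 for m in minutes)
zeros_after = sum(v == 0.0 for v in logged)

print(f"skewness raw: {skewness(minutes):.2f}, logged: {skewness(logged):.2f}")
print(f"zeros raw: {zeros_before}, logged: {zeros_after}")
```

The transform pulls the tail in, so the skewness shrinks a lot, but the count of observations sitting at exactly zero is unchanged.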

It turns out that finding an R package to fit this sort of model isn't all that easy. lme doesn't support GLMs. glmer doesn't support zero-inflated models. glmmADMB would fit them, but wouldn't converge for me. I was half-way through coding my own "perfect" model up in JAGS when I found MCMCglmm, which finally worked. But it won't produce predictions or residuals for this model, so that's not ideal either. One thing I noticed with the JAGS model is that it was very sensitive to my choice of starting values, which might also be true of some of the other methods that wouldn't converge.