Wiki Analytics: part 1 of n

Jul 14, 2009 22:13

A very hasty note to share some of what I've been up to - then, to bed.

A while ago, Eugene Kim included me in an ongoing conversation he's been having with lots of people about what it means for a wiki to be "healthy". He thought I might be interested in bringing a statistical/numerical/machine learning approach to the problem. This seemed valuable for all the standard reasons, the most important being that the computer might think of something a human hadn't.

As I pointed out on the Wiki Analytics wiki, numerical methods are useful inasmuch as they're tools for exploration and education. In a field as richly language-based as wikis (or collaboration tools in general), explanatory power becomes critical. So clustering and classification can tell us what we should look at, but it's probably not ideal to use them as a magic box. Even that caveat comes with its own caveat, though: sometimes a magic box is just what you want.

Eugene got me a sample data set, and I've been playing around with it a little. I sort of got lost in the weeds for a while, in a long digression that ended with "A Tutorial on Principal Components Analysis", by Lindsay I. Smith. It was really handy: it ended my thrashing around and let me finish my (incredibly off-topic) digression successfully.

For what it's worth, I now have a Python library that can do full PCA transforms (though not variance analysis after the fact), or simply output a list of the first N principal components (N chosen by the user).
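To give a sense of what such a library involves, here is a minimal sketch using NumPy that follows the covariance-and-eigenvectors recipe from Smith's tutorial. It is not the library described above: the names `pca_fit` and `pca_transform` and the toy data are made up for illustration.

```python
import numpy as np


def pca_fit(data, n_components=None):
    """Return (mean, components) for data of shape (n_samples, n_features).

    Components are returned as rows, ordered by decreasing variance.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    centered = data - mean
    # Covariance between columns (features); atleast_2d covers the 1-column case.
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order].T
    if n_components is not None:
        components = components[:n_components]
    return mean, components


def pca_transform(data, mean, components):
    """Project data onto the given principal components."""
    return (np.asarray(data, dtype=float) - mean) @ components.T


# Toy data (hypothetical): two columns that lie almost on the line y = 2x.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
toy = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=200)])

mean, components = pca_fit(toy, n_components=1)
projected = pca_transform(toy, mean, components)
```

Passing `n_components=None` keeps every component, which corresponds to the full transform; passing a number keeps only the first N. The sign of each component is arbitrary, so two runs on different platforms may return vectors that point in opposite directions.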

Using the MDP library for Python in IPython, I determined (I think) that very close to 100% of the variation in this sample data is in three components, with something like 98% of it in the first.
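The "how much variation is in each component" figure can be computed from the eigenvalues of the covariance matrix. What follows is a rough sketch with NumPy rather than the MDP calls actually used; the function name `explained_variance_ratio` and the synthetic columns are assumptions for illustration, not the sample data.

```python
import numpy as np


def explained_variance_ratio(data):
    """Fraction of total variance carried by each principal component,
    largest first, for data of shape (n_samples, n_features)."""
    data = np.asarray(data, dtype=float)
    # np.cov centers the columns itself.
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    # eigvalsh returns ascending eigenvalues; reverse for largest first.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Guard against tiny negative values caused by floating-point error.
    eigvals = np.clip(eigvals, 0.0, None)
    return eigvals / eigvals.sum()


# Synthetic, independent columns on very different scales (hypothetical data).
rng = np.random.default_rng(0)
n = 300
synthetic = np.column_stack([
    rng.normal(0.0, 100.0, n),
    rng.normal(0.0, 10.0, n),
    rng.normal(0.0, 1.0, n),
    rng.normal(0.0, 0.1, n),
])

ratio = explained_variance_ratio(synthetic)
cumulative = np.cumsum(ratio)
print(ratio)
print(cumulative)
```

One thing this toy example makes visible: when the columns are on very different scales, the first component is dominated by whichever column has the largest raw variance. Whether that applies to the sample data is something to check, for instance by standardizing each column before running PCA and seeing whether the split changes.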

Using my library (because I couldn't figure out how to do what I wanted with MDP, hence the digression), I determined that human intuition is spot on. The first three principal components of the data set are (in order of importance):
* RecentChanges
* Views
* Edits
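Strictly speaking, a principal component is a weighted combination of all the original columns, so naming a component after one column means that column carries most of the weight. A minimal sketch of how that reading can be made, assuming NumPy; the function `dominant_features`, the extra `Other` column, and the numbers are all invented for illustration and are not the sample data:

```python
import numpy as np


def dominant_features(data, names, n_components=3):
    """For each of the first n_components principal components, return
    (column name, absolute weight) for the most heavily weighted column."""
    data = np.asarray(data, dtype=float)
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Indices of the largest eigenvalues, largest first.
    order = np.argsort(eigvals)[::-1][:n_components]
    result = []
    for idx in order:
        weights = eigvecs[:, idx]  # unit-length vector of column weights
        top = int(np.argmax(np.abs(weights)))
        result.append((names[top], float(abs(weights[top]))))
    return result


# Invented data: independent columns with decreasing spread.
rng = np.random.default_rng(1)
n = 500
names = ["RecentChanges", "Views", "Edits", "Other"]
fake = np.column_stack([
    rng.normal(0.0, 100.0, n),
    rng.normal(0.0, 10.0, n),
    rng.normal(0.0, 1.0, n),
    rng.normal(0.0, 0.1, n),
])

for name, weight in dominant_features(fake, names):
    print(name, round(weight, 3))
```

A weight close to 1 means the component is nearly the same thing as that single column; weights spread across several columns would mean the component is a genuine mixture and deserves a closer look.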

Now, whether this is sufficient to enable k-means or something else relatively simple to *usefully* classify the data remains to be seen. More later.

math, wikis, software, python, programming
