Following up on a conversation with
Luke Closs, I started thinking about measuring Wiki health out in the part of the world that doesn't keep track of RecentChanges. He pointed out that even in the RecentChanges world, as Wiki use becomes heavier and people integrate them more into their routines, RecentChanges use tends to drop off. People who are actually viewing and editing things just view and edit things - they know which parts have changed recently, because they're paying attention. Luke suggested that some combination of Views and Edits might be just as predictive of Wiki classification as RecentChanges - or perhaps useful in a different way.
So on Tuesday I sat down to figure it out.
I modified my previous work so that the k-means clustering takes place both with and without the RecentChanges column in the data. I ran 1000 clustering trials, with 150 iterations for settling in each trial. By the time this is posted, the code should be on GitHub. The results for the smallest cluster in each trial are compiled into sums, and the sums saved to output files. This gives me the identifier for every wiki that has appeared in the smallest cluster of a given trial, and how many times it has been included. Intuitively, this should yield the set of the most similarly classified wikis across trials, and if the counts are similar between the two conditions (with and without RecentChanges), then we can expect that RecentChanges, while capturing variance well, isn't necessary to predict wiki classification.
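The loop can be sketched roughly like this, with scikit-learn's KMeans standing in for my implementation. The column names, the cluster count k=3, the synthetic data, and the reduced trial count are all placeholders, not the real experiment:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in data: rows are wikis, columns are usage features.
columns = ["Views", "Edits", "RecentChanges"]  # hypothetical feature names
X = rng.lognormal(mean=3, sigma=1, size=(200, len(columns)))

def smallest_cluster_counts(X, n_trials, n_iter=150, k=3):
    """Count how often each wiki row lands in the smallest cluster."""
    counts = Counter()
    for trial in range(n_trials):
        km = KMeans(n_clusters=k, n_init=1, max_iter=n_iter,
                    random_state=trial).fit(X)
        sizes = np.bincount(km.labels_, minlength=k)
        smallest = int(np.argmin(sizes))
        # Record every wiki assigned to the smallest cluster this trial.
        counts.update(np.flatnonzero(km.labels_ == smallest).tolist())
    return counts

# 50 trials here just for illustration; the real runs used 1000.
with_rc = smallest_cluster_counts(X, n_trials=50)        # all columns
without_rc = smallest_cluster_counts(X[:, :2], n_trials=50)  # drop RecentChanges
```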
And so it is. The smallest-cluster counts from the two conditions were correlated at 0.99; I consider it reasonable to work on wiki classification without recourse to RecentChanges.
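For concreteness, the comparison amounts to something like the following, with made-up count dictionaries standing in for the real output files:

```python
import numpy as np

# Hypothetical wiki-id -> smallest-cluster count, one dict per condition.
with_rc    = {0: 48, 1: 50, 2: 3, 3: 47, 4: 2}
without_rc = {0: 50, 1: 49, 2: 4, 3: 46, 4: 1}

# Align the counts into vectors over the union of wiki ids,
# then take the Pearson correlation.
ids = sorted(set(with_rc) | set(without_rc))
a = np.array([with_rc.get(i, 0) for i in ids], dtype=float)
b = np.array([without_rc.get(i, 0) for i in ids], dtype=float)
r = np.corrcoef(a, b)[0, 1]
print(f"correlation: {r:.3f}")
```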
This has the following immediately obvious consequences:
- We can project the two remaining principal components into a 2D display space, so we can make nice scatter plots of the data and see if we can see the clusters ourselves.
- All this stuff I'm doing is germane to the wider Wiki world where RecentChanges aren't available.
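The first point can be sketched with scikit-learn's PCA. I'm assuming here that the remaining features are Views and Edits, and the data is synthetic; the log-scaling is my own judgment call for heavy-tailed usage counts, not part of the original analysis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.lognormal(mean=3, sigma=1, size=(200, 2))  # Views, Edits stand-ins

# Log-scale, then project onto the two principal components.
coords = PCA(n_components=2).fit_transform(np.log1p(X))
# coords[:, 0] and coords[:, 1] become the x and y of the scatter plot,
# e.g. plt.scatter(coords[:, 0], coords[:, 1]).
```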
There are probably other effects as well, but I haven't seen them shake out yet.
In playing around with this, I generated some additional questions. I'm hoping that conversations with wiki pros can clear some of them up:
- I would expect Views to be artificially inflated, because users go to a page, see it, *then* edit it. Is it safe to assume that Edits can be factored out of Views, for example "Adjusted Views" = Views - Edits? And if my intuition that we can is reasonable, what does it mean when something has a negative Adjusted Views? Spambot?
- If it indicates a spambot, I think I've discovered a cheap way to detect spambotting.
- Do the unsupervised clusters actually reflect reality? Can I get some set of experts to provide classifications for, say, a hundred wikis, and then run a supervised method and compare the results?
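The Adjusted Views check from the first question is cheap to sketch. All the numbers here are invented:

```python
# Hypothetical per-wiki totals.
views = {"wiki-a": 1200, "wiki-b": 40, "wiki-c": 15}
edits = {"wiki-a": 90,   "wiki-b": 55, "wiki-c": 3}

# Adjusted Views = Views - Edits; a negative value means more edits
# than views, which is the suspected spambot signature.
adjusted = {w: views[w] - edits[w] for w in views}
suspected_spambots = [w for w, v in adjusted.items() if v < 0]
print(adjusted)            # {'wiki-a': 1110, 'wiki-b': -15, 'wiki-c': 12}
print(suspected_spambots)  # ['wiki-b']
```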