In my
ongoing discussion with
eekim about the
Wiki Analytics project, the following question came up:
Question about PCA: Are the number of principal components related to the number of clusters? In other words, if you ask for 10 clusters, will you get 9 principal components?
What follows is my answer, lightly edited.
That's a good question. I don't think that it is.
The principal components are the axes along which, if you projected the data from a higher-dimensional space to a lower-dimensional space, you'd lose the least information. You almost always lose some unless you keep every dimension in the data. Generally projecting down to two axes makes sense, because we can see 2-D. In this data set, though, a very large portion of the variance is captured by the first principal component (RecentChanges).
Imagine projecting this really high dimensional space down to dots along a single line. That's what using one principal component does. You can imagine that the dots wouldn't be evenly spaced - they'll be lumpy along the line. The lumpiness is what's being captured (in a vague, stochastic way) by the clustering algorithm. The shape of the lumpiness (what elements end up in a given cluster) varies depending on how many dimensions are in the data set, and how much variation is in each dimension, and it varies depending on some degree of randomness. But essentially, they're independent. You could have a single line with two or ten distinct groups of lumps.
k-means is different from a kernel machine (like
Ed Chi used) in this important way. k-means randomly selects some number of points in the space (you can choose the probability distribution) and then does geometric proximity to those cluster centroids for every data point. A kernel machine would actually do crosswise distance estimation for the entire data set and find the hyperplanes (or lines, if you project down to one principal component) that provide maximum separability.
Actually come to think of it, that gives me an idea about explaining this stuff:
Knowing that the data is so well described in a single dimension makes computing partitions of maximum separability both easy and cheap. You just move along the line, calculating the distance from one point to its nearest neighbor. Build a table, and find the biggest gaps, and call those the edges of your binning. You're enforcing a deterministic ordering on the scalar. In two dimensions and higher, you're back to doing pairwise comparison across the dataset - something that looks like a very simple
SVM. Perhaps this gives a useful way to begin to understand what SVMs are doing.