May 21, 2007 15:21
Note to self: NP-complete problems do not make good class projects. Also, deciding to test the NP-complete algorithm on two different sample sizes does not make it better.
My sublinear time algorithms (6.896) final project is just that. Clustering data points (partitioning the data into k clusters of similar points) is an NP-hard problem, although people have developed many approximation algorithms which do quite well. In my case, I'm working with text from Wikipedia articles. I build a matrix where the rows are articles and the columns are words (one column per word, over all words seen in all documents). From this, I can compare any pair of articles using a similarity metric (in this case, cosine similarity: the cosine of the angle between the two row vectors).
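The similarity metric itself is simple enough to sketch. Here's a minimal pure-Python version over plain word-count lists (the function name is mine, and a real implementation would use sparse vectors, since most articles share few words):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word-count vectors:
    dot(u, v) / (|u| * |v|). Ranges from 0 (no shared words,
    for nonnegative counts) to 1 (identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)
```

Identical articles score 1.0; articles with no words in common score 0.0.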
So basically, what I end up doing is taking in some data and a number of clusters k, then trying all possible sets of cluster centers (a center being the point within a cluster which is the "average" of the cluster, or the most representative element, or the element which minimizes the total distance to the rest of the elements in the cluster). So essentially, I need to try comb(n,k) = n!/[k!(n-k)!] candidate sets of centers.
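That exhaustive search can be sketched in a few lines; this is a generic brute-force k-center/k-median search, not my exact code, and the distance function is passed in as a parameter:

```python
from itertools import combinations

def brute_force_centers(points, k, distance):
    """Try every size-k subset of points as candidate centers and keep
    the subset minimizing the total distance from each point to its
    nearest center -- comb(n, k) candidate sets in all."""
    best_cost, best_centers = float("inf"), None
    for centers in combinations(points, k):
        cost = sum(min(distance(p, c) for c in centers) for p in points)
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_centers, best_cost
```

On 1-D toy data it does the right thing, but the outer loop runs comb(n, k) times, which is exactly the problem.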
The point of this project is that we can cluster a small fraction of the points and find those cluster centers, and there's a nice proof showing that the centers found on the small sample are no more than some bounded distance x from the centers you would get by clustering the whole data set. So theoretically, it's great. In practice, I'm having trouble clustering even a small fraction of the data.
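The overall sample-then-cluster shape looks something like this sketch; cluster_fn stands in for whatever expensive clustering routine runs on the small sample, and all the names here are my own, not from the project:

```python
import random

def sample_and_cluster(points, k, sample_frac, cluster_fn, seed=0):
    """Cluster only a random fraction of the points; the guarantee is
    that centers found on the sample land close to the centers you
    would get from clustering the full data set."""
    rng = random.Random(seed)
    m = max(k, int(len(points) * sample_frac))  # at least k points
    sample = rng.sample(points, m)
    return cluster_fn(sample, k)
```

The win is that the expensive clustering step only ever sees m = sample_frac * n points instead of all n.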
I'm not really sure why this takes as long as it does (that's one thing I'm trying to figure out). Also, I'm not really sure why CLUTO (a clustering toolkit) can do it so fast. Nor am I sure what I'm going to do to finish this project. Too many open questions.
edit: I just realized that with my algorithm as is, my *small set* of data will finish in approximately 185 DAYS. Yes, days. Either I'm calculating something terribly wrong or this is not feasible.
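A back-of-envelope check makes the blow-up easy to see; the values of n, k, and the per-candidate time below are hypothetical stand-ins, not my actual numbers:

```python
import math

def estimated_days(n, k, secs_per_candidate):
    """Days needed to score all comb(n, k) candidate center sets,
    assuming each set takes secs_per_candidate seconds to evaluate."""
    return math.comb(n, k) * secs_per_candidate / 86400.0

# Hypothetical: 100 sampled articles, 5 clusters, 1 ms per candidate
# set -- comb(100, 5) is about 7.5e7, so this is already most of a day.
print(estimated_days(100, 5, 0.001))
```

Since comb(n, k) grows roughly like n^k for fixed k, even a modest sample size pushes the estimate into days or years fast.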