The Haltman has asked what my research is this summer, and I promised him I'd post it to LJ. Ergo, this exists. If anyone else is interested in reading it, you're welcome to do so, and I'd be happy to hear feedback, but I'm mostly just making this post to satisfy the aforementioned Haltman's curiosity.
This summer, I'm working at the MedIX REU program in computer science at DePaul University. The group there does a lot of research on the Lung Image Database Consortium (LIDC) dataset, which is a collection of lung CT scans. In these scans, ~900 nodules (potential tumors) have been identified by various radiologists, and each nodule has been rated on a scale of 1 to 5 on various characteristics, such as spiculation (how spiky the nodule is) and malignancy. Our goal is to write a computer-aided diagnostic tool that gives radiologists a second opinion about the malignancy of a given nodule. Note that we are not trying to replace radiologists here. Rather, we want to give them a tool that can say "Hey... maybe you want someone else to take a look at this, because while you think it's benign, I think this is going to kill your patient."
Apart from the usual machine learning challenges, we are faced here with the problem of having no ground truth. Somewhere between 1 and 4 radiologists have rated each nodule, and their ratings do not always agree! Thus, two things are unclear. 1: How should our machine learning algorithm be trained on such data, and 2: How do we evaluate the success of our machine learning algorithm on such data? My current research/idea focuses on #1, but #2 is also an important question.
So here's the game plan. I think it is a mistake to have our algorithm output a single value. For example, if I tell you that the radiologists' mean rating was a 3 (3 signifies unsure on the 1-5 scale), that might mean that no one is confident in the reading, or it might mean that most radiologists gave the nodule a 2 (likely benign), while a few radiologists noticed some often-overlooked feature and so gave it a 5 (definitely malignant). These two distributions mean _very_ different things, and so we actually want our algorithm to return some indication of the distribution -- either by learning distribution parameters (mean and standard deviation, perhaps?) or by learning the distribution directly. Granted, I'm not too sure how I'm going to go about learning the distribution other than smacking the problem with kernel regression. Since I've been unable to find any literature on learning an actual distribution, I'm going to assume that kernel regression is a reasonable route, but if anyone has other suggestions, I'd be glad to try them.
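For the curious, here's a minimal sketch of what I mean by learning the distribution directly with kernel regression (Nadaraya-Watson style: the predicted distribution for a query nodule is a kernel-weighted average of the training nodules' rating histograms). The names here -- X_train, hist_train, the Gaussian kernel, the bandwidth -- are placeholders of my own, not the actual experiment code:

import numpy as np

def predict_rating_distribution(x_query, X_train, hist_train, bandwidth=1.0):
    """Kernel-weighted average of the training nodules' rating histograms.

    X_train:    (n, d) feature vectors of the training nodules
    hist_train: (n, 5) normalized histograms over the 1-5 rating scale
    Returns a length-5 vector summing to 1: the estimated rating
    distribution for the query nodule.
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)
    weights = np.exp(-0.5 * (dists / bandwidth) ** 2)  # Gaussian kernel in feature space
    weights /= weights.sum()
    return weights @ hist_train                        # (5,) predicted distribution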
Ok, modulo not knowing how to learn the distribution, so far so good... except that we have at most 4 radiologists rating each nodule. Four samples is nowhere near enough to estimate a probability distribution. So... let's try augmenting each nodule's ratings by throwing in the ratings of nearby nodules in feature space, and _then_ learning the distribution. (For people who haven't had AI: in this case, features are things like "how spherical is it?" or "how much variation in density is there within the nodule?". If we give these numerical values, we can associate each nodule with a point in \mathbb{R}^n, and then use some measure of distance (say... Euclidean distance) to determine how "close" two nodules are to each other.) Most learning algorithms do this implicitly one way or another, but this seems like a situation where explicitly augmenting the dataset would be particularly useful. But... how about we get some better justification than "explicitly augmenting seems like a good idea."
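In code, the augmentation idea is basically this (again just a sketch with made-up names; X, ratings, and k are stand-ins, and picking k is its own question):

import numpy as np

def augment_ratings(X, ratings, k=5):
    """Pool each nodule's ratings with those of its k nearest neighbors.

    X:       (n, d) array of nodule feature vectors
    ratings: list of n lists; ratings[i] holds the 1-4 radiologist ratings for nodule i
    Distance is plain Euclidean distance in feature space.
    """
    augmented = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip index 0, the nodule itself
        pooled = list(ratings[i])
        for j in neighbors:
            pooled.extend(ratings[j])
        augmented.append(pooled)
    return augmented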
So how can we test this? Well... what other datasets have similar properties? Answer: any dataset where an opinion is involved... like... say... movie ratings. There are a plethora of movie rating datasets out there. With advice from Jonathan Gemmel (a PhD student at DePaul, but _not_ working in the MedIX lab), I've grabbed the MovieLens dataset, which consists of 10 million ratings and 100,000 tags across 10,000 movies. At an average of 1,000 ratings per movie, I think we're in good shape size-wise. I'm planning to use the tags as features and attempt to predict the rating distribution of a new movie given its tags. Call the resulting ratings predictor the "full-data predictor." Once this is done, I will randomly sample some small number of ratings from each movie (e.g. 4, as we have in the LIDC dataset) and train an algorithm that again predicts the rating distribution of a new movie. Call the resulting predictor the "small-data predictor." Finally, we augment the "small dataset" constructed above using the nearest-neighbor approach described in the previous paragraph and again train a predictor, which we call the "augmented predictor." The question here is... does the augmented predictor outperform the small-data predictor? If so, by how much, and by how much does it underperform the full-data predictor? If the answers are favorable, this approach might be worth using on the LIDC.
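To make "does it outperform" concrete, I'm thinking of scoring each predictor by how far its predicted rating histogram lands from a held-out movie's full rating histogram -- something like a simple L1 distance. Another sketch, with the three predictors and the data loading left as hypothetical stand-ins:

import numpy as np

def l1_error(predicted, actual):
    """L1 distance between a predicted and an actual rating histogram."""
    return np.abs(np.asarray(predicted) - np.asarray(actual)).sum()

def evaluate(predictors, test_movies):
    """predictors:  dict mapping a name to a function(tags) -> rating histogram
    test_movies: list of (tags, full_rating_histogram) pairs
    Returns each predictor's mean L1 error over the test movies."""
    scores = {name: [] for name in predictors}
    for tags, true_hist in test_movies:
        for name, predict in predictors.items():
            scores[name].append(l1_error(predict(tags), true_hist))
    return {name: float(np.mean(errs)) for name, errs in scores.items()}

# e.g. (hypothetical):
# evaluate({"full": full_predictor, "small": small_predictor,
#           "augmented": augmented_predictor}, test_movies)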
I think that's about it. I think this is a really cool problem, and I'm pretty excited about it/having fun coding the experiment/parsing data/whatnot (I should start getting results tomorrow. Yay!). Still, I'm a bit worried that explicitly using nearby neighbors is going too far, but there are a couple of recent papers that show that explicit consideration of nearby neighbors can be useful in other scenarios, so... I should be in the clear.