Re: Ooh. Tell me more. serapio, December 4 2008, 06:50:22 UTC
Besides completely throwing out variables that are near zero weight, regularization (and I guess the spatial transformations of PCA and PLS do this too) also reduces the number of relevant variables in any particular case. Spreading out the weight distribution makes it easier to approximate the result by just including the largest few terms in each case.
Optimality Theory originally grew out of a perceptron-like formalism, with numerical weights on the variables and a categorical output, and the original analogies between perceptrons and OT kind of suggest an L1 exponential prior with discrete (discontinuous) support. The move to OT switches from comparison of numerical sums to a decision-tree-like comparison of candidate outputs, using the variables in order of importance. It essentially breaks down each case into a bunch of candidate decisions, and in each decision it only pays attention to the highest variable that distinguishes between the two candidates. This was motivated by an apparent scarcity in language of the kind of "ganging up" sum effects that you see in perceptrons. But there remain a few phenomena that look like "ganging up", and recently people have been looking beyond the categorical phenomena that are traditional in theoretical linguistics. The probabilistic phenomena seem to require some allowance for these ganging up effects. But they still seem less common than we would expect given a uniform prior, or even the L1 exponential prior.
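To make the contrast concrete, here is a minimal sketch of the two comparison schemes, assuming a toy setup that is not from any real analysis: the constraints A, B, C, the candidates x and y, and the weights are all hypothetical. Strict ranking is modeled as lexicographic comparison of violation counts, and the perceptron-like scheme as a weighted sum of violations.

```python
def violations_key(violations, ranking):
    """Violation counts reordered from highest-ranked constraint to lowest."""
    return tuple(violations[c] for c in ranking)


def ot_winner(candidates, ranking):
    # Strict ranking: tuples compare lexicographically, so the highest-ranked
    # constraint that distinguishes two candidates decides between them.
    return min(candidates, key=lambda name: violations_key(candidates[name], ranking))


def weighted_winner(candidates, weights):
    # Perceptron-like comparison: sum the weighted violations, lowest penalty wins.
    def penalty(name):
        return sum(weights[c] * n for c, n in candidates[name].items())

    return min(candidates, key=penalty)


# Hypothetical constraints and candidates, chosen only to show the difference.
candidates = {
    "x": {"A": 1, "B": 0, "C": 0},  # violates only the top constraint
    "y": {"A": 0, "B": 1, "C": 1},  # violates the two lower constraints
}
ranking = ["A", "B", "C"]                 # A outranks B, B outranks C
weights = {"A": 3.0, "B": 2.0, "C": 2.0}  # assumed weights, same ordering

print(ot_winner(candidates, ranking))        # y
print(weighted_winner(candidates, weights))  # x
```

Under strict ranking y wins, because A alone settles the comparison. Under the weighted sum x wins, because B and C together (penalty 4.0) outweigh A (penalty 3.0). That second outcome is the "ganging up" effect.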
I suspect that there is an interesting explanation for all of this, but I'm kind of stuck at this point about how to look for it. I wrote a long term paper about this 6 months ago, but then came back to China, so between the shortage of people who have a good understanding of both learning theory and language phenomena, and being away from my peeps in SD who are obligated to read the paper, I've not had any feedback on it. I'm going back to SD in a few weeks, so maybe I'll continue with it then.
Re: Ooh. Tell me more. gustavolacerda, December 4 2008, 07:19:54 UTC
<< Besides completely throwing out variables that are near zero weight, regularization ... >>
First of all, this is *L1* regularization. Secondly, no, not *near* zero weight. L1 methods throw out the subset of variables whose exclusion least hurts (in terms of prediction error).
Re: Ooh. Tell me more. serapio, December 4 2008, 07:38:51 UTC
You're throwing out variables that are within your tolerance level of approximately zero, right? I thought this tolerance was usually set to something well above machine tolerance.
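One way to look at the tolerance question is a minimal sketch of L1-penalized least squares solved by coordinate descent with soft-thresholding. This is one common way to fit such models, not necessarily what any particular package does. The function names, the toy data, and the penalty value `lam` are all assumptions made for illustration.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, sweeps=50):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0  # correlation of column j with the residual that ignores w[j]
            z = 0.0    # squared norm of column j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Toy data (hypothetical): three mutually orthogonal columns.
# y = 2.0 * x0 + 0.05 * x1, and x2 plays no role at all.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.05, 1.95, -1.95, -2.05]

w = lasso_cd(X, y, lam=0.1)
print(w)
```

In this sketch the second and third weights come out as exactly 0.0, not as small numbers cut off by a separate numerical tolerance. The penalty strength `lam` plays the role of the cutoff: a variable is dropped when its correlation with the remaining residual is no larger than `lam`, and the surviving weight is shrunk by the same amount (about 1.9 here instead of 2.0).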
I don't understand the question.
I'm not familiar with your terminology. I also don't know this stuff very well.