Mar 21, 2010 19:49
* Some time ago, Andy asked me to collaborate with him on a project involving English vowel mergers. We are, roughly, testing the hypothesis that statistically speaking, vowel mergers that occur tend to create fewer homophones than potential vowel mergers that could have occurred but didn't.
* This week, I asked Brent to help me come up with a Classics research project idea. I have no publications in Classics, and I want to apply for Classics jobs. I asked him for a project involving corpus linguistics, because I'm getting kind of good at that, and we have these giant electronic Latin and Greek corpora. He suggested a project with literary applications, for marketability purposes.
What he came up with was stylometry, the rigorous computational study of literary style to infer facts about texts, such as authorship or chronological ordering. It is popular in Shakespeare studies because of the age-old authorship controversy, and it has also been applied to authorship attribution problems in Classics. He is now looking for some possible Latin corpora to study, while I read up on how to do stylometry. I am particularly pleased about this choice of project because of its application to English literature, where I also need to improve my marketability.
You may also be familiar with forensic applications of stylometry: questions like "Who wrote this ransom note?" or "Did the victim write this suicide note, or was he murdered and the note written by the murderer?"
* There are three types of authorship classification tasks: binary, i.e., "Was this text written by Author A or Author B?"; multi-class, i.e., "Was this text written by Author A, B, C, ..., N?"; and single-class, i.e., "Did Author A write this or not?" These tasks are listed in increasing order of difficulty.
I am likely to end up doing the third one, given Brent's example problem: "Did Plato write these letters attributed to him, or are they pseudo-Plato (written by someone else, anyone else)?"
* The most exciting thing in my life right now is that the math I am learning for both projects, the vowel mergers and the stylometry, is the same. I will now talk about entropy in information theory.
* In information theory, entropy is a measure of randomness, which can also be understood in terms of our ability to make predictions based on the information we have. For example, if we have a regular, fair, two-sided coin, the entropy works out to be 1 bit, which corresponds to a 50-50 chance of making the correct prediction. If both sides of the coin are heads, then the entropy is 0, because we have a 100% chance of making the correct prediction. If the coin is unfair and weighted a little bit, say with a 60% chance of coming up heads, we have an entropy of somewhat less than 1 bit, corresponding to a somewhat better than 50% chance of making the correct prediction. Of course, a larger number of flips will enable us to determine the unfairness of the coin more accurately.
The lower the entropy, the less randomness in the system.
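For the curious, here is a tiny Python sketch of the standard formula behind those coin numbers, H = the sum of -p log2 p over the outcomes (just my own illustration, not code from either project):

    from math import log2

    def entropy(probs):
        """Shannon entropy in bits of a discrete probability distribution."""
        return sum(-p * log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
    print(entropy([1.0]))       # two-headed coin: 0 bits
    print(entropy([0.6, 0.4]))  # 60-40 coin: about 0.971 bits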
* In terms of the vowel mergers problem, consider language as a sequence of events, like a sequence of coin flips. The events may be phonemes, syllables, characters, words, etc. A text produced by a monkey at a typewriter will have very high entropy (be extremely random). The complete works of Shakespeare will have lower entropy, because he is mathematically more limited in his possibilities by writing coherent English. He has a smaller set of possible words to choose from, and can order them in only so many grammatical ways.
* We'll now play with simple pretend languages. Pretend language X has two words: {a, b}. The text ababababababababababababab has very low entropy. We can predict that the next event will be a. The text abbaaaaabaabbbbbaabaaabbaa has much higher entropy. It's hard to predict what the next event will be.
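To make "hard to predict the next event" a little more concrete, one measure that captures it is conditional entropy: the entropy of the next symbol given the previous one, estimated from bigram counts. (This is just my own illustration; it is not necessarily the measure we will end up using.) It comes out to 0 bits for the alternating string and roughly 0.96 bits for the messy one:

    from collections import Counter
    from math import log2

    def conditional_entropy(text):
        """H(next symbol | previous symbol) in bits, estimated from bigram counts."""
        bigrams = Counter(zip(text, text[1:]))
        contexts = Counter(text[:-1])
        total = sum(bigrams.values())
        h = 0.0
        for (prev, _next), n in bigrams.items():
            p_joint = n / total          # P(previous, next)
            p_cond = n / contexts[prev]  # P(next | previous)
            h -= p_joint * log2(p_cond)
        return h

    print(conditional_entropy("ababababababababababababab"))  # 0.0: perfectly predictable
    print(conditional_entropy("abbaaaaabaabbbbbaabaaabbaa"))  # ~0.96: hard to predict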
Our next pretend language, Y, has three words: {a, b, c}. In the sequence (corpus) aabcbaacbababcbcacab, there is a certain amount of entropy, call it e1. Now, let's say Y undergoes a phonological merger such that the words a and b are now homophones. We'll call the resulting homophone d. The corpus will now look like this: dddcdddcdddddcdcdcdd. Its entropy, e2, is much lower than the entropy, e1, of the first string: we have a better chance of predicting the next event. It's either a d or a c, and based on the frequency of d in this corpus, the odds of getting d are even better than 50%.
* We can then compare this actual merger of a-b to other possible mergers, a-c and b-c, that did not occur, by computing the loss of entropy. We can also figure out how much work the contrast a-b was doing in the language by taking the original entropy, e1, subtracting the entropy of the new language with the merger and homophones, e2, and dividing by the entropy of the original language, e1; that is, (e1 - e2) / e1. This gives us the 'functional load' of the a-b contrast.
Which features of the language to take into account when computing entropy is not obvious; that is something Andy and I are testing.
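For the toy numbers above, plain unigram word counts will do. Here is the whole language Y calculation in one little sketch (with word counts standing in for whatever features we end up testing):

    from collections import Counter
    from math import log2

    def entropy(text):
        """Unigram Shannon entropy in bits."""
        counts = Counter(text)
        total = sum(counts.values())
        return sum(-(n / total) * log2(n / total) for n in counts.values())

    corpus = "aabcbaacbababcbcacab"
    merged = corpus.replace("a", "d").replace("b", "d")  # the a-b merger: both become d

    e1 = entropy(corpus)              # about 1.56 bits
    e2 = entropy(merged)              # about 0.81 bits
    functional_load = (e1 - e2) / e1  # about 0.48: the work the a-b contrast was doing
    print(e1, e2, functional_load)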
* The hypothesis is that actual mergers will result in a smaller loss of entropy, i.e. fewer homophones (taking word frequency into account), than potential mergers that do not occur. In other words, there is a force in language that wants to maximize disambiguation. Because, look, if we merged a, b, and c into d, we could always predict with certainty that the next word would be d. We would go around saying "ddddd" all the time, and no communication would take place. Obviously, this force does not operate unchecked, because mergers do take place, we have homophones, and language is more predictable than our monkey on a typewriter, who is generating random nonsense with super-high entropy (although occasionally he does produce the works of Shakespeare, given enough time).
* Now for authorship attribution.
Entropy, which tracks our ability to make predictions about what comes next, can also be viewed in terms of how easily we can model the system. The more complicated the model has to be, the greater the entropy. We can model our ababababababab string very easily, by saying "alternating occurrences of a and b." Our random string abbaaaaabaabbbbbaabaaabbaa is much more complicated to model.
* One of the things people have looked at is how to make predictions about one system (text) by using a model designed for another system. For example, there are Federalist papers known to have been written by Madison, Federalist papers known to have been written by Hamilton, and Federalist papers known to have been written by one of the two. If we model Madison's texts using some feature, such as his use of function words, and model Hamilton's texts using the same feature, we can figure out which model makes better predictions about the disputed papers. We ask which pairing, Madison ~ disputed or Hamilton ~ disputed, yields the smaller relative entropy. In other words, how different is the distribution of events (words) in each pair of texts? This difference between the distributions of two texts is called the Kullback-Leibler divergence. Smallest KL divergence wins.
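Here is a rough sketch of that comparison, with placeholder texts and a handful of English function words standing in for whatever feature set the real Federalist studies used (an illustration of the idea, not their actual method):

    from collections import Counter
    from math import log2

    def word_dist(text, vocab, alpha=1.0):
        """Smoothed word-frequency distribution over a fixed vocabulary.
        The add-alpha smoothing is just a placeholder to keep the divergence finite."""
        counts = Counter(text.lower().split())
        total = sum(counts[w] for w in vocab) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def kl_divergence(p, q):
        """D(P || Q) in bits: how badly Q's model predicts data distributed as P."""
        return sum(p[w] * log2(p[w] / q[w]) for w in p)

    # Placeholder strings; real texts would go here.
    madison_text, hamilton_text, disputed_text = "...", "...", "..."
    vocab = ["the", "of", "to", "and", "in", "by", "upon", "on", "that", "which"]

    p_disputed = word_dist(disputed_text, vocab)
    d_madison = kl_divergence(p_disputed, word_dist(madison_text, vocab))
    d_hamilton = kl_divergence(p_disputed, word_dist(hamilton_text, vocab))
    print("Madison" if d_madison < d_hamilton else "Hamilton")  # smallest divergence wins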
This is all making me insanely happy right now.