I'm here this evening to talk to you about token frequency analyses of the State of the Union. Here are a couple that have been tossed around your interwebs over the past few days:
The ostensible goal of this kind of analysis is to determine, proportionally, what the President is talking about most. I don't think it does that, for two reasons.
First: What someone "means," what they're talking about, isn't localized to the individual lexical items in their speech. In other words, there isn't a one-to-one relationship between a word and an act of meaning something. Because of this, token frequency analysis fails to capture what might be important references to a particular thing that don't use the word in question. Bush might say, for example, "the country where our military operations are focused," or even "For the rest of the speech, I will be referring to 'Iraq' with the word 'Smoo.'" Even though Bush is obviously talking about Iraq, the word count for the token "Iraq" wouldn't get incremented in those instances.
Moreover, some words might have their count incremented by uses that don't align with the meaning you'd expect. A President might say, "We should worry about the welfare of all our citizens" or "Now that I think about it, 'homeland security' is a pretty dumb phrase." Words like "welfare" and "homeland security" would get bigger in the tag cloud, even though their meaning in context doesn't line up with what the word out of context seems to mean.
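To make the mechanism concrete, here's a minimal sketch of the kind of counter these analyses rely on (the speech text is invented, and the counting scheme is my guess at what the tag-cloud generators do): it tallies surface strings and nothing else, so coreferent mentions and metalinguistic uses are invisible to it.

```python
from collections import Counter
import re

def token_counts(text):
    """Naive token frequency: lowercase, split into word-like strings, count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

speech = (
    "Our military operations are focused on one country. "
    "For the rest of the speech, I will be referring to Iraq "
    "with the word Smoo. Smoo needs a stable government. "
    "Now that I think about it, homeland security is a pretty dumb phrase."
)

counts = token_counts(speech)
print(counts["iraq"])      # 1 -- despite three references to Iraq
print(counts["smoo"])      # 2 -- the counter can't know these mean Iraq
print(counts["security"])  # 1 -- counted, though the use is metalinguistic
```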
Our intuition about words is that they have a definition in the dictionary, and that's what they mean, regardless of context (except, maybe, in special circumstances). Token frequency analysis counts on this intuition being true: that a use of a word will, more often than not, correspond with its dictionary definition. This assumption works well for glib analysis, but I don't see any empirical reason to believe it. (In fact, it doesn't seem like a question particularly open to empirical investigation.)
The second reason is that token frequency analyses like the ones above arbitrarily reject certain words. The NYT analysis doesn't allow you to search for words with fewer than three letters; the historical tag cloud page says that its algorithm "removes the most common words like 'the', 'and', 'this', 'that' and some not so common language-specific words like 'hitherto', and 'notwithstanding.'" I can't see a good reason for doing this. If saying the word "Iraq" a lot means that the President is talking about Iraq a lot, why doesn't it follow that saying "and" a lot means that the President likes to conjoin phrases? That saying "the" a lot means that the President likes to pick out one salient referent among many?
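For what it's worth, the filtering step both pages describe amounts to something like this (the stoplist contents and the length cutoff are my reconstruction from their descriptions, not their actual code):

```python
from collections import Counter
import re

# My guess at the kind of stoplist these pages use; neither publishes
# its list in full. Note that it bakes in a theoretical decision about
# which words "don't count" as content.
STOPWORDS = {"the", "and", "this", "that", "of", "to", "a", "in",
             "hitherto", "notwithstanding"}

def tag_cloud_counts(text, min_length=3):
    """Count tokens, silently discarding stopwords and short words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens
                   if t not in STOPWORDS and len(t) >= min_length)

# "and" and "the" vanish before they can dominate the cloud:
print(tag_cloud_counts("the war and the peace and the war"))
# Counter({'war': 2, 'peace': 1})
```

An unfiltered count would of course be dominated by "the" and "and," which is presumably why they're dropped; the point is that dropping them is a substantive decision, not a neutral preprocessing step.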
Geoffrey Pullum at Language Log makes another good point, which is that a token frequency analysis needs to decide whether to count word forms or lexemes. Counting word forms means that you'd have separate counts for bless, blessing, and blesses; a lexeme count would lump all of these together. Both methods have problems: a word form count misses generalizations among words, while a lexeme count lumps together words that might have significant semantic differences in context (Pullum's example is insure and insurance).
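The two counting policies are easy to see side by side. In the toy version below, the lemma table is hand-built for the example, which is exactly the problem: someone has to decide, word by word, what counts as the same lexeme.

```python
from collections import Counter

# Hand-built lemma table, for illustration only. The last entry maps
# "insurance" onto "insure" -- exactly the kind of lumping Pullum warns
# about, since the two can mean quite different things in context.
LEMMAS = {"bless": "bless", "blesses": "bless", "blessing": "bless",
          "insure": "insure", "insurance": "insure"}

tokens = ["bless", "blessing", "blesses", "insure", "insurance"]

word_form_counts = Counter(tokens)
lexeme_counts = Counter(LEMMAS.get(t, t) for t in tokens)

print(word_form_counts)  # five separate counts of 1: misses that the
                         # bless- forms are related
print(lexeme_counts)     # {'bless': 3, 'insure': 2}: conflates
                         # insure with insurance
```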
In conclusion, here are funny videos of talking dogs and hamsters or something doing backflips.