Dec 29, 2009 04:02
The great thing about computational linguistics (in particular, text rather than speech) is that it's very easy to come up with research questions that can be answered by doing (often simple) statistics on large corpora, e.g. (a toy sketch of the first one appears below the list):
* in checklists, people don't always end their sentences with a period or other punctuation mark. Which grammatical structures tend to be closed off with explicit punctuation?
* when do bloggers complain the most about their partners or their bosses? How does that correlate with company earnings or unemployment rates?
* how does one's writing reflect one's linguistic (or cognitive) impairments (e.g. in aphasics or L2 speakers)? How much insight can you get into someone's mind from their writing?
* what can you predict about other data sources (e.g. stock prices, movie ratings) based on newspaper text?
* what are the correlates of font choice?
(and if you're getting people to type for you, keystroke-logging data can be even more interesting cognitively — perhaps as interesting as eye-tracking data.)
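To make "simple statistics" concrete, here's a toy sketch of the punctuation question, using NLTK's tagged Brown corpus as a stand-in for a real checklist corpus (which you'd have to collect yourself): count which part-of-speech tags end sentences with and without explicit closing punctuation.

```python
# Toy sketch of the first question above: which POS tags tend to precede
# explicit final punctuation, and which end sentences that have none?
# Uses NLTK's tagged Brown corpus as a stand-in for a real checklist corpus
# (requires: pip install nltk, then nltk.download('brown')).
from collections import Counter
from nltk.corpus import brown

with_punct = Counter()     # POS of the last word before a final . ! ?
without_punct = Counter()  # POS of the last word when there is no final punctuation

for sent in brown.tagged_sents():
    word, tag = sent[-1]
    if word in {'.', '!', '?'}:
        if len(sent) > 1:
            with_punct[sent[-2][1]] += 1
    else:
        without_punct[tag] += 1

print("Final POS tags (explicit punctuation):", with_punct.most_common(5))
print("Final POS tags (no punctuation):", without_punct.most_common(5))
```

On Brown nearly every sentence ends with punctuation, so the second counter is mostly a placeholder; the point is that the whole analysis is a dozen lines of counting over a corpus.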
The not-so-great thing is that shallow approaches don't work for everything (although they can be surprisingly good!), and annotation can be expensive (though Mechanical Turk is making it a lot cheaper).
Having said that, I'm simply more interested in statistics: theory, methodology, modeling, and algorithmics. And although engineering can be lots of fun, it can also be a pain to use other people's tools (lemmatizers, parsers, POS taggers, etc.) or to hack up your own.
academic