Nov 15, 2007 08:41
- Do not try to wget -m the whole of the Electronic Text Corpus of Sumerian Literature. There's rather a lot more there than you'd expect. Why they couldn't just provide a zipfile and/or tarball of the XML they store in the backend database is anyone's guess. I can't be the only geek who's read Snow Crash and wants to contribute.
- Other than that, I'm really rather impressed with the ETCSL. Check out the mouseover text. Of course, all that stuff needs stripping away for my purposes :-( The background articles on Sumerian language, literature and cuneiform (literally, "wedge-shaped") writing look pretty useful too. Annoyingly, their funding ran out in late 2006, so the site hasn't been updated for a while, and they seem to have made it unnecessarily hard for anyone to take over.
- The covariance of a constant signal with anything else, even itself, is always zero (strictly speaking the correlation coefficient is then undefined, since you'd be dividing by a zero variance, but treating it as zero is the obvious convention), so if I take elvum's suggestion to use autocorrelation then the "short short short short" problem becomes a non-issue. However, I then end up with another signal, whereas what I really need is a single number with higher values representing higher levels of poeticity. Possibly I can limit the number of lags I need to check by counting syllables-per-line; or maybe I could just take the maximum value of the autocorrelation over the non-zero lags (at lag zero it's trivially 1)? It's been nearly ten years since I did any statistics, so this is all a bit painful. I've tried asking friends in the stats department, and been met with the slightly worried look of an expert challenged on something that's just outside their narrow specialism. I know it well, because it's a look I often use myself.
- I'm not the first person to apply statistical ideas to analyse the corpus. There's even a book out: Analysing literary Sumerian: corpus-based approaches (or you can buy it from Amazon!) Nothing especially relevant-looking in the chapter headings, but I wonder if I could persuade the library to buy a copy... they don't have it in stock right now, but they do have the intriguing-looking Sumerian or Cryptology? Further investigation reveals that it used to be thought that Sumerian wasn't an actual language, but rather a priestly cryptosystem used for enciphering Semitic texts. More details here.
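For what it's worth, the "take the maximum of the autocorrelation" idea from the third point can be sketched in a few lines of Python. This is only a rough sketch under my own assumptions: the encoding of syllables as numbers (1 for long, 0 for short), the function names `autocorrelation` and `poeticity`, and the default choice of checking lags up to half the signal length are all made up for illustration, and a constant signal is simply defined to score zero.

```python
from statistics import fmean


def autocorrelation(signal, lag):
    """Normalised autocorrelation of `signal` at `lag`.

    A constant signal has zero variance, so the ratio is undefined;
    by convention (an assumption of this sketch) we return 0.0.
    """
    mean = fmean(signal)
    centred = [x - mean for x in signal]
    variance = sum(c * c for c in centred)
    if variance == 0:
        return 0.0
    overlap = sum(centred[i] * centred[i + lag] for i in range(len(signal) - lag))
    return overlap / variance


def poeticity(signal, max_lag=None):
    """Collapse the autocorrelation into one number: its maximum over lags >= 1.

    `max_lag` defaults to half the signal length (an arbitrary choice);
    a syllables-per-line count could be used to narrow it further.
    """
    if max_lag is None:
        max_lag = len(signal) // 2
    if max_lag < 1:
        return 0.0
    return max(autocorrelation(signal, lag) for lag in range(1, max_lag + 1))


# Hypothetical encoding: 1 = long syllable, 0 = short syllable.
regular = [1, 0] * 8   # long short long short ...
monotone = [0] * 16    # short short short short ...

print(poeticity(regular))
print(poeticity(monotone))
```

The regular pattern scores well above the monotone one, which scores exactly zero, so at least the "short short short short" case behaves. Whether the maximum is the right summary, rather than (say) the value at one particular lag, is still an open question.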
assyriology,
poetry,
projects,
books,
computers,
maths,
beware the geek,
language