Computational Assyriology: pozorvlak

Nov 11, 2007 22:49

So I was down in Cambridge last week, as part of an exercise in convincing mathematics Part 3 students that there are other universities in the UK, and that some of them are even worth doing PhDs at. I didn't manage to see everyone I'd have liked to, but I did get to see Antarctic Mike for the first time in nearly two years (as the name suggests, he spent most of that time Down South), and to spend a night on his new houseboat. I also went out to the pub with Mike and Ros (our musical director from The Matrix) and had a very nice lunch with scribeofnisaba, whence the rest of this entry.

scribeofnisaba is an Assyriologist (coolest job title ever, no?), which means that she studies the literature and languages of the ancient Near East - the languages spoken by the Assyrians, Sumerians, Babylonians and so on. Roughly speaking, the languages spoken in modern-day Iran and Iraq, in the period 4000-1000BC(ish). In other words, the study of the oldest literature in the world. She tells me that it's a very exciting field, and that now is a very exciting time to be working in it, as so little is known, and much of what we thought we knew has recently been shown to be wrong - for instance, it was always thought that Sumerian scribes were all male, but this is now thought increasingly suspect. One surprisingly basic thing we don't know (if I'm understanding her correctly) is whether or not Sumerian poetry was metric.

A brief word of explanation. Poetry in many languages is characterised by the use of a repetitive pattern of long and short syllables, called the "metre". Shakespeare, for instance, used the pattern

SLSLSLSLSL
SLSLSLSLSL
...

whereas Virgil used

LSSLSSLSSLSSLSSLL
LSSLSSLSSLSSLSSLL
...

Now, it seems to me that the question of Sumerian metre might well be the kind of problem that reacts well to having a few billion processor cycles thrown at it, but there are a couple of things I'm not too clear on, and maybe you lot can help me out.

As I understand it, we don't know
0. whether the Sumerians wrote metric poetry at all,
1. if they did, what metre they used,
2. how individual Sumerian words were stressed.

Assuming the answer to 0 is "yes", then if we know 1 we can work out 2, and vice versa. But if we don't know either, we might be able to bootstrap a knowledge of both by considering their interaction. This could well turn into a lifetime's work if done by hand, but fortunately we now have wonderful machines for doing repetitive calculations very fast, and the Electronic Text Corpus of Sumerian Literature seems like it was designed for precisely this kind of automated search.
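To make the "know one, get the other" step concrete, here's a minimal Python sketch. It uses an English line rather than anything Sumerian, and everything in it is an assumption for illustration: the function names, the syllable counts, and the idea that a line's pattern is just its words' patterns stuck end to end.

```python
def scan_line(words, lexicon):
    """Known word stresses -> the pattern of a whole line."""
    return "".join(lexicon[w] for w in words)


def stresses_from_metre(words, syllable_counts, metre):
    """Known line metre -> a pattern for each word, by slicing the metre.

    If a word turns up twice in different positions, the later slice
    silently overwrites the earlier one; a real version would have to
    notice such conflicts.
    """
    lexicon = {}
    pos = 0
    for w in words:
        n = syllable_counts[w]
        lexicon[w] = metre[pos:pos + n]
        pos += n
    if pos != len(metre):
        raise ValueError("syllable counts do not add up to the metre length")
    return lexicon


# Illustrative data only: one pentameter line, syllables counted by hand.
words = "shall I compare thee to a summer's day".split()
counts = {"shall": 1, "I": 1, "compare": 2, "thee": 1,
          "to": 1, "a": 1, "summer's": 2, "day": 1}
metre = "SLSLSLSLSL"

lexicon = stresses_from_metre(words, counts, metre)
print(lexicon["compare"], lexicon["summer's"])
print(scan_line(words, lexicon) == metre)
```

The hard case, of course, is the one where neither the lexicon nor the metre is given.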

[scribeofnisaba: You mentioned that cuneiform was syllabic: is the same symbol typically used in many different words? Because if so, we might be able to assume that it's stressed the same way whenever it occurs. But then again, probably not. Also, do we know that the lines are metrical units?]

Anyway, here's my idea. We search the space of possible ways of stressing words, look at the metres each one produces on the corpus of poetry, and pick out the stress patterns that produce the most regular metres. The searching's not that difficult, or rather there are well-known ways of doing it (more on that in a minute). The tricky bit, actually, is finding an automated way of looking for "poetic" stress patterns - the problem is that the pattern "short short short short short..." is as regular as you like. We could (for instance) calculate the stress pattern of every poem in the corpus, given our current guess at word stresses, then compress the resulting bitstream and observe the compression ratio - but that would tend to favour "short short short short...". We could, er, apply a discrete Fourier transform to split it into a sum of periodic functions, and then, er, um. Just eyeballing it won't work - we'd probably have to go through several thousand (mutate, calculate metre, compare) cycles. Is there some kind of constraint on word stress patterns for Sumerian that we know about? Each word must have at least one long syllable, or something? Anyway, there's presumably some way of doing this that I'm either too stupid or ignorant to find right now :-)
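For what it's worth, here is one way the Fourier idea could be finished off, as a Python sketch. It is a guess at a usable score, not a known-good one: subtract the mean from the long/short bitstream, then ask what fraction of what's left sits at a single frequency. The name `periodicity_score` and the whole scoring rule are my assumptions. Subtracting the mean is what stops "short short short..." from winning, since a constant stream has nothing left over and scores zero.

```python
import cmath


def periodicity_score(pattern):
    """Fraction of the mean-removed signal's energy at its strongest frequency.

    `pattern` is a string of 'L' and 'S'. A constant pattern scores 0.0;
    a strictly alternating one scores 1.0 (up to rounding).
    """
    n = len(pattern)
    if n == 0:
        return 0.0
    x = [1.0 if c == "L" else 0.0 for c in pattern]
    mean = sum(x) / n
    x = [v - mean for v in x]
    total = sum(v * v for v in x)
    if total == 0:
        return 0.0  # all long or all short: perfectly regular, but not a metre
    best = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(x))
        power = abs(coeff) ** 2
        if 2 * k != n:
            power *= 2  # frequency n-k mirrors k, so count both
        best = max(best, power)
    # Parseval: the powers over all frequencies sum to n * total.
    return best / (n * total)


print(periodicity_score("SL" * 5))
print(periodicity_score("S" * 10))
print(periodicity_score("LLSLSSSLSL"))
```

A score like this only rewards one clean frequency, so a metre that allows substitutions (a long in place of two shorts, say) would be marked down. Treat it as a starting point rather than an answer.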

As for the algorithm for the search: the most obvious thing to do would be to simply try every pattern, but with nearly a thousand nouns in the glossary (and many other words of other sorts), each containing several syllables which can be either short or long, we'd have a search space with around 2^2000 elements. This is, to say the least, computationally infeasible. Another approach might be via "hill-climbing": start somewhere random, change it a bit, see if the change improves matters, keep it if so, repeat until you can no longer improve your results by making small changes. The problem with this is that you can get stuck at a "local maximum", or a point better than everything nearby but still not the best overall. Every mountain-top is a local maximum for the height of the Earth above sea level, but not every mountain is Everest. One way of dealing with this problem is to introduce a bit more randomness: the following elegant algorithm (which is the one I was thinking of using) is due to Metropolis:
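Two quick illustrations in Python. The first is just the arithmetic: roughly two thousand long-or-short choices gives 2^2000 candidates. The second is a toy hill-climber getting stuck. It is a steepest-ascent variant (try every single flip, take the best) rather than the random version described above, and the `trap` scoring function is invented purely to have a local maximum in it.

```python
# Size of the search space: 2000 binary choices.
print(len(str(2 ** 2000)), "decimal digits")


def hill_climb(score, start):
    """Flip one bit at a time, always taking the best improving flip."""
    current = list(start)
    while True:
        best_neighbour, best_score = None, score(current)
        for i in range(len(current)):
            neighbour = current.copy()
            neighbour[i] = 1 - neighbour[i]
            s = score(neighbour)
            if s > best_score:
                best_neighbour, best_score = neighbour, s
        if best_neighbour is None:
            return current  # no single flip helps: a local maximum
        current = best_neighbour


def trap(bits):
    """Toy score: more 1s is better, except that all 0s is best of all."""
    return len(bits) + 1 if not any(bits) else sum(bits)


summit = hill_climb(trap, [1, 1, 0, 0, 0])
print(summit, trap(summit))   # where the climber ends up
print(trap([0, 0, 0, 0, 0]))  # the Everest it never finds
```

Starting from two 1s, every step towards the all-zeros summit looks like a step downhill, so the climber walks off in the other direction and stops on the lesser peak.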

1. Start at a random point in the search space.
2. Make a small change to your position.
3. See if this has improved things. If it has, keep your change. If not, toss a (weighted) coin, and keep the change if it comes up heads. The coin should be weighted according to how much worse the change has made your results: the more deleterious the change, the lower its chance of being kept.
4. Repeat several hundred or thousand times until your position stabilises.
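Those steps translate fairly directly into Python. In this sketch the coin's weight is exp(change / temperature), which is the usual textbook choice for Metropolis; the `temperature` parameter, the one-bit-flip move, keeping track of the best position seen, and the toy `alternations` score are all my assumptions, not anything specific to the Sumerian problem. The start is passed in (all zeros in the usage below), so the random start of the first step is left to the caller.

```python
import math
import random


def metropolis(score, start, steps=2000, temperature=1.0, seed=0):
    """Randomised search over lists of 0/1 bits; returns (best, best_score)."""
    rng = random.Random(seed)
    current = list(start)
    current_score = score(current)
    best, best_score = current.copy(), current_score
    for _ in range(steps):
        # Small change: flip one randomly chosen bit.
        candidate = current.copy()
        i = rng.randrange(len(candidate))
        candidate[i] = 1 - candidate[i]
        delta = score(candidate) - current_score
        # Keep improvements (and ties); keep a worsening with probability
        # exp(delta / temperature), which shrinks as the change gets worse.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = current.copy(), current_score
    return best, best_score


def alternations(bits):
    """Toy score: how many neighbouring pairs differ (SLSL... scores highest)."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


best, best_score = metropolis(alternations, [0] * 12)
print(best, best_score)
```

For the real problem, `score` would be some measure of how metrical the corpus looks under the current guess at word stresses, and each bit would be one syllable's long-or-short choice.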

I attended a series of lectures by the mathematician and magician (mathemagician?) Persi Diaconis on (among other things) this algorithm about a year ago, and it's really surprisingly effective.

computers, travel, assyriology, lazyweb, maths, poetry, language