Oct 06, 2008 19:38
A short article for an easy step. The first part of compiling or interpreting a script is breaking the script's text up into tokens. If you're using lex (which you probably should do), that's pretty easy--but since "easy" is no fun, I'll be doing it the old-fashioned hand-coded way.
A clever tokenizer won't run too far ahead of the actual compiler--it hands over one token at a time, as the compiler asks for it. That's called a streaming tokenizer, and it's marginally more complex than a tokenize-everything-and-remember-it-all-at-once tokenizer. I'm using the latter, since I've used up all my "easy is no fun"-ness in the first paragraph.
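Since we're remembering everything, we need somewhere to put it. Here's a minimal sketch of what a token array could look like in C--the names (TokType, Token, TokenList, push_token) and fields are my own invention for illustration, not from any particular library:

    #include <stdlib.h>

    typedef enum {
        TOK_NUMBER, TOK_IDENT, TOK_KEYWORD, TOK_STRING,
        TOK_CHAR, TOK_OPERATOR, TOK_EOF
    } TokType;

    typedef struct {
        TokType     type;
        const char *start;   /* points into the original script text */
        size_t      length;  /* how many bytes of source the token covers */
        int         line;    /* handy for error messages later */
    } Token;

    typedef struct {
        Token *items;
        size_t count, capacity;
    } TokenList;

    /* Append one token, growing the array as needed
       (allocation-failure handling omitted for brevity). */
    static void push_token(TokenList *list, Token t)
    {
        if (list->count == list->capacity) {
            list->capacity = list->capacity ? list->capacity * 2 : 64;
            list->items = realloc(list->items, list->capacity * sizeof(Token));
        }
        list->items[list->count++] = t;
    }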
The basic tokenization loop looks more or less like this--I'll sketch the hairier steps in C after the list:
as long as there's content left in the input script {
    skip whitespace (including EOLs)
    if we hit "//", skip to the end of the line and restart this loop
    if we hit "/*", keep reading until we hit "*/", then restart this loop
    if we ran into the end of the script, quit now
    if the next char is a digit, look for number sequences
        don't forget to look for hex and octal radixes ("0x5E13", "0777")
        don't forget to look for decimals and exponents ("15.37", "27e+5")
        remember the special cases ("0" and "0.3" start with a zero but aren't octal)
    see if the next 3 characters match a 3-character token (like ">>=");
        if so, record that token into our output and restart this loop
    likewise, see if the next 2 characters match a 2-character token ("+=", "<<")
    likewise, see if the next character matches a 1-character token (":", ";" etc)
    if the next character is an apostrophe, crack character literals like 'x'
        remember to handle escapes like '\t' for tab, '\n' for newline etc
        and of course '\x7f', '\127', '\035' should be supported too
    if the next character is a quotation mark, try to pull a whole string
        this is pretty easy--just skip the ", then keep reading until we hit another one
        again, look for \ prefixes, and don't be fooled by \"
    okay, the next word must be plain text--either a keyword or an identifier like a variable name
        scan forward until we run out of legal characters for either, and accumulate the text
        then match against known keywords ("for", "return" etc)
}
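First, the whitespace-and-comment skipping at the top of the loop. Something like this could work (skip_blanks is a name I made up):

    #include <ctype.h>

    /* Advance past whitespace and comments; returns a pointer to the
       next meaningful character (or to the '\0' at end of script). */
    static const char *skip_blanks(const char *p)
    {
        for (;;) {
            while (isspace((unsigned char)*p))
                p++;
            if (p[0] == '/' && p[1] == '/') {         /* line comment */
                while (*p && *p != '\n')
                    p++;
            } else if (p[0] == '/' && p[1] == '*') {  /* block comment */
                p += 2;
                while (*p && !(p[0] == '*' && p[1] == '/'))
                    p++;
                if (*p)
                    p += 2;  /* step past the closing star-slash */
            } else {
                return p;    /* real content, or end of script */
            }
        }
    }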
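Next, number scanning, with the radix special cases called out above. This sketch (scan_number is my name, not anyone's API) only measures the extent of the literal; converting it to a value and rejecting malformed digits, like an 8 in an octal constant, is left out:

    #include <ctype.h>

    /* Scan a numeric literal starting at p; returns the first character
       past it. Recognizes hex ("0x5E13"), octal ("0777"), decimals
       ("15.37", "0.3") and exponents ("27e+5"). */
    static const char *scan_number(const char *p)
    {
        if (p[0] == '0' && (p[1] == 'x' || p[1] == 'X')) {  /* hex */
            p += 2;
            while (isxdigit((unsigned char)*p))
                p++;
            return p;
        }
        const char *start = p;
        while (isdigit((unsigned char)*p))
            p++;
        if (*p == '.') {            /* "15.37", and the octal-ish "0.3" */
            p++;
            while (isdigit((unsigned char)*p))
                p++;
        } else if (start[0] == '0' && p - start > 1) {
            return p;               /* "0777" is octal; a lone "0" is not */
        }
        if (*p == 'e' || *p == 'E') {  /* exponent: "27e+5" */
            p++;
            if (*p == '+' || *p == '-')
                p++;
            while (isdigit((unsigned char)*p))
                p++;
        }
        return p;
    }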
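The operator matching works longest-first, so ">>=" never gets chopped into ">>" and "=". The tables here are a sampler, not a complete operator set:

    #include <string.h>

    static const char *ops3[] = { ">>=", "<<=", NULL };
    static const char *ops2[] = { "+=", "-=", "<<", ">>", "==", "!=",
                                  "&&", "||", NULL };
    static const char  ops1[] = "+-*/%<>=!&|^~?:;,.(){}[]";

    /* If an operator starts at p, return its length (3, 2 or 1); else 0. */
    static int match_operator(const char *p)
    {
        for (int i = 0; ops3[i]; i++)
            if (strncmp(p, ops3[i], 3) == 0)
                return 3;
        for (int i = 0; ops2[i]; i++)
            if (strncmp(p, ops2[i], 2) == 0)
                return 2;
        if (*p && strchr(ops1, *p))
            return 1;
        return 0;
    }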
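Cracking escapes inside character and string literals could look roughly like this--decode_escape handles the '\t', '\n', '\x7f' and octal forms from the list, and scan_string uses it to pull a whole quoted string without being fooled by \" (both names are mine):

    #include <ctype.h>
    #include <stddef.h>

    /* Decode one escape sequence; *pp points just past the backslash on
       entry and past the whole sequence on exit. */
    static int decode_escape(const char **pp)
    {
        const char *p = *pp;
        int c = 0;
        switch (*p) {
        case 'n':  c = '\n'; p++; break;
        case 't':  c = '\t'; p++; break;
        case '\\': c = '\\'; p++; break;
        case '\'': c = '\''; p++; break;
        case '"':  c = '"';  p++; break;
        case 'x': case 'X':                      /* hex: \x7f */
            p++;
            while (isxdigit((unsigned char)*p)) {
                c = c * 16 + (isdigit((unsigned char)*p)
                              ? *p - '0'
                              : tolower((unsigned char)*p) - 'a' + 10);
                p++;
            }
            break;
        default:                                 /* octal: \127, \035 */
            while (*p >= '0' && *p <= '7') {
                c = c * 8 + (*p - '0');
                p++;
            }
            break;
        }
        *pp = p;
        return c;
    }

    /* Copy a quoted string starting at the opening '"' into out; returns
       the first character past the closing quote. */
    static const char *scan_string(const char *p, char *out, size_t outsize)
    {
        size_t n = 0;
        p++;                                     /* skip the opening " */
        while (*p && *p != '"') {
            int c;
            if (*p == '\\') {
                p++;
                c = decode_escape(&p);           /* \" lands here, harmlessly */
            } else {
                c = *p++;
            }
            if (n + 1 < outsize)
                out[n++] = (char)c;
        }
        out[n] = '\0';
        if (*p == '"')
            p++;                                 /* skip the closing " */
        return p;
    }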
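Finally, the plain-text case: scan an identifier, then check it against the keyword list. The keyword set here is obviously illustrative:

    #include <ctype.h>
    #include <string.h>

    static const char *keywords[] = { "if", "else", "for", "while",
                                      "return", NULL };

    /* Advance past one identifier-or-keyword (letters, digits, '_');
       the caller records the text from its start to the returned pointer. */
    static const char *scan_word(const char *p)
    {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        return p;
    }

    /* Nonzero if the len-byte word at p is a known keyword. */
    static int is_keyword(const char *p, size_t len)
    {
        for (int i = 0; keywords[i]; i++)
            if (strlen(keywords[i]) == len && strncmp(p, keywords[i], len) == 0)
                return 1;
        return 0;
    }

The main loop just tries these in order, builds a Token for whatever matched, and hands it to push_token.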
And that's it. No magic involved--just some simple text cracking. The result is that we can stop worrying about the text file that the user supplied; instead, we have a much more programmatically accessible array of tokens. The compiler will start pawing through those tokens to get its work done--in the next post.