Defining the Grammar

Oct 09, 2008 00:43

Having designed the scripting-language-to-be, and having already broken an inbound script into tokens, the next step is to compile the thing. To do that, we'll need a formal grammar for the language: something more precise than the human-readable examples whipped up so far.

The traditional way to lay this out is through BNF format, and to make use of YACC ("yet another compiler compiler"). Again I'm going to skip over that thing and just roll the code myself, but writing out some BNF-like form for the language is essential--especially when you plan to use a recursive descent compiler, since the methods you write will almost exactly follow the BNF assignment statements.

A first cut at the grammar for Dualscript is below. I've smoothed out a few things in the language since I posted earlier: added a keyword here or removed one there, eliminated class-scope methods and data variables, and so forth. The result is a pretty simplistic scripting language that has a certain degree of agreeable symmetry. There are probably some significant errors below--in particular I've omitted the array definition syntax and I'm not happy with postfix_expr through postfix_operator, so I may need to rework those. But it's time for bed now, so this is going to have to suffice.



-------------------------------------------------------------------------------
LANGUAGE GRAMMAR
-------------------------------------------------------------------------------

Dualscript has a particularly simplistic grammar--at least, as simplistic as
anything can be that understands mathematical expression evaluation. Since
the compiler requires a formal definition of the structure of the language,
though, I'm laying it out below.

Personal note: I don't care for or about any particular variant of BNF, so
this is more or less a mixture of various syntactical descriptions. But I bet
you can follow it anyway.

Some conventions I've followed here:

abc* means zero or more instances of abc
[abc] means abc can optionally appear
"abc" refers to the keyword abc
abc | def matches either abc or def
abc def refers to abc then def in sequence

-------------------------------------------------------------------------------
GLOBAL SCOPE
-------------------------------------------------------------------------------

A script is, at heart, a series of global_statement expressions. Each
global_statement is just something that defines a class, a method or a
namespace, or maybe just does some arbitrary thing like "X=5".

script ::= global_statement*

global_statement ::= class_definition |
method_definition |
namespace_definition |
enum_definition |
statement

-------------------------------------------------------------------------------
CLASSES AND METHODS
-------------------------------------------------------------------------------

These two--defining a class and defining a method--are intentionally nearly
identical syntactically. (Later on we'll see that they're nearly identical
in implementation and invocation too, so that makes sense.) Both start with
a short header, then have a brace-encapsulated array of statements. Note that,
while classes can only be defined at global scope (e.g., no classdef statments
are allowed within other classdefs or within methoddefs), methods can be
defined both at global scope and at class scope (but still not inside another
method).

Remember the ellipses we allow on the input parameters for a method? That's
Dualscript's equivalent for stdarg; all unrecognized input params get
thrown into that array.

class_definition ::= "class" identifier "{" class_statement* "}"

class_statement ::= statement | method_definition

method_definition ::= "method" identifier method_params "{" statement* "}"

method_params ::= "requires" identifier ["," identifier]* ["..."]

-------------------------------------------------------------------------------
ENUMERATIONS
-------------------------------------------------------------------------------

Enumerations are really trivial. Note that I don't let you assign particular
values to enum entries; you just get whatever the compiler chooses. Keeps you
from cheating.

enum_definition ::= "enum" identifier "{" identifier ["," identifier]* "}"

-------------------------------------------------------------------------------
NAMESPACES AND IDENTIFIERS
-------------------------------------------------------------------------------

A quick word on these. Although we haven't described them yet in the syntax,
there are two kinds of identifiers: the identifier syntax type is just for a
single C-compliant name (e.g., letters and underscores, with numbers allowed
too except in the first character), while a compound_identifier syntax type
allows one or more "identifier." prefixes first. The namespace command is
the perfect example for introducing the compound identifier, so we'll do that
next.

namespace_definition ::= "namespace" [compound] ;

compound ::= { identifier "." }* identifier

-------------------------------------------------------------------------------
FLOW CONTROL
-------------------------------------------------------------------------------

C allows a lot of ways to make loops happen, and Dualscript supports all of
them except goto. (Goto sucks, right?) So we need to describe those things
too. This is also where we define a "statement"--and, like C, notice that a
single "statement" thing can also be recursively defined as a bunch of
statement things wrapped in braces.

statement ::= for_loop |
foreach_loop |
do_loop |
while_loop |
if_clause |
switch_clause |
"break" ";" |
"continue" ";" |
"return" [expr] ";" |
[expr] ";" |
"{" [statement]* "}"

for_loop ::= "for" "(" [expr] ";" [expr] ";" [expr] ")" statement

foreach_loop ::= "foreach" compound "in" expr statement

do_loop ::= "do" statement "while" "(" expr ")" ";"

while_loop ::= ""while" "(" expr ")" statement

if_clause ::= "if" "(" expr ")" statement ["else" statement]

switch_clause ::= "switch" "(" expr ")" "{" case_statement [case_statement]* "}"

case_statement ::= "case" expr ":" statement |
"default" ":" statement

-------------------------------------------------------------------------------
EXPR
-------------------------------------------------------------------------------

Believe it or not, that's almost all there is to the overall language. At this
point, we've defined the entire language of a script except for that pesky
"statement" type. It's the hardest one we have to deal with, but only because
it allows expressions like this:

x <<= 3 * (2 + function(17, 3 ^ myobject.data));

Anyway, we'll break it down into pieces. First a few categories:

assignment_operator ::= "=" | "*=" | "/=" | "%=" | "+=" | "-=" |
"&=" | "|=" | "^=" | "<<=" | ">>=" | ":="

equality_operator ::= "==" | "!="

relational_operator ::= "<" | ">" | "<=" | ">="

shift_operator ::= "<<" | ">>"

additive_operator ::= "+" | "-"

multiplicative_operator ::= "*" | "/" | "%"

prefix_operator ::= "++" | "--" | "+" | "-" | "~" | "!"

constant ::= integer | float | character | string

Okay, now let's do expr itself. The first step is pretty easy; you see this
rule applied a lot in for loops. From there things get harder, with a deep
cascade supplying precedence for important operations.

expr ::= one_expr [ "," one_expr ]*

one_expr ::= [ unary_expr assignment_operator ]* conditional_expr

conditional_expr ::= logical_or_expr [ "?" expr ":" conditional_expr ]

logical_or_expr ::= logical_and_expr [ "||" logical_and_expr ]*

logical_and_expr ::= bitwise_or_expr [ "&&" bitwise_or_expr ]*

bitwise_or_expr ::= bitwise_caret_expr [ "|" bitwise_caret_expr ]*

bitwise_caret_expr ::= bitwise_and_expr [ "^" bitwise_and_expr ]*

bitwise_and_expr ::= equality_expr [ "&" equality_expr ]*

equality_expr ::= relational_expr [ equality_operator relational_expr ]*

relational_expr ::= shift_expr [ relational_operator shift_expr ]*

shift_expr ::= additive_expr [ shift_operator additive_expr ]*

additive_expr ::= multiplicative_expr [ additive_operator multiplicative_expr ]*

multiplicative_expr ::= unary_expr [ multiplicative_operator unary_expr ]*

Whew. That was pretty ugly, but it brings us all the way down to unary
expressions, which is really where the last bit of magic happens.

unary_expr ::= [prefix_operator] postfix_expr

postfix_expr ::= constant | value_expr [affix_expr]

value_expr ::= identifier | "(" expr ")"

postfix_operator ::= "++" |
"--" |
"." value_expr |
"(" [expr] ")" |
"[" expr "]"

Previous post Next post
Up