Hi again!
Welcome to the second chapter in the PxdScript tutorial series. This chapter deals
with the first module in a compiler - the scanner. If you haven't read the first chapter
yet I suggest you start there and read the chapters in order.
We'll make a soft start and begin with one of the easier tasks in writing a compiler
- the scanning.
So, what does a scanner do? The scanner's job is to perform a lexical analysis of the
source code, transforming it from a plain text file into a list of "tokens" defined
by the programming language. In its basic form the text file containing the source
code for the program is nothing more than a stream of characters. The text file itself
does not contain any information about which characters belong to which tokens. An
assignment, for example, could look like this in the source code:
a = 5
but what we need is really a sequence of three tokens:
tIDENTIFIER tASSIGNMENT tINTCONST
Well, the above is almost true except for a tiny detail. It is customary not to introduce a special token name for tokens that are always exactly one character long. Thus the assignment above is really converted into the following three tokens:
tIDENTIFIER '=' tINTCONST
but that's just a minor detail one should keep in mind.
It is also the scanner's job to make sure that every token in the source code is actually
part of the programming language. The scanner should complain if I tried to write
something like:
a #= 5
because the token '#=' would not be recognized as a valid token in PxdScript.
At first sight it might seem like an easy task to write a scanner. One just reads
one token at a time from the file and checks if it's a valid one, right? And if it
is, it should be easy to classify it as a certain token type? Well, the problem is
that it's not so easy to find out where a token starts and ends in a text file. If
one could always assume that tokens were separated by spaces it could be done but
consider the following source fragment:
while(b>5)
Even though the fragment consists of only one "word" in the text file it is clear that it is a sequence of many tokens. A correct scanning of this single "word" in the text file should result in a list of 6 tokens:
tWHILE '(' tIDENTIFIER '>' tINTCONST ')'
So we abandon the idea of just reading the source file one word at a time and accept
the fact that tokens can begin and end in the middle of "words". But how do we figure
out when a token starts and ends, then? Let's look at the example from above again.
If we only read the first 4 letters of the "word" then we'll get an identifier ("whil")
but as soon as we read the fifth letter that identifier suddenly becomes a totally
different kind of token - namely the reserved word "while" which is represented by
its own token tWHILE.
Ok, so we have a problem - what do we do? We don't have time for this - if we're ever
going to finish our compiler we cannot spend weeks trying to write a scanner which
does everything right. Luckily it turns out that someone has already solved the problem
for us!
Now is the time to introduce the first of two tools we'll use in our compiler creation
- namely "Flex"!
Well, obviously we are not the first to stumble across this problem. There exist thousands
of different programming languages and each of them has faced the very same problem
we face now. At some point someone got tired enough of this to sit down and write
a tool which automatically can generate a scanner module, written in C, for a given
programming language. Flex is exactly one such tool. Of course it needs some sort
of description of our language - it needs to know which tokens we want to allow and
which we want to reject. We tell it through a configuration file - a lex file - written
in a certain syntax, which I'll describe below.
The main part of the lex file consists of a set of rules describing the tokens in
our language. These rules are written as "regular expressions" and our tokens must
match at least one of these regular expressions to be accepted by our language.
A regular expression can be used to describe a whole class of strings. I'll briefly
describe regular expressions through a number of examples covering most of the features
we need rather than writing a detailed description of what it is. Regular expressions
are not that exciting to read about but they are very useful when writing a scanner.
Ok, then there are certain characters which can be applied to the expressions above. The first such character is '*'. In the form above each regular expression matched exactly one character. If one writes a '*' after an expression it means "zero or more times", so:
[ab]*[c]
means a string starting with any number of a and b characters and ending
with a c. aaaac matches the expression above and so does ababac, but aca does
not.
Another character you can write after an expression is + which means "one or
more times". And finally you can write ? after an expression to indicate either
zero or one occurrences of the expression (and not more than one!). For example, c matches
the regular expression above ([ab]*[c]) because * means 'zero or more' occurrences
but it would not match the regular expression [ab]+[c] because + means 'one
or more'. The other examples (aaaac and ababac) would also match the
new expression. Try to think of some strings that match the regular expression [ab]?[c] now...
Enough about regular expressions - let's take a look at how we use them in a lex file!
First we write a small block of C code between %{ and %} which Flex copies directly into the top of the generated scanner - here we just include the headers our actions need:

%{
#include <stdlib.h> /* for atoi */
#include <string.h>
%}
Following that we write:
%%
telling Flex that now we'll list the regular expressions it should match. After each regular expression you can write a block of C code which will be executed when Flex reads a token matching the regular expression from the file. In this small example we just want to see how a file can be turned into a list of tokens so we simply print a line to the prompt using "printf":
[ \t]+                  /* ignore spaces and tabs */;
"+"                     {printf("plus\n");}
"-"                     {printf("minus\n");}
"*"                     {printf("mul\n");}
"/"                     {printf("div\n");}
"%"                     {printf("mod\n");}
"("                     {printf("left paren\n");}
")"                     {printf("right paren\n");}
0|([1-9][0-9]*)         {printf("intconst(%i)\n", atoi(yytext));}
[A-Za-z_][A-Za-z0-9_]*  {printf("identifier(%s)\n", yytext);}
.                       /* ignore other chars */;
There is a built-in variable called 'yytext' which we can use to access the string that actually matched the regular expression. Notice how we use it to get at the values of the identifiers and integer constants. Finally we finish the lex file by writing another
%%
to tell Flex that there are no more regular expressions. You can find this tiny lex file here or create your own from the above.
Let's take a closer look at the two more complex regular expressions in the list of rules above. First there is the expression for integer constants:
0|([1-9][0-9]*)
In English this rule reads: "Either the string is just the single character '0', or it starts with one of the digits 1 to 9 followed by any number of digits 0 to 9". At first you might think this is a very complicated way of representing a sequence of digits. Why not just use the expression '[0-9]+', that is: "one or more digits from 0 to 9"? Well, such an expression would also allow numbers of the form '007', which is cool for a James Bond movie but not practical for a programming language. We don't want any leading zeros, and if you look closely at the regular expression I used in the toy lex file you'll see that it can form any integer - but it does not accept numbers with leading zeros.
The rule for identifiers has a similar justification for why it's not just '[A-Za-z0-9_]+'. We want to prevent certain strings from being identifiers - which strings are those?
Flex works from the philosophy that it's better to "eat" as much from the source file
as it possibly can at each step. Thus Flex will always use the regular expression
which has the longest match when it scans the source file for the next token. This
is to avoid having "while" recognized as two identifiers "whi" and "le" for example.
If more than one regular expression can match a token from the source file with equal
(longest) length then the rule listed first in the lex file is used. So it
is safe to use the "dot" expression above to "eat" illegal characters because as it
is listed last it will only be used when every other expression fails. If Flex did
not have this rule about the order of the expressions we would have a problem, because
often two expressions would both be the best possible match - for example when the
next token in the source file is a '+', which both the "+" and the "." expressions
would match.
Now we want to test the lex file to see if it does what we want it to do and to do
that we'll have to compile the scanner into a program. First we must run 'Flex' on
the lex file to automatically generate the C code for the scanner. To do that open
your bash (assuming you use Cygwin or a real Linux/Unix system) and type:
flex numbers.l
This will compile the lex file into the C file lex.yy.c. To compile this file you use gcc by typing
gcc -c lex.yy.c
gcc lex.yy.o -o numbers.exe
Fairly painful huh? I know, so I've also provided you with a Makefile to compile this
little example (including running flex) so just type make and hit the enter
key.
To test the scanner you can use the small test file I also provided with the zip package.
WARNING: the numbers.exe does NOT read the filename from its arguments - the
test file must be piped to the scanner by typing
numbers < numbertest.txt.
Now you should see, in the bash, a list of the tokens matching the contents of the test file.
Ok, before you can understand the lex file for PxdScript (lang.l) I'll have to tell
you a little about some of the more advanced features of Flex. The full listing of
lang.l can be found in the section below this one.
The first advanced feature shows up right after the
%{
%}
block in lang.l and is the single line
%x commentsc
This line declares that we have grouped the regular expressions into two groups -
each with a totally different set of rules! We always start in what is marked as the
INITIAL start condition, which is similar to where we put the rules in the tiny example
above.
The other start condition is called "commentsc" and the idea is that Flex can only
use rules from one group at a time. When we are scanning a comment in the source
code we want to use the special set of rules defined in the <commentsc>{ }
block. This is to avoid recognizing the commented-out text as tokens.
As you can see we never return any tokens from within the comment start condition
- we just "eat" the characters from the source code while still keeping track of
line numbers etc. One switches between start conditions using the BEGIN(name) macro. Also notice
how easy it turns out to be to implement nested comments - that is multi-line
comments inside other multi-line comments. That's something I've always wanted in
C++.
We could easily define many more start conditions if we needed them but for most languages
two will suffice.
Secondly, don't worry about where all the tokens used in the lex file have been defined.
Flex works very tightly together with Bison (which we will cover in the next chapter)
and the tokens and some of the variables used come from Bison. Actually it's Bison
which calls Flex to ask for the next token in the program when it needs one, but we'll
look at that in the next chapter. So for now you won't be able to get the lang.l to
compile into a scanner like we did with the small toy example above!
Finally, notice how I have simply written the reserved keywords as they are and Flex
recognizes them as regular expressions (because that's what they are ;)
Ok, so now you should be able to understand the lex file used for scanning PxdScript
programs, so let's have a look at it:
/* pxdscript flex-file
 *
 * History:
 */
%{
#include <string.h>
#include "lang.tab.h"
#include "memory.h"
#include "error.h"
extern int lineno;
int commentDepth = 0;
%}

  /* exclusive start conditions */
%x commentsc

%%

  /* comment start-condition for recognizing multi-line comments */
<commentsc>{
  \n                     lineno++;
  "*/"                   { commentDepth--;
                           if (commentDepth == 0) BEGIN(INITIAL); }
  "/*"                   commentDepth++;
  "//"[^\n]*             /* single-line comments take precedence over multi-line */;
  <<EOF>>                yyerror("Unclosed comment.");
  [^*/\n]+               /* eat anything that's not a '*' or '/' to increase speed */;
  .                      /* ignore all other characters (the commented-out text) */;
}

  /* INITIAL start-condition for ordinary pxdscript-code */
[ \t]+                   /* ignore */;
"/*"                     { commentDepth++; BEGIN(commentsc); }
\n                       lineno++;
"//"[^\n]*               /* ignore (single-line comment) */;

bool                     return tBOOL;
const                    return tCONST;
else                     return tELSE;
for                      return tFOR;
if                       return tIF;
int                      return tINT;
return                   return tRETURN;
program                  return tPROGRAM;
string                   return tSTRING;
void                     return tVOID;
while                    return tWHILE;
trigger_on_init          return tTRIGGERONINIT;
trigger_on_collide       return tTRIGGERONCOLLIDE;
trigger_on_click         return tTRIGGERONCLICK;
trigger_on_pickup        return tTRIGGERONPICKUP;
setint                   return tSETINT;
TYPE_MDL                 return tENTITYMDL;
TYPE_PARTICLE            return tENTITYPARTICLE;
TYPE_PLY                 return tENTITYPLAYER;
TYPE_LIGHT               return tENTITYLIGHT;
TYPE_CAMERA              return tENTITYCAMERA;
sleep                    return tSLEEP;

"&&"                     return tAND;
"=="                     return tEQUALS;
">="                     return tGEQUALS;
"<="                     return tLEQUALS;
"!="                     return tNEQUALS;
"||"                     return tOR;
"="                      return '=';
">"                      return '>';
"<"                      return '<';
"!"                      return '!';
"+"                      return '+';
"-"                      return '-';
"*"                      return '*';
"/"                      return '/';
"%"                      return '%';
"{"                      return '{';
"}"                      return '}';
";"                      return ';';
"("                      return '(';
")"                      return ')';
","                      return ',';

0|([1-9][0-9]*)          { yylval.intconst = atoi(yytext);
                           return tINTCONST; }
true                     { yylval.boolconst = 1;
                           return tBOOLCONST; }
false                    { yylval.boolconst = 0;
                           return tBOOLCONST; }
\"[^\"\n]*\"             { yytext[strlen(yytext)-1] = '\0'; /* remove "'s from yytext */
                           yylval.stringconst = strdup(yytext+1);
                           return tSTRINGCONST; }
[A-Za-z_][A-Za-z0-9_]*   { yylval.identifier = strdup(yytext);
                           return tIDENTIFIER; }
.                        yyerror("Invalid character '%s'.", yytext);
%%
The scanner will reject a program like:

program start trigger_on_init {
    int b@5;
}

because '@' is not a valid character in PxdScript,
but will happily accept programs like this:
program int foobar = 5 < 8;
Programs like the one above are the kind of illegal programs which we'll reject in
the next chapter.
IMPORTANT WARNING: There is one thing I'll point out to save you from a few
sleepless nights - never begin a comment in a lex file at the first character of a
line! Notice how I start the comments a few characters into the line. Try moving the
comments all the way to the left and run flex on it (add a comment in numbers.l).
Well, maybe Flex is a nice tool but it sure also has some flaws ;)
Well, that was the second chapter of this series. So far it's progressing fairly well
I think. The next chapter will be a bitch - both for me to write and for you to read.
Now you have been warned ;)
The good thing is that after chapter 3 the chapters will be shorter and after each
completed chapter you will be able to see the progress that your compiler has made
since the last chapter which is quite fun ;)
As always - send me comments on what I do well and what I do badly.
Keep on codin'
Kasper Fauerby
www.peroxide.dk