void : the 'empty' type.
int : the type consisting of whole numbers.
bool : the type consisting of truth values true and false.
string : the type representing a series of characters (actually a series of char's - a type we chose not to include)
When you make a variable declaration in your source code like for example
int x;
you are actually defining an invariant promising that at any time the
variable x will only contain values of the type int.
A program is type correct if all such invariants are valid.
Unfortunately this is an undecideable problem - in other words,
if this is our only restriction then we cannot program a phase to check if a program
is valid or not. If that were the case then this series would be over and as you can
see I have planned 7 more chapters which should be a good indication that we'll do
something else. To see why this is undecideable one must realize that the validness
of the invariant depends on program flow - for example the following piece
of code :
int x;
if (false)
x = "foobar";
actually is a type correct program. At any time in program execution the variable will only contain values of type int. Now in this simple case it is easy to see that x will never be assigned the string. One could even add a rule to the weeding phase to weed out all occurrences of if(false) statements. However it is not hard to construct more complex examples where it is not so easy for a computer-program to see that the code is indeed valid even if it seems obvious to humans. For example :
int x = 3;
int y = 7;
bool b = (x == y);
if (b)
x = "foobar";
I'll bet that no-matter how many and how good rules you make to check these things
then I can always come up with a valid program which your compiler will reject!
So, instead of going for a type correct program we decide to simply define
a bunch of type rules and then we say that a program is statically
type correct if it satisfies all these rules. We will always end up with a
type-checker which accepts too few programs! It will always reject certain valid programs.
The way to improve a type system is to find a special case which the type-checker
rejects even though it should accept it and then write some new code to make it accept
that case - without breaking the code so it will accept illegal programs
too! If you want your 5 minutes of fame you could download the source to the GNU compiler gcc and
do this :)
So now we have to define a bunch of type rules for PxdScript. I have decided to go
for a fairly strict type system where each component in an expression must match each
other perfectly. For example, you can only add int to int, assign string to
a string variable etc. Well, actually this is a truth with some modifications.
Remember that I added the cast operator when I designed the language. This
means that you can explicitly cast a variable to certain other types. Of course we
need to put up type rules about which variables can be cast to which etc. but in the
case of PxdScript where we have so few types that it should be fairly easy.
However, one would quickly get tired of explicitly having to cast variables even for
the most trivial cases. One would like the language to behave like for example C where
it is legal to assign an int to a float variable etc. This is called implicit
casting and it simply means that in certain cases the compiler will insert
casts for you when it can be done without loss of data. For example the following
would be legal (if we had floats in PxdScript) :
int x = 3;
float y = x;
because the integer value could be cast to a float without any loss of data. On the other hand the opposite direction :
float x = 3.4;
int y = x;
would not be accepted because even though a float could be cast to an int we should not do so automatically because chances are that it was a mistake and the programmer actually meant for y to be a float variable too. In this case the programmer would have to explicitly cast the variable like this :
float x = 3.4;
int y = (int)x;
The beauty of having implicit casts is that it allows us to define strict type rules
like I mentioned above. The compiler will actually insert a cast-node into
the AST to make any implicit cast explicit and thus the types of the components in
the expressions will always match perfectly. The problem is of course that we have
to code this functionality but we wont let this hold us back.
Notice that while explicit casts are something you can define generally for the types
then implicit casts are something you define for each type of operation. There are
quite a few places in PxdScript where we apply implicit casting and one of the places
is the addition rule. Of course we allow strings to be added to strings but
we'll also allow adding ints to strings like this:
int nr = 5;
string s = "this is nr "+nr;
This is a good example of an implicit cast which only holds for the addition
operation. It would not make sense to allow the same for subtraction!!
Ok, so which explicit casts do we allow then?
Source type: |
Legal casts: |
void |
(none) |
int |
string, int |
bool |
string, bool |
string |
string, int, bool |
One could argue that one should be able to cast between bool and int too, but then
you are in the lucky position of having the source code for the type checker so you
can add this yourself ;)
Besides all this casting stuff the type rules are as you would expect them to be when
I told you they were strict. Look through the code to see the exact rules.
It should come as no surprise that we do the type checking by making a recursive run
through the AST, just as we have seen it in the other phases. Most of the code is
pretty self-explaining but there are a lot of functions which are not part of the
recursion but rather tools we need in the type checking. I'll step through them here:
The first is initTypes() which simply
initializes a few variables we use to compare with when we want to check if a certain
type is int or bool etc.
One thing to notice is the variable undefinedTYPE which I declare here.
As I'm sure you know from your work with C++ (or even worse: JAVA) a small error can
result in a huge list of cryptic errors from the compiler and quickly
one learns to start with the topmost error, correct that one and then try to compile
again. Often a lot of the other errors has disappeared too then. This is because they
weren't really errors but rather results of the first error!
Now that you know a bit more about compilers you might be able to see why this happens.
Imagine there is a type error deep inside an expression. In the AST this error will
then be deep down some branch and as the type checking is recursive it will also happen
deep down a recursion. When the error is detected the compiler marks that particular
node in the AST as not type correct and returns from the recursion.
The parent to the node probably uses the node and expects it to be of a certain type.
Because of the error though the node has not been marked as having the
expected type and therefore the parent will also fail its type check
and report another error and so on.
The idea behind this undefinedTYPE is that when an error occurs
the node is marked as having this type and the parent will check for this and not
report an error if the child has been marked with this type. The result is that PxdScript
will behave nicely and not come up with those cryptic error messages we know from
other languages.
The next function worth noticing is
TYPE *greaterType(TYPE *l, TYPE *r)
{
/* Pass on undefinedTYPE: */
if (l == undefinedTYPE || r == undefinedTYPE)
return undefinedTYPE;
/*bool*/
if(l->kind == boolK && r->kind == boolK)
return boolTYPE;
/*int*/
if(l->kind == intK && r->kind == intK)
return intTYPE;
/*string*/
if(l->kind == stringK && ( r->kind == stringK ||
r->kind == intK || r->kind == boolK))
return stringTYPE;
/* l is not greater than or equal to r */
return undefinedTYPE;
}
This function takes two types and returns the 'greatest' of them after a certain ordering.
The ordering is fairly intuitive and represents an ordering of how complex the different
types are compared to each other. The function only defines half the ordering
- i.e.. it is based on the 'left' parameter and then looks at the right. This is because
when we need it we simply call it twice, the second time with its arguments switched.
Look at the source below for more details about using this function.
Next comes:
TYPE *greatestCommonType(TYPE *t1, TYPE *t2) {
TYPE *type;
if((type = greaterType(t1,t2)) != undefinedTYPE)
return type;
else if((type = greaterType(t2,t1)) != undefinedTYPE)
return type;
return undefinedTYPE;
}
which does exactly what the name implies - it finds the greatest common type that the two types are both compatible with. This is used when deciding which type one should give an expression and is also used to find out about the implicit casts. For example it is used in the equals node in the AST where you could (theoretically) write:
bool b = true;
if (b == "true") { .. }
In this, one parameter would be of type bool and the other of type string. greatestCommonType would figure out that the greatest type the two have in common is string and the compiler would insert an implicit cast to cast b to a string. The final code would then look like:
bool b="true;
if ((string)b == "true") { .. }
Next follows the function:
bool assignable(TYPE *left, TYPE *right) {
/* undefinedTYPE always yields success: */
if (left == undefinedTYPE || right == undefinedTYPE)
return true;
if (greaterType(left, right) != undefinedTYPE)
return true;
else
return false;
}
which decides if the right type can be assigned to the left type.
This is possible exactly if the left type is 'greater' than the right.
And finally, the last 'utility' function is the function which implements the casting
rules from the table above:
bool validCast(TYPE *source, TYPE *dest) {
/* rules for cast int -> ?? */
if (source->kind == intK)
if (dest->kind == stringK || dest->kind
== intK)
return true;
/* rules for cast string -> ?? */
if (source->kind == stringK)
if (dest->kind == intK || dest->kind
== stringK || dest->kind == boolK)
return true;
/* rules for cast bool -> ?? */
if (source->kind == boolK)
if (dest->kind == stringK || dest->kind
== boolK)
return true;
return false;
}
Ok, now for the recursion through the AST. As usual we start at the topmost level and work our way down the tree. The first interesting point in this recursion comes in the function which type-checks variable declarations:
void typeDECL(DECL *d)
{
switch(d->kind){
case formalK:
break;
case variableK:
// recurse on initialization
if(d->val.variableD.initialization != NULL){
typeEXP(d->val.variableD.initialization);
// check
if we can assign initialization to the variable
if (!assignable(d->type, d->val.variableD.initialization->type))
reportError(d->lineno, "Cannot assign
%s value to %s lvalue",
typeToString(d->val.variableD.initialization->type),
typeToString(d->type));
// Notice we check if the undefinedTYPE flag is set and only touch the
node if it isn't
if (d->type != undefinedTYPE && d->val.variableD.initialization->type
!= undefinedTYPE) {
/* make implicit casts explicit */
if (d->val.variableD.initialization->type->kind
!= d->type->kind){
d->val.variableD.initialization
= makeEXPcast(d->type, d->val.variableD.initialization);
d->val.variableD.initialization->type
= d->type;
}
}
}
break;
case simplevarK:
....
break;
}
if(d->next != NULL)
typeDECL(d->next);
}
Look through the code above and note how we apply both type-checking and implicit
cast here.
The last bit of code I'll list here is the code for type-checking a call to a function:
void typeEXP(EXP *e)
{
struct TYPE *type;
int argumentNo;
struct DECL *currentFormal;
struct EXP **currentArgument;
if (e == NULL)
return;
switch (e->kind) {
.
.
case invokeK:
// First type check all the
arguments. Remember these can be
// non-trivial and consist
of complex expressions and even other
// function calls.
if (e->val.invokeE.arguments != NULL)
typeEXP(e->val.invokeE.arguments);
// Check which kind of symbol the invoked
name refers to
if (e->val.invokeE.symbol->kind ==
functionSym) {
/* It is a function
*/
/* Check argument
types against formal types: */
argumentNo=1;
currentFormal =
e->val.invokeE.symbol->val.functionS->formals;
currentArgument
= &(e->val.invokeE.arguments);
while (currentFormal
!= NULL && (*currentArgument) != NULL) {
/* Test if current argument is assignable to current formal */
if(!assignable(currentFormal->type, (*currentArgument)->type)) {
reportError(e->lineno,
"Argument %d cannot be assigned to formal %d in function '%s'",
argumentNo,
argumentNo,
functionToSignatureString(e->val.invokeE.symbol->val.functionS));
}
/* make implicit casts explicit */
type = greaterType(currentFormal->type, (*currentArgument)->type);
if (type == undefinedTYPE) // if no greater type then we know there has been an error
and bails
break;
if ((*currentArgument)->type->kind != type->kind){
*currentArgument = makeEXPcast(currentFormal->type, *currentArgument);
(*currentArgument)->type = currentFormal->type;
}
/*
Next: */
argumentNo++;
currentFormal = currentFormal->next;
currentArgument = &((*currentArgument)->next);
}
/* Report argument
count errors: */
if (currentFormal
!= NULL && (*currentArgument) == NULL)
reportError(e->lineno, "Too few arguments to function '%s'",
functionToSignatureString(e->val.invokeE.symbol->val.functionS));
else if (currentFormal
== NULL && (*currentArgument) != NULL)
reportError(e->lineno, "Too many arguments to function '%s'",
functionToSignatureString(e->val.invokeE.symbol->val.functionS));
} else {
/* It is a program
*/
reportError(e->lineno,
"Programs cannot be called");
}
/* Set expression's type to the returntype
of the function: */
e->type = symbolType(e->val.invokeE.symbol);
break;
.
.
.
}// end switch
if(e->next != NULL)
typeEXP(e->next); }
Read through the above until you get the hang of it and then move on to the source you downloaded for this chapter and look through the rest. As with many other phases in a compiler there is a lot of code, but it's not that hard when you have an idea about what is going on in it.
Well, that was the last phase in our front-end. Our compiler should now be able to
reject all illegal programs as well as provide the user with good error messages when
errors occurs. It is possible that there are errors in the code for this chapter.
Type-checking is one of the mode complex phases and there are a lot of special cases
to take into account when one starts playing around with this undefinedTYPE stuff
etc. So if you find (and maybe even fix) an error in the source please email me so
that I can add the changes to the 'notes' page.
We are now about halfway through our compiler. My teacher once made a funny comment
about writing compilers. He said "Writing compilers is like skiing. At first you walk
uphill (the front-end) for hours which is fairly tedious but then you reach the top
and it's all getting fun from here (the back-end)". So lets take a breather while
sitting on the top, looking down at what we have accomplished so far but fasten your
seatbelt because next time we'll start the trip downhill ;)
Kasper Fauerby
www.peroxide.dk