PxdScript language tutorial series.

Chapter 4 - Weeding:

Introduction

Hi again guys.

Today we'll take a look at the first module in our analyzing phase - namely the weeding module. So what does this module really do? It does exactly what the name implies - it looks through the code and weeds out programs which contains certain 'bugs' which we didn't catch in our parsing phase. Weeding is very easy to do and it uses the same technique as the pretty-printing module which I gave you in the last chapter to traverse recursively through the AST. So let's take a look at which problems we wish to catch during our weeding phase.

Checking returns / Unreachable code:

The first and most important thing we want to check is that every non-void function actually returns a value - always! This might sound like a very trivial thing but try to construct a context free grammar which only allows non-void functions if they contains a return statement in every possible branching through the function body! If you can do this then I would be very interested in seeing it - and so would a lot of other people ;)

But even if you could write such a grammar it would be very big and hard to read and understand - the added feature of the language would have too high a cost. So we choose to create a grammar which accepts a slightly larger language than the one we want and weed out the bad AST's during this weeding phase. So how do we do this? We run through the AST and for every non-void function we check the STM nodes which makes out its body to see if every pass through it contains a return statement. I wrote this function (which can be found in weed.c in the zip):

bool weedSTMCheckReturns(STM *s)
{
    switch(s->kind){
        case returnK:
            return true;
        case ifelseK:
            return (weedSTMCheckReturns(s->val.ifelseS.thenpart) &&
                    weedSTMCheckReturns(s->val.ifelseS.elsepart));
        case sequenceK:
            if (weedSTMCheckReturns(s->val.sequenceS.first) == true) {
                reportWarning(s->lineno, "Unreachable code");
                return true;
            } else
                return weedSTMCheckReturns(s->val.sequenceS.second);
        case scopeK:
            return weedSTMCheckReturns(s->val.scopeS.stm);

        default:
            return false;
    }
return false;
}

and that's all!

So let's see that this really does what we want. As you can see it recurses through the STM nodes of a function body. Remember how the statements are build up in our AST. They are not linked lists of STM nodes - if a function or program body has more than one statement then the single STM node in the FUNCTION or PROGRAM node is expanded using the sequence STM node.

Now there are only very few interesting cases: If a return STM is reached then obviously the function returns and the recursion is ended and 'true' is returned. If we reach a if-then-else STM then we recurse on both the 'if' and the 'else' part and requires that both of these paths contains a return statement. If we reach a sequence STM we just checks both statements in the sequence. Notice how we get the 'unreachable code' check for free here - if the first statement in a sequence actually is a return STM (which is the only way it can return true) then we will never execute the second statement in the sequence (think about it).

I have decided that unreachable code isn't an error as such so I just report a warning if I find it - if you want a more strict language you could report an error which would halt the compilation.Every other statement is covered by the default case in the switch and the default answer to whether or not a statement is a return is false.

Notice our conservative choice regarding if sentences (not if-then-else, but those only containing an if part). Without even checking the body of the if sentence we return false when we reach it. That means that we will reject functions such as:

int test() {
   if (true) {
       return 5;
    }
}

even though we can see that every possible path through this function will contain a return. The reason is that the boolean condition in the if sentence could be anything, it's only because we can see that it is constant 'true' that we accept this function as valid. The compiler is more general but of course you can always add special code for checking these things and thus allowing a function like the one above. This is called static analysis and is not something I'll cover in this series.

And how do we use the function above? Well, we start our recursion through the AST and when we reach a function node we do the following:

void weedFUNCTION(FUNCTION *f)
{
    if(f->formal != NULL)
       weedDECL(f->formals);

    if(f->stms!= NULL) {
       weedSTM(f->stms);

    if (weedSTMCheckReturns(f->stms) == false && f->type->kind != voidK )
        reportError(f->lineno,"Function must return a value");
    }
}

Weeding double scopes:

There is another small thing that I remove during the weeding phase and that is the appearance of double scopes. They appear because of my general handling of the places where a new scope might be introduced - for example at a for-sentence. Consider the following two for-sentences:

for (int i=0; i<5; i=i+1)
{ // here I explicitly declare a new scope
a = a * 2;
}

for (int i=0; i<5; i=i+1)
a = a * 2;

We are used to that we are allowed to write for-sentences without { } around the body if it only contains a single statement. In our grammar we simply say that the body of a for sentence is a 'stm' nonterminal. The stm could then either expand to a single other statement (if we do not use { } ) or to a compound statement if we do use the { } in the body. The compound statement would introduce a new scope but the single statement would not.

However, we want a new scope for the body of the for sentence whether or not the user wrote the {} so in our production rule for the for-sentence in the lang.y file we explicitly made a scope around the body. If the user then actually did write the { } around the body then we ends up with a double scope - i.e.. the two examples above would be pretty printed as:

for (int i=0; i<5; i=i+1)
{ // automatically added scope
{ // users scope from above
a = a * 2;
}
}

for (int i=0; i<5; i=i+1)
{ // automatically added scope
a = a * 2;
}

Actually it wouldn't be a problem with the extra scope but I don't think it looks very nice so we might as well weed it out.

This is done in the weedSTM function in weed.c file and the part where it happens looks like this:

case scopeK:
    /* weed 'double scopes' */
    if(s->val.scopeS.stm->kind == scopeK){
        s->val.scopeS.stm = s->val.scopeS.stm->val.scopeS.stm;
    }
    weedSTM(s->val.scopeS.stm);
break;

so if we reach a scope STM and the stm pointer in the scope points at another scope node then we simply link the pointer to the second scopes statements.

Homework:

Urgh, you don't really like the sound of the heading for this section do you? ;)

Well, this chapter has been a bit too easy don't you think? Well, now it is time for you to mess around with the code on your own for the first time. As you can see I have actually only analyzed a small part of the AST in this chapter - namely the STM nodes. Yet I have left code to recurse on a large part of the rest of the AST - for example expressions and declarations. How come? Well, there are more things you could weed out in this module and you are the one who is going to do it. I'll give you one thing to weed out and then you can spend some time thinking about other things you could catch. The one thing I'll suggest is weeding of division by zero. You decide how to do it but in its simplest form (namely weeding out division by constant zero) it should be easy to do. Good luck fooling around with the code...

Checklist

Final words

So, this chapter was fairly easy huh? So what is our current status? Well, about the same as in chapter 3 except that we fixed a few flaws introduced by our context free grammar. Another thing, I've put up a small html page on my site where I'll write small notes now and then when I discover errors in the previous chapters or in the source code. This saves you from having to re-read the chapters just because of small changes and provides a fast way to maybe get an answer to a question you might have if you think you have spotted an error somewhere. If you are reading this you should know but the address of my homepage is: www.peroxide.dk

Stay tuned for chapter 5 soon..

Keep on codin'
Kasper Fauerby

www.peroxide.dk