Our compiler for PxdScript differs a bit from the examples given above in that it
doesn't compile to machine code which can run directly on the computer's hardware.
Instead it compiles to the instruction set of a virtual machine, which is then
emulated on the specific hardware/OS platform that the virtual machine runs on.
In other words, the virtual machine is a program written specifically for a certain
platform - in our case for Windows - which can read in a compiled file containing
a PxdScript program and execute it.
This is not as unusual as you might think - JAVA for example is a language designed
to run on a virtual machine (the JVM). You might have heard that JAVA is a platform
independent language and that every JAVA program can run on a wide range of computer
systems. This is also true - the .class files that the JAVA compiler emits contain byte
code which can run on many different systems like Windows, Linux, Mac etc.
But remember that the .class files do not execute directly on those
platforms. You cannot simply type in the name of a .class file in a DOS prompt and
have it execute under Windows. Instead you have to run the .class file through another
program - the JVM or "Java Virtual Machine" - by typing 'java myprog' at the
prompt.
The file 'java.exe' is a program compiled specifically for Windows and it cannot be
run on other platforms. The reason why the 'myprog.class' program can run on many
different platforms is that programs similar to 'java.exe' exist on all of these platforms.
The power of a virtual machine is that once it has been implemented on a specific
platform we can forget that the language is emulated and consider the virtual
machine just as valid and 'physical' as the machine it runs on. It has
become a layer of higher abstraction through which we can operate the physical hardware.
It is for example not important to the JAVA programmer that the compiled file doesn't
execute directly on the different machines - the only thing that matters is that the
program solves the problem it was designed to solve. As a matter of fact the computer
as we know it is already built up from a stack of virtual machines, each emulating
a language of a higher level than the one below it. At the very bottom
we have the CPU which provides its users with a certain instruction set. This set
typically contains instructions such as addition, subtraction etc. But even these
seemingly atomic instructions (we usually consider the addition of two numbers
a single operation) are actually 'virtual' instructions emulated by micro programs which
control the different transistors, registers and gates inside the CPU. The "VM" for
these micro programs is implemented in hardware on the CPU. The OS runs
on top of these CPU instructions and provides the user with instructions of a higher
level, such as instructions to handle files without having to manually specify which
physical disk blocks on the hard disk you wish to access. It is more convenient
to be able to type 'fopen("myfile.txt", "rt");' instead of having to tell the hard
disk to start its motors, tell it which disk block we want to access etc. - and we
are able to do that because the OS provides us with an emulated instruction 'fopen'.
On top of the OS we have assembly language, which provides the user
with a programming interface that is easier to read than pure machine code (i.e.
hex numbers in a file), and on top of that we have the high-level languages we know
such as C, Pascal etc. to further simplify the task of programming.
These last two layers are different in that they are not emulated by the layer below
them - instead they are assembled or compiled into machine code using the OS instruction
set - that is, they are converted from a higher-level language to the
low-level language defined by the OS instruction set.
Each new layer in the stack is added to make the computer easier to work with. It
is for example much easier to write code in C than in assembly (in most people's opinion
at least). If one wishes to do so, one is free to add a new layer on top of the stack
- and this is what we're going to do with PxdScript. PxdScript is going to be emulated
by a VM written in C/C++ - a language one level below PxdScript in the stack.
There is still another difference between languages like C or Pascal and PxdScript. Languages like C are (as mentioned above) compiled into machine code at the OS level, and as far as we are concerned they execute directly on our computer. Languages like PxdScript and JAVA are compiled to something called byte code, which cannot be executed directly - instead the meaning of the code must be interpreted by an interpreter as the program executes. A program compiled to machine code is loaded into memory and the CPU then executes it instruction by instruction. Interpreted code can also be placed entirely in memory, but it must be 'fed' to the CPU one instruction at a time by the interpreter. The overall operation of an interpreter is very simple and looks like the following loop:
1. Get the next byte code instruction from memory or disk.
2. Process the byte code and convert it to OS-instruction level, possibly reading in arguments.
3. Execute the converted instruction on the CPU using one or more OS-level instructions.
4. If more byte codes remain, repeat from 1.
And this is, in a nutshell, how we will code our virtual machine to execute PxdScript programs in chapter 11 when we have the final compiled byte code - only we will not convert the PxdScript instructions to OS level code but simply match them with corresponding functions within our game engine (which of course in the end is converted to OS level code by the C/C++ compiler we use for our game engine).
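To make the loop a little more concrete, here is a minimal sketch in C of such a fetch-decode-execute loop. The opcodes (and the whole two-instruction 'language') are made up purely for illustration - the real PxdScript instruction set and VM are defined in the later chapters.

#include <stdio.h>

/* Two hypothetical byte code instructions - NOT the PxdScript set */
enum { OP_PRINT_HELLO = 0, OP_HALT = 1 };

void interpret(const unsigned char *cs)      /* cs = the byte code in memory */
{
    int pc = 0;                              /* program counter              */
    for (;;) {
        unsigned char op = cs[pc++];         /* 1. get the next byte code        */
        switch (op) {                        /* 2. decode it                     */
        case OP_PRINT_HELLO:                 /* 3. execute it via OS level calls */
            printf("hello from the VM\n");
            break;
        case OP_HALT:
            return;                          /* no more byte codes - stop        */
        }
    }                                        /* 4. otherwise repeat from 1       */
}

int main(void)
{
    const unsigned char program[] = { OP_PRINT_HELLO, OP_PRINT_HELLO, OP_HALT };
    interpret(program);
    return 0;
}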
So far I have just said that PxdScript will run on a VM and that it will compile to
an instruction set defined by that VM - but I haven't described how the instruction
set is defined or how the virtual machine runs the programs. The VM will be a
so-called stack-based machine, as opposed to, for example, a PC, which is
register-based. The difference is perhaps best described
through an example - the addition of two numbers.
Those of you who have written some assembly code before know that to add two numbers
on a PC we would first place the numbers in two registers and then add those. Something
like this:
mov ax, 5 ; move 5 into the ax register
mov bx, 8 ; move 8 into the bx register
add ax, bx ; add the content of bx to ax
Notice that to add the numbers we would have to find two unused registers to hold
our values. The PC only has a very limited number of registers available for general
purposes (ax,bx,cx and dx) so one has to be very careful about how one uses the registers.
It is for example not possible to hold 100 different numbers in the registers at the
same time. If there are no more free registers then we will have to save the value
of some registers to memory and read them back later. This limitation is due to the
fact that the registers we use for the operation represent actual physical registers
inside the CPU - and the number of those is of course limited.
Because PxdScript runs on a virtual machine we do not have to suffer from such limitations.
Our VM is an imaginary machine and if we wanted to do so we could define it to have
thousands of registers (we could easily emulate them by using the disk and some addressing
scheme - right?). But as it turns out using registers is not necessarily the best
way to do it. It is for example quite complicated for the compiler to figure out which
registers to use - even if we had virtually infinitely many. Instead we define the
VM to run using a stack for all its calculations. Again let me demonstrate through
an example:
ldc_int 5 ; load constant '5' and put it on the stack. Stack is now : 5
ldc_int 8 ; load constant '8' and put it on the stack. Stack is now : 5,8
iadd      ; pop two ints from the stack, add them and push the result. Stack is now : 13
The little snippet of assembly above is actually how the PxdScript compiler generates code for the simple expression '5 + 8'. Notice how the meaning of the assembly code is perfectly clear - without the use of any registers! I'll show you the code for a larger expression, '(5 + 8)*5 - 4', so you can get a little more used to the concept of stack-based calculations.
ldc_int 5 ; Stack : 5
ldc_int 8 ; Stack : 5,8
iadd ; Stack : 13
ldc_int 5 ; Stack : 13,5
imul ; Stack : 65
ldc_int 4 ; Stack : 65,4
isub ; Stack : 61
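If you want to convince yourself that this really works, here is a small sketch in C of a stack machine executing exactly that sequence. The mnemonics match the symbolic code above, but the numeric opcode values (and the fact that the constant argument fits in one array element) are assumptions made just for this illustration:

#include <stdio.h>

/* Assumed opcode values - the real byte numbers are defined in the next chapter */
enum { LDC_INT, IADD, IMUL, ISUB, HALT };

int main(void)
{
    /* '(5 + 8)*5 - 4' compiled to: ldc_int 5, ldc_int 8, iadd, ldc_int 5, imul, ldc_int 4, isub */
    const int code[] = { LDC_INT, 5, LDC_INT, 8, IADD, LDC_INT, 5, IMUL, LDC_INT, 4, ISUB, HALT };

    int stack[64];
    int sp = 0;                     /* index of the next free stack slot */
    int pc = 0;                     /* index of the next instruction     */

    for (;;) {
        switch (code[pc++]) {
        case LDC_INT: stack[sp++] = code[pc++];         break;  /* push constant                */
        case IADD:    sp--; stack[sp - 1] += stack[sp]; break;  /* pop two, push the sum        */
        case IMUL:    sp--; stack[sp - 1] *= stack[sp]; break;  /* pop two, push the product    */
        case ISUB:    sp--; stack[sp - 1] -= stack[sp]; break;  /* pop two, push the difference */
        case HALT:    printf("result: %d\n", stack[sp - 1]);    /* prints 61                    */
                      return 0;
        }
    }
}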
So the VM does all its calculations on a stack. But how does it store its variables then? Well, also on the stack actually! The number of variables is something we can calculate at compile time, and we can then reserve space for them at the bottom of the stack. This is one of the things we will calculate in the code for today's chapter! The variables are numbered from 1 and up, starting with the formals (the parameters to a function - remember?) and then the variables in the body of the code as we reach them. So if for example we calculate that we have 7 different variables then we reserve 7 stack positions for variables and start our calculations at stack position 8. We can then load a variable to the top of the stack with the 'iload' instruction and save the top-most stack item to a variable with the 'istore' instruction. Examples:
iload 3 ; Load the 3rd variable to the top of the stack
istore 3 ; save the top-most stack item to variable number 3
Ok, so that is how calculations are done and how variables are stored - but what about
the byte code itself? The examples above have been byte code in symbolic form.
Each of the instructions above like 'ldc_int', 'iadd' etc. has a corresponding number.
As the name implies this number is a byte so the VM can only have 256 different instructions
in its instruction set. As you shall see this is more than enough so there is no need
to worry!! The assembling from byte code in symbolic form to byte code in binary form
is fairly simple - we simply convert the symbolic tokens to their byte number and
save that to a binary file. The arguments to the instructions are saved in a similar
way (but more on this in the next chapter on code generation).
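Just to give you an idea of how simple that assembling step is, here is a sketch in C that turns a symbolic program into a stream of bytes. The opcode numbers in the table are made up (the real ones come in the next chapter), and for simplicity the argument is written as a single byte as well:

#include <stdio.h>
#include <string.h>

/* Assumed opcode numbers - the real table is given in the chapter on code generation */
static int opcode(const char *mnemonic)
{
    if (strcmp(mnemonic, "ldc_int") == 0) return 0x10;
    if (strcmp(mnemonic, "iadd")    == 0) return 0x20;
    if (strcmp(mnemonic, "imul")    == 0) return 0x21;
    if (strcmp(mnemonic, "isub")    == 0) return 0x22;
    return -1;
}

int main(void)
{
    /* '(5 + 8)*5 - 4' in symbolic form: a mnemonic plus an optional argument */
    const char *prog[] = { "ldc_int 5", "ldc_int 8", "iadd",
                           "ldc_int 5", "imul", "ldc_int 4", "isub" };
    size_t i;
    FILE *out = fopen("prog.bin", "wb");
    if (out == NULL) return 1;

    for (i = 0; i < sizeof(prog) / sizeof(prog[0]); i++) {
        char mnemonic[32];
        int arg = 0;
        int fields = sscanf(prog[i], "%31s %d", mnemonic, &arg);
        unsigned char byte = (unsigned char)opcode(mnemonic);
        fwrite(&byte, 1, 1, out);                   /* write the opcode byte       */
        if (fields == 2) {
            unsigned char argbyte = (unsigned char)arg;
            fwrite(&argbyte, 1, 1, out);            /* write the argument after it */
        }
    }
    fclose(out);
    return 0;
}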
Eventually the code is represented as a stream of bytes. In our VM I will call the
space where the actual byte code lies the 'CS' (code segment). The VM has a pointer into
this space called the 'PC' (Program Counter), which points to the next instruction to
be interpreted. The program progresses as described in the section on interpreters
- the VM reads the byte pointed at by the PC, possibly reads arguments to that byte code
instruction and advances the PC accordingly. Then, based on the byte code instruction
read, it might update the stack.
The stack also has a few pointers associated with it - namely the 'SP' (Stack Pointer),
which always points to the top of the stack, and the 'BSP' (Base Stack Pointer), which
points to the base of the stack. The BSP is used in function calls and to access
variables. Remember that I said the bottom positions of the stack were used for variables,
so to get variable 2 we simply access the stack at 'BSP + 2'.
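To see how the BSP is used in practice, here is a small sketch (again with made-up details - the real VM comes in chapter 11) of how 'iload' and 'istore' could be handled when the stack is just an array of ints and SP and BSP are indices into it:

#include <stdio.h>

static int stack[256];
static int sp  = 0;     /* SP:  index of the next free stack slot */
static int bsp = 0;     /* BSP: index of the base of the stack    */

static void iload(int n)  { stack[sp++] = stack[bsp + n]; }   /* push variable n onto the stack */
static void istore(int n) { stack[bsp + n] = stack[--sp]; }   /* pop the top into variable n    */

int main(void)
{
    sp = bsp + 4;               /* reserve the bottom slots for variables 1..3   */
    stack[bsp + 3] = 42;        /* pretend variable 3 already holds the value 42 */

    iload(3);                   /* top of stack is now 42                        */
    stack[sp - 1] += 1;         /* add one to it                                 */
    istore(3);                  /* save it back into variable 3                  */

    printf("variable 3 = %d\n", stack[bsp + 3]);    /* prints 43 */
    return 0;
}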
Look at the drawing below to get a better idea of how the VM would represent the code
snippet for '(5 + 8)*5 - 4'.
Ok, enough about the VM for now, but know that we will return to it in MUCH greater detail in the next chapter on code generation. In that chapter you will find a complete list of the byte code instructions used and their effect on the stack, but for now the important thing is just that you understand the basics of how the VM operates and understand the difference between byte code in symbolic form and byte code in binary form. In symbolic form it looks like assembly code, but in binary form it looks like machine code and is 'unreadable' by humans (well, not strictly true, but it is not easily read) - however the difference between the two forms is very minor.
Finally it's possible for me to explain what we are trying to calculate in this chapter's
module! First of all we wish to calculate the number of variables used in the different
programs and functions and map each variable to a number which will be its address
in the stack. If you look in the 'tree.h' file and find the FUNCTION structure you
will see that it has a field called 'localslimit'. After this module it must contain
the number of variables used in that function. A similar field is found in the PROGRAM
structure.
Now find the structure DECL (for 'variable declaration') in 'tree.h'. Notice that
for all kinds of declarations it contains a field called 'offset'. After this module
this field must contain the offset relative to the BSP where the variable is found.
When these fields are all filled out we have completed our first goal and have calculated
what we need to know to create the correct byte code for variable access in the next
chapter.
The second thing we will calculate today is a mapping from labels to
addresses. I haven't mentioned labels yet but obviously some sort of labeling scheme
combined with certain jump instructions is needed to create linear assembly code for
things like conditional statements (for example it is needed to program the 'if-then-else'
construct). Let's look at an example. What follows is the byte code (in symbolic form)
for the 'is equal' construct - that is, expressions such as (a == b). This is a Boolean
operation and based on whether or not the two operands are equal the VM must place
a 0 (for false) or 1 (for true) on top of the stack:
iload_1        ; Load the first argument (variable 1)
iload_2        ; Load the second argument (variable 2)
if_icmpeq true ; Pop two stack items and compare them. If equal jump to label 'true'
iconst_0       ; If we get here they were NOT equal. Place 0 (false) on stack.
goto stop      ; Unconditional jump to label 'stop'
true:
iconst_1       ; If we get here they were equal. Place 1 (true) on stack.
stop:
Look through the assembly above and notice that no matter how you set the values of
variables 1 and 2, the result will be that either a 0 or a 1 is placed on the stack,
and the PC will in both cases end up at the label 'stop'.
While the block above is easy to read, symbolic labels are not enough when the code is represented
as binary byte code. In binary form the arguments to the jump instructions are addresses into
the CS - that is, the address the PC (Program Counter) should be moved to. Thus all
labels must be unique and eventually mapped to an offset into the block of memory
we use as CS. For now it is enough that we translate the labels into numbers unique
to the function or program in which they appear. Later when we link the
different functions and programs into the same CS we will change the labels into absolute
addresses in the CS.
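As a tiny illustration of that last step, the linker could simply keep a table mapping label numbers to the CS offsets where the labels ended up, and then patch every jump argument through that table. The numbers below are invented - this is not the actual linker code:

#include <stdio.h>

#define MAXLABELS 256

int main(void)
{
    int labeladdress[MAXLABELS];     /* label number -> offset into the CS */

    /* pretend that while emitting byte code we saw label 0 ('true') at CS offset 9
       and label 1 ('stop') at CS offset 11                                         */
    labeladdress[0] = 9;
    labeladdress[1] = 11;

    int jumparg = 0;                 /* a jump whose argument is still a label number     */
    jumparg = labeladdress[jumparg]; /* ...patched to an absolute CS address at link time */

    printf("patched jump target: CS offset %d\n", jumparg);
    return 0;
}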
Let's address the two issues one at a time:
It should come as no surprise by now that we do the resource calculation by making
a recursive run through the AST. So we start at the top-most node which is the SCRIPTCOLLECTION
node. Eventually we will reach either a function or a program. The offset calculation
is almost identical in both cases - except for the fact that programs cannot have
any formals (parameters). I'll guide you through the resource calculation for a function
(the resFUNCTION() function).
Notice that at the top of the file 'resource.c' I declare a global variable called 'offset'
and another called 'localslimit'. Also notice the function 'nextoffset', which does
nothing more than increase the offset variable and return it. It also makes sure that the
'localslimit' variable is at all times set to the greatest offset handed out so far. The first
thing we do when we reach a FUNCTION node is to reset the global variables to 0. This
is because the offsets we want are local to the current function or program (each program
and function will operate on its own stack as you will see in a later chapter).
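In other words, something along these lines (check 'resource.c' for the exact version):

int offset;        /* offset of the most recently numbered variable */
int localslimit;   /* the greatest offset handed out so far         */

int nextoffset(void)
{
    offset++;
    if (offset > localslimit)
        localslimit = offset;
    return offset;
}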
Now we recurse on any formals the function might have. This is the first time we might
reach the function 'resDECL()' which is the most important function for calculating
offsets.
void resDECL(DECL *d) {
  switch(d->kind){
    case formalK:
      d->val.formalD.offset = nextoffset();
      break;
    case variableK:
      /* at this point no difference because of weeding */
      d->val.variableD.offset = nextoffset();
      if(d->val.variableD.initialization != NULL)
        resEXP(d->val.variableD.initialization);
      break;
    case simplevarK:
      d->val.simplevarD.offset = nextoffset();
      if(d->val.simplevarD.initialization != NULL)
        resEXP(d->val.simplevarD.initialization);
      break;
  }
  if(d->next != NULL)
    resDECL(d->next);
}
Important or not - as you can see it is a very, very simple function. I think now is
a good time to re-discuss the difference between the three declaration kinds 'formalK',
'variableK' and 'simplevarK'.
The variables of type 'formalK' are formals (or arguments if you prefer)
to a function, and they are special in that each declaration of a formal can only have one identifier
(a formal declaration cannot be 'int a,b,c') and cannot have any initialization
(we do not support default values for function parameters). Formals are the simplest
possible variable declaration.
The variables of type 'variableK' are the normal kind of variable declarations
you find in the body of a function or program. They can be a list of declarations,
have initializations etc. An example could be 'int a,b,c = 0' which is a single declaration
of type 'variableK' which declares 3 different variables and initializes all of them
to 0.
The variables of type 'simplevarK' are variable declarations which can
have initializations - but there can only be one identifier. I.e. you
cannot declare more than one variable at a time using 'simplevarK' declarations. The
'simplevarK' declaration is used in the FORINIT node - that is, the initialization
of a for-loop. Here ',' is used to separate the different parts of the for-loop
so it wouldn't make sense to allow declarations of type 'variableK'. So 'simplevarK'
is used for things like: 'for (int a=0, a < 8, a = a+1)'.
As you can see, the two cases 'variableK' and 'simplevarK' are treated exactly the
same in the code above. So it would seem that we would miss any additional variables
besides the first one in a 'variableK' declaration, because we do not recurse on the
'next' pointer in the IDENTIFIER node. But remember the weeding phase - in that phase
we converted every declaration of type 'variableK' which had more than one identifier
into a list of 'variableK' declarations with only one identifier each. I.e. we converted 'int
a,b,c=0' into 'int a=0; int b=0; int c=0'.
Ok, but in any case what the function does is to pull the next offset number from
the offset-automaton and assign that number to the variable :)
Not much else is happening in the code for offset calculation - follow the recursion
in the AST yourself.
There is only one more subtle point to the calculation: Look at the 'scopeK' case
in the function 'resSTM()':
case scopeK:
  /* Notice how we save offset to be able to go back to it */
  baseoffset = offset;
  resSTM(s->val.scopeS.stm);
  offset = baseoffset;
  break;
We save the offset value before we enter the scope and set it back after we leave - why do we do this? Actually this is not strictly necessary, but it is an optimization which saves us some space on the stack. Remember that we can only see variables in our own scope and in enclosing scopes. Thus, when the PC leaves a scope, all variables declared inside that scope can never be reached again and we might as well overwrite their values by reusing their addresses in the stack. Look at this example to get the idea:
int var1 = 1; // offset 1
int var2 = 2; // offset 2
while (var1 < 8) {
  int var3 = 3; // offset 3
  int var4 = 4; // offset 4
  var1 = var1 + 1;
}
int var5 = 5; // offset 3
When we reach the declaration of 'var5' we are done using the variables 'var3' and 'var4', and we might as well reuse their addresses for new variables. If we had not saved the offset value before entering the scope of the while-loop and set it back afterwards, the variable 'var5' would have gotten the offset 5 even though we would never again use the addresses 3 and 4.
Calculating label addresses is even easier than calculating the variable offsets because here there are NO special cases. Like with the offsets we have a function to get the next label and like with offsets we reset the label count every time we reach a function or program. Then we recurse through the body of the function or program and insert label values where needed.
case equalsK:
  e->val.equalsE.truelabel = nextlabel();
  e->val.equalsE.stoplabel = nextlabel();
  resEXP(e->val.equalsE.left);
  resEXP(e->val.equalsE.right);
  break;
We simply pull two labels from the label-automaton and recurse on the expressions (which might contain labels of their own). Every other place where labels are needed works the same way. Look through the code to get the idea.
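The label 'automaton' is even simpler than the offset one - presumably just a counter along these lines (again, see 'resource.c' for the actual version):

int label;         /* number of the most recently handed out label */

int nextlabel(void)
{
    label++;
    return label;
}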
We finally started working on the back-end of our compiler. This particular chapter
had a lot of easy code, but nevertheless it was quite a challenge for me to write
because there were so many places where I felt I needed to explain stuff which belongs
more to the next two chapters. For example, I touched on the subject of code generation
a little in my description of the VM, which also had to include some stuff I had planned
to push off until chapter 11 on the VM. The problem is that the three topics are so closely
related that I found myself wondering how I learned this stuff myself - it seems that
to understand one part you also have to understand the other two ;)
I hope the result is not too messy and that you can make head or tail of my explanation.
If there is something you find very unclear I would like to hear about it, because
it is possible I skipped the explanation of something which seems obvious to me -
but makes part of this text look like it was written in an alien language to you guys. Stay tuned
for the next chapter which should be out soon (and which will be quite short because
we covered part of it here already).
Keep on codin'
Kasper Fauerby
www.peroxide.dk