CS160: Project 5 - Code Generation (20% of project score)

Project Goals

The goals of this project are:

to generate x86 code from your CSimple programs
to complete the compiler so that it can produce something useful

Administrative Information

This is an individual project.

The project is due on Friday, December 15, 2023, 23:59:59 PST.

Project Introduction

The goal of this project is to generate x86 assembly code that "implements" the functionality of some code expressed in the CSimple programming language. Using gcc (or, more precisely, the gas assembler), you can then convert the generated x86 assembly code into its corresponding machine code representation (object file). This object file can then be linked into an executable program that you can actually run on your (Linux) machine - no need for an emulator or anything similar!

For this project, you need to write the class codegen. This class is a visitor class, similar to ast2dot and typecheck. The codegen class traverses your abstract syntax tree (AST) that you have created in project 3 and type-checked in project 4. While it traverses the AST, it produces x86 assembly code that implements the functionality of the different CSimple statements.

For this assignment, we will not use nested procedures and we will not declare any variables within nested scopes. That is, you can assume that there won't be any nested procedures when generating code. Moreover, you can assume that all variables are declared at the beginning of a procedure (recall that nested procedures, which would otherwise come before the variable declarations, are not used). This will make your life easier.

Tour of the Code

You will extend the compiler that you wrote for the previous Project 4. More precisely, you will develop a new class that will perform code generation. To get you started with the codegen visitor, you should download the archive with the project files here. This archive contains the skeleton for the codegen class (in the file codegen.cpp). Note that the archive also contains the files from the previous project. As before, there were some small changes to the code. For example, in your main procedure in main.cpp, there is now a line that invokes the codegen visitor, and the makefile was updated to take into account the new class. As a first step, please merge your existing code into this new skeleton code.

Steps to Solve the Challenge

Because you have to generate 32-bit x86 assembly code for this project, the first requirement is to make sure that you understand x86 assembly code. You will need to understand how to move values between memory and registers (using mov), perform arithmetic and logic computation (using add, and, ...), and implement control flow (using cmp, j* instructions, call, ret). Moreover, you need to familiarize yourself with the registers that the x86 processor offers (the general purpose registers, such as %eax, and the ones with special meaning, such as %esp). For a good overview, you can read the text on Wikibooks (check for IA-32). Other helpful resources are a more detailed PC Assembly book and the complete Intel instruction set reference (again, focus on IA-32). Of course, you can always check Google for "x86 assembly" and find many more useful documents.

An important note: Many of the documents that you will find online use the Intel syntax (first the destination operand, then the source operands). This is different from the assembly code that gas understands and that you will have to emit. Gas follows the AT&T syntax (sources first, destination operand last). Keep that in mind when reading the documents. A nice comparison between the two syntaxes can be found here.

To check that your assembly code is correct, and to produce an object file that can later be linked into an executable, you should use gas, the assembler that comes with gcc. Consider the following simple assembly program, which represents an empty procedure:

  .text

  .globl Main
  Main:
        ret

You can see that we declare a label called Main, which stands for the beginning of our Main procedure. Note that we use the assembler directive .globl to declare Main as a label that is visible to the outside. Also note that, for some compilers and platforms, you might need to add an underscore character before the name Main (i.e., _Main), both in the .globl directive and in the start label, to make the linker recognize your procedure.

When we have this (extremely simple) piece of code in a file called csimple.s, we can use the gas assembler to turn this into an object file. For that, do:

  gcc -c -m32 -o csimple.o csimple.s

The -c flag advises gcc to "compile or assemble the source files, but do not link. The output is in the form of an object file for each source file," as man gcc will tell you. The -m32 flag tells gcc to produce code for the 32-bit ABI (the Intel IA-32).

Now that we have an object file, how can we actually run our code? The easiest way is to write a little C program that acts as a wrapper and invokes our Main procedure. For example, you can write a simple C program start.c that contains:

  #include <stdio.h>

  void Main();  // inform the compiler that Main is an external function

  int main(int argc, char **argv) {
      Main();
      return 0;
  }

Now, we can compile the small start program and link it with the csimple object (csimple.o) to actually produce an executable that can run. For this, simply do:

  gcc -c -m32 -o start.o start.c
  gcc -m32 -o start start.o csimple.o

You can ignore the security warning. At this point, you have an executable that you can run and debug. Of course, not much happens at this point, since your function simply returns. Now, try the following: extend csimple.s so that it reads:

  .text

  .globl Main
  Main:
        movl  $10, %eax
        ret

and then extend start.c so that it reads:

  #include <stdio.h>

  int Main();  // note that Main now returns an integer!

  int main(int argc, char **argv) {
      printf("Main returned: %d\n", Main());
      return 0;
  }

Then rebuild both object files and link them. Since return values are passed via the %eax register on the x86, you will notice that your program prints out 10. This is great! Now we have a way to return an integer back to our C program that then prints out this value. Clearly, that will be very handy when debugging.

At this point, it is time to produce some actual code. You should start with integer expressions. That is, you should generate code for arithmetic expressions (those that use integer literals and the operations +, -, *, /, and unary minus).

One problem that you might encounter at this point is that many operations need arguments that are in registers. Since we do not want to play around with register allocation algorithms in this project, the best way is to emit stack-based code. That is, whenever you emit code that performs a computation, you will first move (push) the arguments onto the stack. The values are either taken from the activation record (in case of local variables) or they are immediate values (constants). Then, for an operation (e.g., an add), you will pop one operand from the stack into a register (say, %eax). Then, the other operand is popped into another register (say, %ebx). Keep in mind that the operand that was pushed last is popped first; for an add this does not matter, but for subtraction and division it does. At this point, you can emit code that performs the computation with the registers (e.g., addl %ebx, %eax). The result will be in %eax in this case (make sure to understand why!). At this point, just push the result back on the stack. When emitting stack-based code, your program will likely be inefficient, producing many "unnecessary" push and pop operations. However, it is easy to produce this code, and you do not need to worry about register allocation. The reason is that you will typically use only two registers for a single operation, and before and after, all involved operands (and temporary results) are on the stack.
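To make the stack-based scheme concrete, here is a minimal sketch of how a code generator could emit one binary integer operation. The helper name emitBinaryOp is made up for this illustration and is not part of the project skeleton. The sketch assumes that the left operand was pushed first and the right operand second, so it pops the right operand into %ebx and the left operand into %eax; that way subl and idivl compute "left op right". Note that %ebx is a callee-saved register under the C calling convention, so a procedure that uses it this way has to save and restore it.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the skeleton): returns AT&T assembly for
// one binary integer operation. Assumes both operands are on the stack,
// left operand pushed first, right operand pushed second.
std::string emitBinaryOp(char op) {
    std::ostringstream out;
    out << "  popl  %ebx\n";   // right operand (pushed last, popped first)
    out << "  popl  %eax\n";   // left operand
    switch (op) {
        case '+': out << "  addl  %ebx, %eax\n"; break;  // eax = eax + ebx
        case '-': out << "  subl  %ebx, %eax\n"; break;  // eax = eax - ebx
        case '*': out << "  imull %ebx, %eax\n"; break;  // eax = eax * ebx
        case '/':
            out << "  cltd\n";         // sign-extend eax into edx:eax
            out << "  idivl %ebx\n";   // quotient ends up in eax
            break;
        default:
            throw std::invalid_argument("unknown operator");
    }
    out << "  pushl %eax\n";   // the result becomes an operand again
    return out.str();
}
```

In a visitor, you would first visit both child expressions (which leaves their values on the stack) and then write the string returned by emitBinaryOp to your output file.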

Once you have finished your code generator for expressions, try and emit code that computes some complex integer expressions (e.g., (3-5)*(7+2)). You can always move the result into %eax, have your Main function return, and let the C program start print out the result. This is a great way to debug.
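As an illustration of this debugging approach, the following sketch generates a complete Main for the expression (3-5)*(7+2), whose value is -18. The Expr struct and the functions lit, bin, gen, genMain, and example are hypothetical stand-ins; in your compiler, the AST classes and the codegen visitor from the skeleton take their place.

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical mini-AST, used only for this illustration.
struct Expr {
    char op;                        // '+', '-', '*', or 'n' for a literal
    int value;                      // only meaningful when op == 'n'
    std::unique_ptr<Expr> lhs, rhs;
};

std::unique_ptr<Expr> lit(int v) {
    return std::unique_ptr<Expr>(new Expr{'n', v, nullptr, nullptr});
}

std::unique_ptr<Expr> bin(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    return std::unique_ptr<Expr>(new Expr{op, 0, std::move(l), std::move(r)});
}

// Post-order walk: emit code for both operands, then for the operator.
void gen(const Expr& e, std::ostream& out) {
    if (e.op == 'n') {
        out << "  pushl $" << e.value << "\n";
        return;
    }
    gen(*e.lhs, out);
    gen(*e.rhs, out);
    out << "  popl  %ebx\n"    // right operand
        << "  popl  %eax\n";   // left operand
    switch (e.op) {
        case '+': out << "  addl  %ebx, %eax\n"; break;
        case '-': out << "  subl  %ebx, %eax\n"; break;
        case '*': out << "  imull %ebx, %eax\n"; break;
        default: break;
    }
    out << "  pushl %eax\n";
}

// Wraps the expression code in a Main that returns the result in %eax.
std::string genMain(const Expr& e) {
    std::ostringstream out;
    out << "  .text\n  .globl Main\nMain:\n";
    out << "  pushl %ebx\n";   // %ebx is callee-saved, so preserve it
    gen(e, out);
    out << "  popl  %eax\n";   // final result is the return value
    out << "  popl  %ebx\n";
    out << "  ret\n";
    return out.str();
}

// Builds (3-5)*(7+2) and returns the assembly text for it.
std::string example() {
    return genMain(*bin('*', bin('-', lit(3), lit(5)), bin('+', lit(7), lit(2))));
}
```

If you write the string returned by example() to csimple.s and assemble and link it with the start.c wrapper that prints the return value of Main, you can compare the printed number against the value you expect.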

At this point, you need to work on procedures:

First, you need to decide how and where to store local variables and the remaining information that you need as part of an activation record. For this, you will probably need to introduce a function prologue that sets up the proper activation record. This function prologue will need to store the activation record pointer of the caller (%ebp) as well as other callee saved registers. Also, it needs to create space for the local variables. Finally, you need to set up the activation record pointer (preferably, %ebp) so that you can access the function's parameters and variables. Note that the return address is already on the top of the stack when your procedure is called (this is done automatically as part of the x86 call instruction).
A note on alignment for local variables and the stack pointer: the stack pointer must be aligned to the stack word size (which is 4 bytes). Moreover, you want to have multi-byte local variables (such as integers) properly aligned as well. Thus, we recommend that you align every variable on a multiple of 4 (that is, each local variable starts at an address that is divisible by 4). While this might waste some space between local variables (for example, when you have multiple character variables), it will make things easier (and avoid random crashes due to unaligned memory accesses). To make this happen, you will need to make small changes to the way the local variable offsets are computed in the symbol table (symtab files).
Once this is done, you need a way to clean up when your function is finished (a function epilogue). In there, you will want to restore the callee-saved registers, reset the stack pointer so that the space for the local variables is released, and restore the activation record pointer (%ebp) of your caller. Afterwards, the stack pointer is back where it was when your procedure was invoked. Also, make sure that the return value has been moved into %eax (which is where the caller expects it to be) if you have not done so already. Finally, you can use the ret instruction to return control to the caller (note that ret pops the value from the top of the stack and uses this as the return address).
We want your procedures to be callable from code generated by gcc. Thus, you have to respect the 32-bit x86 C calling conventions (or procedure linkage or calling contract) that gcc and other x86 compilers use. I found this document helpful.
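Once you make sure that your procedures can be called, it is time to call a procedure yourself. For this, you need to put the arguments on the stack (from right to left, in reverse order). Then, simply use the x86 call instruction. After the call returns, remove the arguments from the stack again (under the C calling convention, this is the caller's job).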
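The steps above could be sketched as follows. All function names (alignUp4, emitPrologue, emitEpilogue, emitCall) are invented for this illustration, and the frame layout is just one possible choice: parameters start at 8(%ebp), local variables live at negative offsets from %ebp, and %ebx is saved below the locals because the expression code uses it as a scratch register. The epilogue assumes that the return value is already in %eax and that no temporaries are left on the stack.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Rounds a size up to the next multiple of 4, so that every local variable
// (and the stack pointer) stays 4-byte aligned.
int alignUp4(int size) { return (size + 3) & ~3; }

// Hypothetical prologue: save the caller's %ebp, establish our own frame,
// reserve space for the locals, and save the callee-saved %ebx.
std::string emitPrologue(const std::string& name, int localBytes) {
    std::ostringstream out;
    out << "  .globl " << name << "\n"
        << name << ":\n"
        << "  pushl %ebp\n"
        << "  movl  %esp, %ebp\n"
        << "  subl  $" << alignUp4(localBytes) << ", %esp\n"
        << "  pushl %ebx\n";
    return out.str();
}

// Matching epilogue: undo the prologue in reverse order, then return.
std::string emitEpilogue() {
    return "  popl  %ebx\n"         // restore callee-saved register
           "  movl  %ebp, %esp\n"   // release the local variables
           "  popl  %ebp\n"         // restore the caller's frame pointer
           "  ret\n";
}

// Call sequence. The arguments are assumed to be available as operand
// strings (for example "$3" or "-4(%ebp)"); they are pushed right to left,
// and the caller removes them again after the call.
std::string emitCall(const std::string& callee, const std::vector<std::string>& args) {
    std::ostringstream out;
    for (auto it = args.rbegin(); it != args.rend(); ++it)
        out << "  pushl " << *it << "\n";
    out << "  call  " << callee << "\n";
    if (!args.empty())
        out << "  addl  $" << 4 * args.size() << ", %esp\n";
    out << "  pushl %eax\n";   // the return value becomes an operand
    return out.str();
}
```

For example, emitPrologue("Main", 6) reserves 8 bytes, because 6 bytes of local variables are rounded up to the next multiple of 4.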

Once you have the integer expressions and the function calls done, it is time to focus on the Boolean expressions. When Booleans work, you are ready to implement control flow (if statements and while loops).
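One possible way to approach Booleans and control flow is sketched below. It assumes that a Boolean is represented as the integer 1 (true) or 0 (false) on the stack, and that the code for the condition and for the body has already been generated as text. The names newLabel, emitCompare, emitIf, and emitWhile are hypothetical, as is the .L label naming scheme.

```cpp
#include <sstream>
#include <string>

// Hypothetical label generator: returns .L0, .L1, ... so that every jump
// target in the output is unique. The caller owns the counter.
std::string newLabel(int& counter) {
    return ".L" + std::to_string(counter++);
}

// Compares two operands from the stack and pushes 1 or 0.
// `setcc` selects the condition, for example "setl" for < or "sete" for ==.
std::string emitCompare(const std::string& setcc) {
    std::ostringstream out;
    out << "  popl  %ebx\n"            // right operand
        << "  popl  %eax\n"            // left operand
        << "  cmpl  %ebx, %eax\n"      // set flags according to eax - ebx
        << "  " << setcc << "  %al\n"  // al = 1 if the condition holds, else 0
        << "  movzbl %al, %eax\n"      // widen the byte to 32 bits
        << "  pushl %eax\n";
    return out.str();
}

// if (cond) then-part: skip the then-part when the condition is 0.
std::string emitIf(int& counter, const std::string& condCode, const std::string& thenCode) {
    std::string skip = newLabel(counter);
    std::ostringstream out;
    out << condCode
        << "  popl  %eax\n"
        << "  cmpl  $0, %eax\n"
        << "  je    " << skip << "\n"
        << thenCode
        << skip << ":\n";
    return out.str();
}

// while (cond) body: re-evaluate the condition before every iteration.
std::string emitWhile(int& counter, const std::string& condCode, const std::string& bodyCode) {
    std::string top = newLabel(counter);
    std::string done = newLabel(counter);
    std::ostringstream out;
    out << top << ":\n"
        << condCode
        << "  popl  %eax\n"
        << "  cmpl  $0, %eax\n"
        << "  je    " << done << "\n"
        << bodyCode
        << "  jmp   " << top << "\n"
        << done << ":\n";
    return out.str();
}
```

Keeping the label counter in one place (for instance, as a member of your visitor) is a simple way to guarantee that nested if statements and loops never reuse a label.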

Once you have implemented control flow, there are only the strings (character arrays) and pointers left that need to be taken care of. A few things need to be considered:

You need to be able to assign string literals to string variables. A string literal can be stored in your assembler output in the data section. That is, when you encounter a string literal, switch to the data section and store the string as a .ascii character string (don't forget the null character as a string terminator).
When you have string literals, we need to be able to assign them to a string variable. An assignment of a string literal to a string variable (or, for that matter, the assignment of a string variable x to another string variable y) has to be implemented as a string copy. That is, you have to emit code that copies the source string into the destination string. Before you can copy the string, you need to make sure that the destination string array is large enough for the source string. Otherwise, you might overflow the destination. For this, check, during code generation, that the static (declared) size of the destination string is equal to or larger than the size of the source string or the length of the string literal that is copied (don't forget that the string literal is terminated by an additional null character).
Once you can copy entire strings, you need to implement the functionality to access individual elements of the string. For this, emit code that, given the index into the string, computes the address of the corresponding character. Then, use the movb instruction to copy individual bytes (characters).
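The three string-related tasks could look roughly like this. The helper names (emitStringLiteral, literalFits, emitLoadChar) and the label passed in by the caller are made up for this sketch. It assumes that a literal contains no characters that need escaping, and that a string variable is stored in the activation record with element 0 at offset baseOffset from %ebp and the following elements at increasing addresses; your symbol table may lay things out differently.

```cpp
#include <sstream>
#include <string>

// Places a string literal in the data section under the given label and
// switches back to the text section. Assumes `text` needs no escaping
// (no quotes, backslashes, or newlines).
std::string emitStringLiteral(const std::string& label, const std::string& text) {
    std::ostringstream out;
    out << "  .data\n"
        << label << ":\n"
        << "  .ascii \"" << text << "\\0\"\n"   // explicit null terminator
        << "  .text\n";
    return out.str();
}

// Static size check done at code generation time: the destination array
// must hold every character of the literal plus the terminating null byte.
bool literalFits(int destSize, const std::string& literal) {
    return destSize >= static_cast<int>(literal.size()) + 1;
}

// Loads one character of a local string variable. Expects the index on top
// of the stack and leaves the character (zero-extended) on the stack.
std::string emitLoadChar(int baseOffset) {
    std::ostringstream out;
    out << "  popl  %ebx\n"                               // index
        << "  leal  " << baseOffset << "(%ebp), %eax\n"   // address of element 0
        << "  addl  %ebx, %eax\n"                         // address of element [index]
        << "  movb  (%eax), %al\n"                        // load a single byte
        << "  movzbl %al, %eax\n"                         // widen it to 32 bits
        << "  pushl %eax\n";
    return out.str();
}
```

For instance, literalFits(3, "hi") holds because "hi" needs two bytes plus the terminator, whereas literalFits(2, "hi") does not, and your compiler should report an error in that case instead of emitting the copy.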

Congratulations! At this point, your compiler is basically finished. You generate code that is assembled and linked into an executable that you can actually run!

What Your Compiler Has to Do!

Your compiler must successfully parse any valid input file.
Your compiler must generate the correct AST (but do not output a dot representation this time).
Your compiler must typecheck the program and terminate in case an error is detected.
Your compiler must output 32-bit x86 assembly code (IA-32 code) that corresponds to the CSimple input program, assuming that no nested procedures are used and all variables are declared at the beginning of the body of a procedure. This code must assemble into an object file without warnings or errors, using gcc/gas.

Deliverables

As with the previous project, we are using Gradescope (and its auto-grader feature) to grade this assignment and your submissions.

Once you are done with your code generator, go to the third assignment and submit your code.
For this project, please submit your "lexer.l", "parser.ypp", "typecheck.cpp", "codegen.cpp", "symtab.hpp", and "symtab.cpp" files. We supply the rest and build your project.
We do not show you the test cases and the expected output, but you should get some feedback about the types of tests that your submission passes and where it fails.
You can make a new submission once every hour. Make sure you thoroughly test your program locally, and don't (ab)use the auto-grader as a test harness.

Created by Christopher Kruegel (© 2008, using Apache Cocoon).