CS160: Project 4 - Type Checking and Semantic Analysis (30% of project score)

Project Goals


The goals of this project are:

  • to analyze semantic properties of a CSimple program
  • to perform type checking

Administrative Information


This is an individual project.

The project is due on Friday, December 1, 2023, 23:59:59 PST (extended).

Project Introduction


For this project, we want you to analyze semantic properties of programs and to perform type checking. At this point, you have perfected your scanner and parser and built your abstract syntax tree (AST). That is, your compiler should be able to scan in an input file, check for syntax errors, and build an AST data structure. By looking at the visitor class in ast2dot.cpp, you should understand how to traverse the AST and print out the nodes of the tree.

Now, it is time to find some more errors and to perform type checking! There are two major hurdles in this project: understanding what you need to check for and understanding how to use a symbol table. We provide the code for the symbol table; you just have to understand how to use it in your analysis.

For this project, you need to write the typecheck class in typecheck.cpp. Like ast2dot, typecheck is a visitor class that traverses your abstract syntax tree (AST). The difference is that typecheck does not print out nodes. Instead, it performs a number of semantic and type checks, as discussed below.

In typecheck.cpp, you will need to implement a visit*() function for ALL AST classes as well as the primitive and symbol table classes. Look closely at the ast2dot.cpp file for reference. The skeleton class that we provide will also help you get started more quickly.
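To illustrate how a checking visitor differs from a printing one, here is a minimal, self-contained sketch. It does not use the project's generated AST classes: the node types (IntLit, BoolLit, Plus), the Ty enum, and the MiniTypecheck class are all hypothetical stand-ins. The field m_basetype only mimics the type slot of the Attribute struct, and instead of exiting, the sketch records an error code in a member.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical miniature AST; the real node classes are generated from ast.cdef.
enum class Ty { Undefined, Integer, Boolean };

struct IntLit;
struct BoolLit;
struct Plus;

// One visit function per node class, as in ast2dot.cpp.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visitIntLit(IntLit* p) = 0;
    virtual void visitBoolLit(BoolLit* p) = 0;
    virtual void visitPlus(Plus* p) = 0;
};

struct Node {
    Ty m_basetype = Ty::Undefined;  // stands in for the Attribute's type field
    virtual ~Node() = default;
    virtual void accept(Visitor* v) = 0;  // calls back the matching visit*()
};

struct IntLit : Node {
    void accept(Visitor* v) override { v->visitIntLit(this); }
};

struct BoolLit : Node {
    void accept(Visitor* v) override { v->visitBoolLit(this); }
};

struct Plus : Node {
    std::unique_ptr<Node> lhs, rhs;
    Plus(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : lhs(std::move(l)), rhs(std::move(r)) {
        assert(lhs && rhs);  // both operands must be present
    }
    void accept(Visitor* v) override { v->visitPlus(this); }
};

// Instead of printing, each visit function computes and stores a type.
struct MiniTypecheck : Visitor {
    int error_code = 0;  // 0 means no error; only the first error is kept

    void visitIntLit(IntLit* p) override { p->m_basetype = Ty::Integer; }
    void visitBoolLit(BoolLit* p) override { p->m_basetype = Ty::Boolean; }

    void visitPlus(Plus* p) override {
        // Visit the children first so that their types are available.
        p->lhs->accept(this);
        p->rhs->accept(this);
        if (p->lhs->m_basetype != Ty::Integer ||
            p->rhs->m_basetype != Ty::Integer) {
            if (error_code == 0) error_code = 17;  // expression type error
            return;
        }
        p->m_basetype = Ty::Integer;
    }
};
```

In this sketch, calling accept on a Plus node built from two IntLit children leaves error_code at 0 and sets the node's type to Integer, while mixing an IntLit with a BoolLit records error code 17. The real skeleton reports errors differently, so treat this only as a picture of the traversal order.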

Tour of the Code


You will extend the compiler that you wrote for the previous Project 3. More precisely, you will develop a new class that will perform type checking. To get you started with the typecheck.cpp visitor, you should download the archive with the project files here. This archive contains the skeleton for the typecheck class (in the file typecheck.cpp). Note that the archive also contains the files from the previous project. Some of these classes have slightly changed, so please use this latest version (and copy over your grammar and lexer code). For example, we realized that we needed an explicit class for the NULL pointer (you can see that we added it to ast.cdef). So, please make sure that your code takes these small changes into account.

For this project, you need to understand the symbol table in more detail (which is implemented in symtab.hpp and symtab.cpp). The symbol table is actually composed of four classes: SymName, Symbol, SymScope, and SymTab. Below, we highlight important aspects of each class. You will need to go through the entire files to really understand what is going on here.

  • SymName: SymName is a data structure that stores two things: the symbol name (actual, literal spelling of the ID) and a pointer to a symbol object for that name. You already know this class from Project 3, where you used it to store the symbol names in the AST.
  • Symbol: Symbol is a data structure that simply defines the type of each object. If you look through the AST classes, at no point is a symbol created. Why? The answer is that you did not have to define (and should not have defined) the types when building the AST. That is, you should create symbols when you typecheck, not when you parse. This implies that you will have to create symbols as your typecheck visitor traverses the AST. Note that the actual type is stored in the member m_basetype of a Symbol object.
  • SymScope: SymScope is a data structure to help you in checking the scope of each object. You do not actually have to create a SymScope object in your visitor class. SymScope is encapsulated by the SymTab class. You do have to decide when to open or close scopes, though!
  • SymTab: SymTab is the actual symbol table. Check the interface that this class exposes to see how it can be used, in particular, the insert* and lookup functions.
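To make the idea of opening and closing scopes concrete, here is a minimal sketch of a scoped symbol table. It is NOT the SymTab from symtab.hpp: the class MiniSymTab, the enum MiniType, and the member names open_scope, close_scope, insert, and lookup are hypothetical and chosen for illustration only. Consult symtab.hpp for the interface you actually have to use.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the base types stored in a Symbol.
enum class MiniType { Integer, Boolean, Char, String };

class MiniSymTab {
public:
    MiniSymTab() { open_scope(); }  // the first scope is the global (file) scope

    // Entering a procedure or nested block pushes a fresh scope.
    void open_scope() { scopes_.emplace_back(); }

    // Leaving it discards every symbol declared inside.
    void close_scope() {
        assert(scopes_.size() > 1 && "cannot close the global scope");
        scopes_.pop_back();
    }

    // Returns false if the name already exists in the innermost scope,
    // which is the situation a duplicate-declaration check looks for.
    bool insert(const std::string& name, MiniType type) {
        return scopes_.back().emplace(name, type).second;
    }

    // Searches from the innermost scope outward; nullptr means undefined.
    const MiniType* lookup(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return &found->second;
        }
        return nullptr;
    }

private:
    std::vector<std::map<std::string, MiniType>> scopes_;
};
```

With this sketch, inserting x twice in the same scope fails the second time, but inserting x again after open_scope() succeeds and shadows the outer x until close_scope() is called. A failed lookup corresponds to an undefined variable or procedure.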

In addition to the symbol table, there is the Attribute class. It is a struct that stores management information for AST nodes. In particular, it stores the line where the corresponding grammar symbol appears in the source file, the scope of the current symbol, and the type of the subtree. You have to manipulate the scope (m_scope) and the type (m_basetype) when checking the program (i.e., when walking the AST with your typecheck visitor).

Steps to Solve the Challenge


  1. The idea is that your typecheck visitor calls accept on an AST node. The AST node (object) calls back one of the visit*(this) functions that the typechecker implements. The typechecker function then does its work. At some point, it will have to call accept on the node's children. You start with the root node of the AST and then traverse it, performing the necessary checks that are listed below.

  2. You need to perform the following checks:
    • One main function:
      Only one procedure Main() can exist, and it must exist at file (global) scope. The name is case sensitive. If there are multiple main functions, exit with error code 2.

    • Main() has no arguments:
      Main() cannot have arguments. If it does, exit with error code 3.

    • Duplicate Procedures:
      A procedure ID can be used only once in the same scope. If this property is violated, exit with error code 4.

    • Duplicate Variables:
      A variable ID can be used only once in the same scope. If this property is violated, exit with error code 5.

    • Undefined Procedures:
      All procedures must be defined in the current or higher scope before they are used (before they can be called). If this property is violated, exit with error code 6.

    • Undefined Variables:
      All variables must be defined in the current or higher scope before they are used. If this property is violated, exit with error code 7.

    • Number of argument mismatch:
      When a procedure is called, the number of arguments passed in must match the number when the procedure was declared. If this property is violated, exit with error code 8.

    • Argument type error:
      When a procedure is called, the types of the arguments passed in must match the types of the arguments in the procedure declaration. The arguments cannot be strings. If this property is violated, exit with error code 9.

    • Return type error:
      Return statements must return a value of the same type as declared by the procedure. The return type cannot be of type string. If this property is violated, exit with error code 10.

    • Procedure call assignment type error:
      When the result of a procedure call is assigned to a variable, the return type of the procedure must match the type of that variable. If this property is violated, exit with error code 11.

    • If statement premise type error:
      The premise of an if statement must be of type Boolean. If this property is violated, exit with error code 12.

    • While loop requirement type error:
      The requirement of a while statement must be of type Boolean. If this property is violated, exit with error code 13.

    • String index type error:
      Strings (character arrays) can only be indexed by an expression that evaluates to an integer. If this property is violated, exit with error code 14.

    • No array variable error:
      Only string variables can be indexed. If this property is violated, exit with error code 15.

    • Incompatible assignment error:
      The types of the left-hand side and the right-hand side of an assignment must match. Note that one can only assign characters to individual string (character array) elements. The NULL pointer can be used as either a char pointer or an integer pointer. If this property is violated, exit with error code 16.

    • Expression type error:
      The types of expressions must match. The rules for expressions are the following: For arithmetic operations (+,-,*,/), both operands must be integer, and the resulting type is integer (see exceptions for pointers below). For logic operations (&&,||), both operands must be Boolean, and the resulting type is Boolean. For the comparison operations (<,<=,>,>=), both operands must be integer, and the result is Boolean. For (in)equality operators (==, !=), the operands can be both integer, both Boolean, both characters, both char pointers, or both integer pointers (the NULL pointer can be used whenever a char or an int pointer is valid). The absolute value operator (| |) can be applied only to integer expressions or string variables, and the result is of type integer. The not operation (!) can only be applied to Boolean expressions, and the result is Boolean. If this property is violated, exit with error code 17.

    • Pointer arithmetic:
      It is possible to add/subtract an integer to/from a pointer. No other arithmetic operations are possible on pointers. If this property is violated, exit with error code 18.

    • Usage of AddressOf:
      The AddressOf operator (&) can only be applied to integers, chars, and indexed strings (string[i]). If this property is violated, exit with error code 19.

    • Usage of Deref:
      The deref operator (^) can only be applied to integer pointers and char pointers. If this property is violated, exit with error code 20.
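The binary expression rules above (error codes 17 and 18) can be sketched as a single function that maps an operator and two operand types to a result type. Everything here is hypothetical: the enums Ty and Op, the constants, and binary_result are not part of the provided code. The sketch also makes three assumptions that the rules leave open: pointer arithmetic requires the pointer on the left and yields the same pointer type, any other arithmetic involving a pointer reports code 18 rather than 17, and NULL may be compared with NULL.

```cpp
#include <cassert>

// Hypothetical type and operator tags, not the ones from the project headers.
enum class Ty { Integer, Boolean, Char, String, CharPtr, IntPtr, NullPtr };
enum class Op { Plus, Minus, Times, Div, And, Or, Lt, Le, Gt, Ge, Eq, Ne };

constexpr int kOk = 0;
constexpr int kExprTypeErr = 17;
constexpr int kPtrArithErr = 18;

inline bool is_pointer(Ty t) {
    return t == Ty::CharPtr || t == Ty::IntPtr || t == Ty::NullPtr;
}

// Returns kOk and writes the result type to *out, or returns an error code.
int binary_result(Op op, Ty l, Ty r, Ty* out) {
    assert(out != nullptr);
    const bool any_ptr = is_pointer(l) || is_pointer(r);
    switch (op) {
    case Op::Plus:
    case Op::Minus:
        if (l == Ty::Integer && r == Ty::Integer) { *out = Ty::Integer; return kOk; }
        // Assumption: pointer on the left, integer on the right.
        if ((l == Ty::CharPtr || l == Ty::IntPtr) && r == Ty::Integer) {
            *out = l;
            return kOk;
        }
        return any_ptr ? kPtrArithErr : kExprTypeErr;
    case Op::Times:
    case Op::Div:
        if (l == Ty::Integer && r == Ty::Integer) { *out = Ty::Integer; return kOk; }
        return any_ptr ? kPtrArithErr : kExprTypeErr;
    case Op::And:
    case Op::Or:
        if (l == Ty::Boolean && r == Ty::Boolean) { *out = Ty::Boolean; return kOk; }
        return kExprTypeErr;
    case Op::Lt:
    case Op::Le:
    case Op::Gt:
    case Op::Ge:
        if (l == Ty::Integer && r == Ty::Integer) { *out = Ty::Boolean; return kOk; }
        return kExprTypeErr;
    case Op::Eq:
    case Op::Ne: {
        // Same type on both sides, except that whole strings are not comparable.
        const bool same_kind = (l == r) && l != Ty::String;
        // NULL is compatible with either pointer type.
        const bool null_vs_ptr = (l == Ty::NullPtr && is_pointer(r)) ||
                                 (r == Ty::NullPtr && is_pointer(l));
        if (same_kind || null_vs_ptr) { *out = Ty::Boolean; return kOk; }
        return kExprTypeErr;
    }
    }
    return kExprTypeErr;  // not reached for valid Op values
}
```

Keeping the rules in one table-like function means that each visit function for a binary operator only has to visit its children, call the helper, and either store the result type or report the returned code. Whether you structure your checker this way is up to you.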

  3. In your typechecker class, you will use the symbol table (SymTab* st) to store symbols (variable names, function names, ...) together with their types. That is, whenever a variable is declared, you can store the name and its type in the symbol table. The same can be done for function names, function arguments, and function return values. Whenever a symbol is about to be entered into the symbol table, you probably want to check whether it is already in there (to detect duplicate variables and functions). When a procedure is invoked, you can check whether the invocation conforms to the declaration (correct number of arguments and return value). Similarly, when variables are used in expressions, their previously declared types can be retrieved from the symbol table to check whether they are used in the correct context (e.g., only integer values can be used as operands for arithmetic expressions).
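As a rough illustration of checking an invocation against a declaration, the sketch below stores each procedure's argument types and return type and compares them with the types at a call site. The names ProcSig, ProcTable, and check_call are hypothetical. In the real project this information would live in the Symbol objects held by the symbol table, not in a separate map, and the sketch assumes an exact type match (it ignores the NULL pointer special case).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical type tags, not the ones from the project headers.
enum class Ty { Integer, Boolean, Char, String, CharPtr, IntPtr };

// What a declaration has to remember about a procedure.
struct ProcSig {
    std::vector<Ty> arg_types;
    Ty return_type;
};

using ProcTable = std::map<std::string, ProcSig>;

// Returns 0 if the call is well formed, otherwise the matching error code:
// 6 = undefined procedure, 8 = argument count mismatch, 9 = argument type error.
int check_call(const ProcTable& procs, const std::string& name,
               const std::vector<Ty>& actual) {
    assert(!name.empty());
    auto it = procs.find(name);
    if (it == procs.end()) return 6;

    const std::vector<Ty>& formal = it->second.arg_types;
    if (actual.size() != formal.size()) return 8;

    for (std::size_t i = 0; i < formal.size(); ++i) {
        // Strings may not be passed, and every other type must match exactly.
        if (actual[i] == Ty::String || actual[i] != formal[i]) return 9;
    }
    return 0;
}
```

The order of the tests matters for the error code that is reported: the sketch checks existence first, then the argument count, then the argument types. The stored return_type would be used in the same way for the return statement and call assignment checks.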

  4. As an example, consider the code fragment below, which shows one possible way to implement the check for duplicate variable declarations:
     0:  // add symbol table information for all the declarations following
     1:  void add_decl_symbol(DeclImpl *p)
     2:  {
     3:    list<SymName_ptr>::iterator iter;
     4:    char *name; Symbol *s;  
     5:    
     6:    for (iter = p->m_symname_list->begin(); iter != p->m_symname_list->end(); ++iter) {
     7:      name = strdup((*iter)->spelling());
     8:      s = new Symbol();
     9:      s->m_basetype = p->m_type->m_attribute.m_basetype;
    10:
    11:      if (! m_st->insert(name, s))
    12:        this->t_error(dup_var_name, p->m_attribute);
    13:    }
    14:  }
    15:
    16:  void visitDeclImpl(DeclImpl * p)
    17:  {
    18:    ...
    19:    add_decl_symbol(p);
    20:  }
    	
    The function visitDeclImpl(DeclImpl * p) will be invoked when your visitor calls accept on a variable declaration AST node. At some point, this function calls add_decl_symbol() (Line 19). As you can see, add_decl_symbol() iterates over the list of variables that are declared (Line 6). For each variable, it first extracts its name (Line 7) and creates a Symbol object (Line 8). Its type is set to the type that this variable declaration block declares (Line 9). Then, the new variable, together with its type, is inserted into the symbol table (Line 11). Note that this operation also checks whether the variable name is already in the symbol table. If the symbol is indeed present, the insert call will return false, and an appropriate error should be raised (Line 12). Make sure that you understand how variables of the same name can be legally declared in different scopes, though.

What Your Compiler Has to Do!


  1. Your compiler must successfully parse any valid input file.
  2. Your compiler must generate the correct AST.
  3. Your compiler must check the properties listed above. When a certain program property is violated, an appropriate error must be thrown (please use the appropriate error code for each type error to help us with automated grading). Correct programs must be accepted.

Deliverables


As with the previous project, we are using Gradescope (and its auto-grader feature) to grade this assignment and your submissions.

  1. Once you are done with your scanner/parser, go to the third assignment and submit your code.
  2. For this project, please submit your "lexer.l", "parser.ypp", and "typecheck.cpp" files. We supply the rest and build your project.
  3. We do not show you the test cases and the expected output, but you should get some feedback about the types of tests that your submission passes and where it fails.
  4. You can make a new submission once every hour. Make sure you thoroughly test your program locally, and don't (ab)use the auto-grader as a test harness.