Syntax analysis #
Syntax analysis is the process of checking the input token sequence’s adherence to the programming language’s grammar and constructing a parse tree (or an abstract syntax tree) representing the program’s hierarchical structure. Parsers can be classified as top-down or bottom-up, with recursive descent parsers being a common top-down approach.
Syntax analysis, also known as parsing, is the second phase of the compilation process. Once the lexer has converted the source code into a sequence of tokens, the parser checks whether these tokens follow the grammatical rules of the programming language. If the token sequence adheres to the language’s grammar, the parser constructs a parse tree, which represents the hierarchical structure of the program.
The parser usually employs a formal grammar to define the language’s syntax. Context-free grammars (CFGs), represented using Backus-Naur Form (BNF) or Extended Backus-Naur Form (EBNF), are often used for this purpose. The grammar consists of production rules that describe the structure of valid statements or expressions.
Here’s a simple EBNF grammar for a basic arithmetic expression:
expression ::= term (('+' | '-') term)*
term ::= factor (('*' | '/') factor)*
factor ::= NUMBER | '(' expression ')'
This grammar describes an arithmetic expression as one or more terms separated by addition or subtraction operators, a term as one or more factors separated by multiplication or division operators, and a factor as either a number or an expression enclosed in parentheses. Because factors nest inside terms, which in turn nest inside expressions, the grammar itself encodes operator precedence: multiplication and division bind more tightly than addition and subtraction.
There are two primary types of parsers: top-down and bottom-up. Top-down parsers, such as recursive descent parsers or LL parsers, start with the start symbol (e.g., expression) and attempt to derive the input token sequence by recursively applying grammar rules. Bottom-up parsers, such as shift-reduce parsers or LR parsers, start with the input token sequence and attempt to reduce it to the start symbol using grammar rules.
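To make the difference concrete, here is a sketch of how each strategy might handle the input 3 + 4 against the grammar above (treating the EBNF repetition as the rule expression → term '+' term for readability):

```
Top-down (leftmost derivation, as a recursive descent parser expands rules):
  expression ⇒ term '+' term ⇒ factor '+' term ⇒ 3 '+' term
             ⇒ 3 '+' factor ⇒ 3 '+' 4

Bottom-up (reductions, as a shift-reduce parser collapses the input):
  3 '+' 4 → factor '+' 4 → term '+' 4 → term '+' factor
          → term '+' term → expression
```

The top-down trace works from the start symbol out to the tokens; the bottom-up trace works from the tokens back to the start symbol.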
Let’s create a simple recursive descent parser for the arithmetic expression grammar mentioned above. The parser will be written in Python and will consume a token sequence like the one produced by the lexer in the previous example; for convenience, the script below also includes a minimal tokenizer so it can run on its own.
Create a file called expr_parser.py and add the following code:
class ExprParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def parse(self):
        return self.expression()

    def expression(self):
        node = self.term()
        while self.current_token() in ('+', '-'):
            token = self.current_token()
            self.consume()
            node = (token, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.current_token() in ('*', '/'):
            token = self.current_token()
            self.consume()
            node = (token, node, self.factor())
        return node

    def factor(self):
        token = self.current_token()
        if token is None:
            raise ValueError("Unexpected end of input")
        if token.isdigit():
            self.consume()
            return ('NUMBER', int(token))
        elif token == '(':
            self.consume()
            node = self.expression()
            if self.current_token() != ')':
                raise ValueError(f"Expected ')' but found {self.current_token()}")
            self.consume()
            return node
        else:
            raise ValueError(f"Expected NUMBER or '(' but found {token}")

    def current_token(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self):
        self.pos += 1


if __name__ == "__main__":
    # Simple lexer producing one single-character token at a time
    # for the arithmetic expression grammar
    def tokenize(expr):
        tokens = []
        for c in expr:
            if c.isdigit() or c in '+-*/()':
                tokens.append(c)
        return tokens

    expr = "3*(5+2)-8/2"
    tokens = tokenize(expr)
    parser = ExprParser(tokens)
    parse_tree = parser.parse()
    print(parse_tree)
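Note that the simple tokenize helper emits one token per character, so a multi-digit number such as 42 would be split into two separate NUMBER tokens. A sketch of a tokenizer that groups digit runs into a single token (the function name and error handling are choices for this example, not part of the original lexer):

```python
def tokenize(expr):
    # Groups consecutive digits into one NUMBER token; skips whitespace;
    # rejects any character the grammar does not allow.
    tokens, i = [], 0
    while i < len(expr):
        c = expr[i]
        if c.isdigit():
            j = i
            while j < len(expr) and expr[j].isdigit():
                j += 1
            tokens.append(expr[i:j])
            i = j
        elif c in '+-*/()':
            tokens.append(c)
            i += 1
        elif c.isspace():
            i += 1
        else:
            raise ValueError(f"Unexpected character: {c}")
    return tokens

print(tokenize("12 * (3 + 45)"))  # ['12', '*', '(', '3', '+', '45', ')']
```

Because ExprParser checks tokens with token.isdigit(), multi-character number tokens like '45' work without any change to the parser itself.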
The parse_tree is an abstract syntax tree (AST) representing the input expression. The AST is a simplified, more abstract version of the parse tree, containing only the essential information needed for further compilation steps. For the sample expression 3*(5+2)-8/2, the generated AST would look like this:
(
'-',
(
'*',
('NUMBER', 3),
(
'+',
('NUMBER', 5),
('NUMBER', 2)
)
),
(
'/',
('NUMBER', 8),
('NUMBER', 2)
)
)
This tree represents the expression as follows:
'-'
/ \
'*' '/'
/ \ / \
3 '+' 8 2
/ \
5 2
The AST can be used for various purposes, such as generating intermediate code, performing optimizations, or generating machine code. In the case of interpreters, the AST can also be directly executed.
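Because each AST node is either a ('NUMBER', value) leaf or an (operator, left, right) tuple, direct execution can be sketched as a small tree-walking evaluator (the function name evaluate is a choice for this example):

```python
# A minimal tree-walking evaluator for the tuple-based AST
# produced by the ExprParser above.
def evaluate(node):
    if node[0] == 'NUMBER':      # leaf: ('NUMBER', value)
        return node[1]
    op, left, right = node       # interior: (operator, left_subtree, right_subtree)
    a, b = evaluate(left), evaluate(right)
    if op == '+':
        return a + b
    if op == '-':
        return a - b
    if op == '*':
        return a * b
    if op == '/':
        return a / b
    raise ValueError(f"Unknown operator: {op}")

# The AST for 3*(5+2)-8/2 shown earlier:
ast = ('-',
      ('*', ('NUMBER', 3), ('+', ('NUMBER', 5), ('NUMBER', 2))),
      ('/', ('NUMBER', 8), ('NUMBER', 2)))
print(evaluate(ast))  # 3*(5+2) - 8/2 = 17.0
```

This is essentially what a simple interpreter does: walk the tree and compute each subexpression bottom-up.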
To make the code more complete and reusable, you could integrate the lexer and parser into a single module or package, allowing them to work together seamlessly.
Remember that this example demonstrates a simple arithmetic expression parser, and real-world programming languages have much more complex grammars. To build a parser for a full-featured language, you could use parser generator tools like ANTLR or Bison, which can generate parsers from a given grammar definition.