Lexical Analysis
Lexical analysis, also known as tokenization or scanning, is the first phase of compilation. The lexer reads the source code as a sequence of characters and converts it into a sequence of tokens, the basic building blocks of the programming language.
Tokens fall into a handful of categories: keywords (e.g., if, while), identifiers (e.g., variable and function names), literals (e.g., numbers, strings), operators (e.g., +, -, *), and delimiters (e.g., {, }, ;).
For example, given this C++ statement:
int x = 5;
The lexer would generate the following tokens:
- Keyword: int
- Identifier: x
- Operator: =
- Literal: 5
- Delimiter: ;
Lexical analyzers are usually implemented using finite automata, abstract machines that recognize patterns in the input text. Finite automata come in two kinds: deterministic finite automata (DFA) and nondeterministic finite automata (NFA). Either can serve as the basis of a lexer, but DFAs are more efficient in practice because each (state, input character) pair determines exactly one next state, so the machine never needs to backtrack.
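To make the DFA idea concrete, here is a minimal sketch of a table-driven DFA that accepts C-style identifiers. The two-state layout, the character classes, and the names TRANSITIONS, char_class, and is_identifier are illustrative assumptions for this article, not taken from any particular lexer generator:

import string

# State 0 is the start state; state 1 is the only accepting state.
TRANSITIONS = {
    (0, 'letter'): 1,   # first character must be a letter or underscore
    (1, 'letter'): 1,   # later characters may be letters or underscores
    (1, 'digit'):  1,   # digits are allowed after the first character
}

def char_class(ch):
    if ch in string.ascii_letters or ch == '_':
        return 'letter'
    if ch in string.digits:
        return 'digit'
    return 'other'

def is_identifier(text):
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:   # no transition defined: reject
            return False
    return state == 1       # accept only if we end in the accepting state

print(is_identifier('x1'))  # True
print(is_identifier('1x'))  # False

Because every (state, input) pair maps to at most one next state, the machine processes each character exactly once, which is precisely the property that makes DFAs attractive for scanning.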
In practice, simple lexers are often written with regular expressions rather than hand-coded automata. Here is a very basic regex-based lexer for a C-like language, written in Python:
import re

# Token specifications, tried in order. KEYWORD must come before IDENTIFIER
# so that reserved words are not scanned as ordinary identifiers.
TOKENS = [
    ('KEYWORD',    r'\b(int|float|if|else|while|return)\b'),
    ('IDENTIFIER', r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('NUMBER',     r'\d+(\.\d+)?'),
    ('OPERATOR',   r'[+\-*/=<>!&|]+'),
    ('WHITESPACE', r'\s+'),
    ('DELIMITER',  r'[;(),{}]'),
]
def tokenize(code):
    # Compile each pattern once instead of on every loop iteration.
    compiled = [(name, re.compile(pattern)) for name, pattern in TOKENS]
    tokens = []
    pos = 0
    while pos < len(code):
        matched = False
        for token_type, regex in compiled:
            m = regex.match(code, pos)  # anchored match starting at pos
            if m:
                text = m.group(0)
                if token_type != 'WHITESPACE':  # whitespace is discarded
                    tokens.append((token_type, text))
                pos = m.end()
                matched = True
                break
        if not matched:
            raise ValueError(f"Unexpected character {code[pos]!r} at position {pos}")
    return tokens
code = '''
int main() {
    int x = 42;
    return x;
}
'''

tokens = tokenize(code)
print(tokens)
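Running the script prints the token stream for the snippet above; it begins with ('KEYWORD', 'int'), ('IDENTIFIER', 'main'), ('DELIMITER', '('), ('DELIMITER', ')'), and ('DELIMITER', '{'). Note that no whitespace tokens appear in the output, because the lexer matches them and then throws them away.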