Basics of Compiler Construction: Simple Lexical Analyser

Example

In this example I will show you how to make a basic lexer, written in Python, which creates the tokens for an integer variable declaration.

What does the lexical analyser do?

The purpose of a lexer (lexical analyser) is to scan the source code and break it up into words, each stored as a list item. It then turns each word into a type and value pair, such as ['INTEGER', '178'], which forms a token.

These tokens identify the basic building blocks of your language (datatypes, identifiers, operators, literals and so on). How you design the lexer therefore determines how the different items in your source code are recognised and interpreted by the later stages of the compiler.
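As a quick illustration of those two ideas, here is a minimal sketch of what splitting a line into words gives you and what a single token looks like. The variable names are only illustrative. Note that the semicolon stays attached to the last word, which is why the lexer has to handle it separately.

```python
# Splitting on whitespace gives one list item per word.
source_code = 'int result = 100;'
words = source_code.split()
print(words)  # ['int', 'result', '=', '100;']

# A token is a [type, value] pair, so it can be unpacked into two names.
token = ['INTEGER', '178']
token_type, token_value = token
print(token_type, token_value)  # INTEGER 178
```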


Example source code for this lexer:

int result = 100;

Code for the lexer in Python:

import re                                 # for matching words against regex patterns

tokens = []                               # for storing tokens
source_code = 'int result = 100;'.split() # turning source code into list of words

# Loop through each source code word
for word in source_code:
    
    # This will check if a token is a datatype declaration
    if word in ['str', 'int', 'bool']:
        tokens.append(['DATATYPE', word])
    
    # This will look for an identifier, which is any other word starting with a letter
    elif re.match("[a-zA-Z]", word):
        tokens.append(['IDENTIFIER', word])
    
    # This will look for an operator (a list is used so only whole operators match)
    elif word in ['*', '-', '/', '+', '%', '=']:
        tokens.append(['OPERATOR', word])
    
    # This will look for integer items and split off a trailing semicolon
    elif re.match("[0-9]", word):
        if word[-1] == ';':
            tokens.append(["INTEGER", word[:-1]])
            tokens.append(['END_STATEMENT', ';'])
        else: 
            tokens.append(["INTEGER", word])

print(tokens) # Outputs the token array

When running this code snippet the output should be the following:

[['DATATYPE', 'int'], ['IDENTIFIER', 'result'], ['OPERATOR', '='], ['INTEGER', '100'], ['END_STATEMENT', ';']]

As you can see, all we did was turn a piece of source code, the integer variable declaration, into a token stream of type and value pairs.
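The lexer above relies on split(), so it only works when the words are separated by spaces. A common alternative is to describe every token type with a regex and scan the string with re.finditer. The sketch below is one possible way to do that, not part of the original example: the names TOKEN_SPEC, MASTER_PATTERN and lex are hypothetical, and the patterns are assumptions chosen to match the token types used here.

```python
import re

# Hypothetical token specification: (token type, regex pattern) pairs.
# Order matters: DATATYPE is tried before IDENTIFIER.
TOKEN_SPEC = [
    ('DATATYPE', r'\b(?:str|int|bool)\b'),
    ('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*'),
    ('INTEGER', r'[0-9]+'),
    ('OPERATOR', r'[*\-/+%=]'),
    ('END_STATEMENT', r';'),
    ('SKIP', r'\s+'),  # whitespace, matched but not turned into a token
]

# Combine all patterns into one regex with a named group per token type.
MASTER_PATTERN = re.compile(
    '|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC)
)

def lex(source):
    """Return a list of [type, value] tokens for the given source string."""
    tokens = []
    for match in MASTER_PATTERN.finditer(source):
        kind = match.lastgroup  # name of the group that matched
        if kind != 'SKIP':
            tokens.append([kind, match.group()])
    return tokens

# Works even without spaces around the operator.
print(lex('int result=100;'))
```

Keep in mind that finditer silently skips characters that match none of the patterns, so a real lexer would also want to report those as errors.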


Let's break it down

  1. We begin by importing the regex library (re), because it is needed to check whether certain words match a certain regex pattern.

  2. We create an empty list called tokens. This will be used to store all of the tokens we create.

  3. We split our source code, which is a string, into a list of words, where every word in the string separated by a space is a list item. We then store those in a variable called source_code.

  4. We start looping through our source_code list word by word.

  5. We now perform our first check:

    if word in ['str', 'int', 'bool']: 
       tokens.append(['DATATYPE', word])

    What we check for here is a datatype, which tells us what type our variable will be.

  6. After that we perform more checks like the one above, identifying each word in our source code and creating a token for it. These tokens will then be passed on to the parser to create an Abstract Syntax Tree (AST).
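To give an idea of what the last step could look like, here is a rough sketch of a parser for this one kind of statement. It is only an assumption about how you might shape it: the function name parse_declaration and the layout of the AST dictionary are hypothetical, and a real parser would handle many more statement types.

```python
def parse_declaration(tokens):
    """Turn the tokens of an integer declaration into a small AST dictionary."""
    expected = ['DATATYPE', 'IDENTIFIER', 'OPERATOR', 'INTEGER', 'END_STATEMENT']
    kinds = [kind for kind, _ in tokens]
    if kinds != expected:
        raise SyntaxError(f'expected {expected}, got {kinds}')

    # Pull the values out of the [type, value] pairs in order.
    datatype, name, operator, value, _ = (value for _, value in tokens)
    if operator != '=':
        raise SyntaxError(f"expected '=', got {operator!r}")

    # Hypothetical AST node layout; here the value is finally cast to an int.
    return {
        'node': 'VariableDeclaration',
        'datatype': datatype,
        'name': name,
        'value': int(value),
    }

tokens = [['DATATYPE', 'int'], ['IDENTIFIER', 'result'], ['OPERATOR', '='],
          ['INTEGER', '100'], ['END_STATEMENT', ';']]
print(parse_declaration(tokens))
```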

If you want to interact with this code and play with it, here is a link to the code in an online compiler: https://repl.it/J9Hj/latest


