In this example I will show you how to make a basic lexer which will create the tokens for an integer variable declaration in Python.
The purpose of a lexer (lexical analyser) is to scan the source code and break it up into a list of words. It then takes each of these words and creates a type and value pair, which looks like this: ['INTEGER', '178']. That pair forms a token.
These tokens are created in order to identify the syntax of your language, so the whole point of the lexer is to define that syntax: it all depends on how you want to identify and interpret the different items in your source code.
Example source code for this lexer:
int result = 100;
Code for the lexer in Python:
import re  # for performing regex matches

tokens = []  # list for storing the tokens we create
source_code = 'int result = 100;'.split()  # turn the source code into a list of words

# Loop through each word in the source code
for word in source_code:
    # Check if the word is a datatype declaration
    if word in ['str', 'int', 'bool']:
        tokens.append(['DATATYPE', word])
    # Check for an identifier, which is just a word starting with a letter
    elif re.match("[a-zA-Z]", word):
        tokens.append(['IDENTIFIER', word])
    # Check for an operator
    elif word in '*-/+%=':
        tokens.append(['OPERATOR', word])
    # Check for an integer value
    elif re.match("[0-9]", word):
        # If the integer ends with a semicolon, split off the end-of-statement token
        if word[-1] == ';':
            tokens.append(['INTEGER', word[:-1]])
            tokens.append(['END_STATEMENT', ';'])
        else:
            tokens.append(['INTEGER', word])

print(tokens)  # Output the token list
When you run this code snippet, the output should be the following:
[['DATATYPE', 'int'], ['IDENTIFIER', 'result'], ['OPERATOR', '='], ['INTEGER', '100'], ['END_STATEMENT', ';']]
As you can see, all we did was turn a piece of source code, the integer variable declaration, into a token stream made up of type and value pairs.
Let's walk through the code. First, we create an empty list called tokens. This will be used to store all of the tokens we create.
We split our source code, which is a string, into a list of words: every word separated by a space becomes a list item. We then store that list in a variable called source_code.
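To illustrate, printing the result of the split shows the word list the lexer will loop over:

source_code = 'int result = 100;'.split()
print(source_code)  # ['int', 'result', '=', '100;']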
We then start looping through our source_code list word by word.
We now perform our first check:
if word in ['str', 'int', 'bool']:
    tokens.append(['DATATYPE', word])
What we check for here is a datatype keyword, which tells us what type our variable will be. If the word matches, we append a DATATYPE token to our list.
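For example, when the loop reaches the first word, 'int', this check matches and the tokens list becomes:

[['DATATYPE', 'int']]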
After that we perform more checks like the one above, identifying each word in our source code and creating a token for it. These tokens will then be passed on to the parser to create an Abstract Syntax Tree (AST).
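To give a rough idea of what that next stage could look like, here is a minimal sketch of a parser step that consumes our token stream and builds a simple dictionary-based AST node. This is not part of the lexer above; the parse_declaration function name and the node layout are just assumptions for illustration.

def parse_declaration(tokens):
    # Hypothetical helper: expects exactly the five tokens our lexer produced
    datatype, identifier, operator, value, end = tokens
    assert datatype[0] == 'DATATYPE'
    assert identifier[0] == 'IDENTIFIER'
    assert operator == ['OPERATOR', '=']
    assert value[0] == 'INTEGER'
    assert end[0] == 'END_STATEMENT'
    # Build a simple AST node describing the variable declaration
    return {
        'node': 'VariableDeclaration',
        'datatype': datatype[1],
        'name': identifier[1],
        'value': int(value[1]),
    }

print(parse_declaration([['DATATYPE', 'int'], ['IDENTIFIER', 'result'],
                         ['OPERATOR', '='], ['INTEGER', '100'],
                         ['END_STATEMENT', ';']]))
# {'node': 'VariableDeclaration', 'datatype': 'int', 'name': 'result', 'value': 100}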
If you want to interact with this code and play around with it, here is a link to the code in an online compiler: https://repl.it/J9Hj/latest