This section explains how the tokenized input from Part 1 is parsed using a context-free grammar (CFG). The grammar must be specified as a set of rules, and the tokens are processed according to it. Under the hood, yacc.py builds an LALR(1) parser.
# Yacc example

import ply.yacc as yacc

# Get the token map from the lexer. This is required.
from calclex import tokens

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]

def p_expression_term(p):
    'expression : term'
    p[0] = p[1]

def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]

def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]

def p_term_factor(p):
    'term : factor'
    p[0] = p[1]

def p_factor_num(p):
    'factor : NUMBER'
    p[0] = p[1]

def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]

# Error rule for syntax errors
def p_error(p):
    print("Syntax error in input!")

# Build the parser
parser = yacc.yacc()

while True:
    try:
        s = input('calc > ')   # use raw_input() on Python 2
    except EOFError:
        break
    if not s:
        continue
    result = parser.parse(s)
    print(result)
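A sample session (assuming the calclex lexer from Part 1 converts NUMBER tokens to integers) might look like this:

calc > 2 + 3 * 4
14
calc > (2 + 3) * 4
20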
Each grammar rule is defined by a function whose docstring contains the corresponding context-free grammar specification. The statements that make up the function body implement the semantic actions of the rule. Each function accepts a single argument p that is a sequence containing the values of each grammar symbol in the corresponding rule. The values of p[i] are mapped to grammar symbols as shown here:
def p_expression_plus(p):
    'expression : expression PLUS term'
    #   ^            ^        ^    ^
    #  p[0]         p[1]     p[2] p[3]
    p[0] = p[1] + p[3]
For tokens, the "value" of the corresponding p[i]
is the same as the p.value
attribute assigned in the lexer module. So, PLUS
will have the value +
.
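Recall the NUMBER rule from the calclex lexer in Part 1, which is why p[i] for a NUMBER is an int rather than a string:

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)   # the parser sees this integer as p[i]
    return t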
For non-terminals, the value is determined by whatever is placed in p[0]. If nothing is placed, the value is None.
Also, p[-1] is not the same as p[3], since p is not a simple list (negative indices like p[-1] are used for embedded actions, not discussed here).
Note that the function can have any name, as long as it is prefixed with p_.
The p_error(p) rule is defined to catch syntax errors (the same role as yyerror in yacc/bison).
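A slightly more informative error handler can inspect the offending token; here is a minimal sketch using the token's type and value attributes:

def p_error(p):
    if p:
        print("Syntax error at token %s (value %r)" % (p.type, p.value))
    else:
        print("Syntax error at end of input!")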
Multiple grammar rules can be combined into a single function, which is a good idea if productions have a similar structure.
def p_binary_operators(p):
    '''expression : expression PLUS term
                  | expression MINUS term
       term       : term TIMES factor
                  | term DIVIDE factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
Character literals can be used instead of tokens.
def p_binary_operators(p):
    '''expression : expression '+' term
                  | expression '-' term
       term       : term '*' factor
                  | term '/' factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
Of course, the literals must be specified in the lexer module.
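In PLY, this is done with the literals variable in the lexer:

# In the lexer module (e.g. calclex.py)
literals = ['+', '-', '*', '/']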
Empty productions have the form '''symbol : '''
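A common convention from the PLY documentation is to define a reusable empty rule, which other rules can then reference on their right-hand side instead of writing a bare empty production:

def p_empty(p):
    'empty :'
    pass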
To explicitly set the start symbol, use start = 'foo', where foo is some non-terminal.
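For example, to begin parsing at expression regardless of which rule is defined first:

# Parsing starts at 'expression' instead of the first rule in the file
start = 'expression'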
Setting precedence and associativity can be done using the precedence variable.
precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),                      # Unary minus operator
)
Tokens are ordered from lowest to highest precedence. nonassoc means that those tokens do not associate: something like a < b < c is illegal, whereas a < b is still legal.
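UMINUS is not a real token; it is a placeholder whose precedence level a rule can borrow with %prec. The PLY documentation uses it to give unary minus higher precedence than binary MINUS:

def p_expr_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = -p[2]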
parser.out is a debugging file created the first time the parser tables are generated; it describes the full LALR state machine. Whenever an unresolved shift/reduce conflict occurs, the parser always shifts.
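To regenerate the tables with debugging output (including parser.out), the parser can be built with the debug flag:

# Forces generation of parser.out describing the LALR state machine
parser = yacc.yacc(debug=True)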