Python Language Tutorial => Part 2: Parsing Tokenized Input with Yacc

Example

This section explains how the tokenized input from Part 1 is processed - it is done using Context Free Grammars (CFGs). The grammar must be specified, and the tokens are processed according to the grammar. Under the hood, the parser uses an LALR parser.

# Yacc example

import ply.yacc as yacc

# Get the token map from the lexer. This is required.
from calclex import tokens

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]

def p_expression_term(p):
    'expression : term'
    p[0] = p[1]

def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]

def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]

def p_term_factor(p):
    'term : factor'
    p[0] = p[1]

def p_factor_num(p):
    'factor : NUMBER'
    p[0] = p[1]

def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]

# Error rule for syntax errors
def p_error(p):
    print("Syntax error in input!")

# Build the parser
parser = yacc.yacc()

while True:
   try:
       s = raw_input('calc > ')
   except EOFError:
       break
   if not s: continue
   result = parser.parse(s)
   print(result)

Breakdown

Each grammar rule is defined by a function where the docstring to that function contains the appropriate context-free grammar specification. The statements that make up the function body implement the semantic actions of the rule. Each function accepts a single argument p that is a sequence containing the values of each grammar symbol in the corresponding rule. The values of p[i] are mapped to grammar symbols as shown here:
```
  def p_expression_plus(p):
      'expression : expression PLUS term'
      #   ^            ^        ^    ^
      #  p[0]         p[1]     p[2] p[3]

      p[0] = p[1] + p[3]
```

For tokens, the "value" of the corresponding p[i] is the same as the p.value attribute assigned in the lexer module. So, PLUS will have the value +.
For non-terminals, the value is determined by whatever is placed in p[0]. If nothing is placed, the value is None. Also, p[-1] is not the same as p[3], since p is not a simple list (p[-1] can specify embedded actions (not discussed here)).

Note that the function can have any name, as long as it is preceeded by p_.

The p_error(p) rule is defined to catch syntax errors (same as yyerror in yacc/bison).

Multiple grammar rules can be combined into a single function, which is a good idea if productions have a similar structure.

  def p_binary_operators(p):
      '''expression : expression PLUS term
                    | expression MINUS term
         term       : term TIMES factor
                    | term DIVIDE factor'''
      if p[2] == '+':
          p[0] = p[1] + p[3]
      elif p[2] == '-':
          p[0] = p[1] - p[3]
      elif p[2] == '*':
          p[0] = p[1] * p[3]
      elif p[2] == '/':
          p[0] = p[1] / p[3]

Character literals can be used instead of tokens.

  def p_binary_operators(p):
      '''expression : expression '+' term
                    | expression '-' term
         term       : term '*' factor
                    | term '/' factor'''
      if p[2] == '+':
          p[0] = p[1] + p[3]
      elif p[2] == '-':
          p[0] = p[1] - p[3]
      elif p[2] == '*':
          p[0] = p[1] * p[3]
      elif p[2] == '/':
          p[0] = p[1] / p[3]

Of course, the literals must be specified in the lexer module.

Empty productions have the form '''symbol : '''
To explicitly set the start symbol, use start = 'foo', where foo is some non-terminal.
Setting precedence and associativity can be done using the precedence variable.
```
  precedence = (
      ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
      ('left', 'PLUS', 'MINUS'),
      ('left', 'TIMES', 'DIVIDE'),
      ('right', 'UMINUS'),            # Unary minus operator
  )
```
Tokens are ordered from lowest to highest precedence. nonassoc means that those tokens do not associate. This means that something like a < b < c is illegal whereas a < b is still legal.
parser.out is a debugging file that is created when the yacc program is executed for the first time. Whenever a shift/reduce conflict occurs, the parser always shifts.

PDF - Download Python Language for free

Previous Next

Python Language

Fastest Entity Framework Extensions

Example

Breakdown

Got any Python Language Question?

Python Language

Python Language Python Lex-Yacc Part 2: Parsing Tokenized Input with Yacc

Fastest Entity Framework Extensions

Example

Breakdown

Got any Python Language Question?