Recursive Descent (Predictive) Parsing

Recursive descent is a decent parsing technique.

descent:
a movement downward

decent:
adequate: sufficient for the purpose; “an adequate income”; “the food was adequate”; “a decent wage”; “enough food”; “good enough”

(from Wordnet)

Basic Idea

construct parse tree top-down
use current symbol to decide which production to use
write one procedure for constructing each non-terminal
recursive non-terminals lead to recursive procedures

alternative in grammar becomes 'if'
repetition becomes 'while'

Example: Recursive Descent Parsing of Polynomials

In context-free grammars we have seen a grammar of polynomials.

Consider first this version of grammar:

polynomialsLL.grammar
this grammar version is very nice for recursive descent parsing

Here is a parser for this grammar: polynomials.pscala

Note correspondence:

(“+” term)*

⇒ while (lex.current=PLUS) { lex.next; parseTerm }

Example: running the code above on “x + y*(u+3)”

remaining input	recursion stack
x + y*(u+3)	polynomial
x + y*(u+3)	polynomial,term
x + y*(u+3)	polynomial,term,factor
+ y*(u+3)	polynomial,term
+ y*(u+3)	polynomial
y*(u+3)	polynomial
y*(u+3)	polynomial,term
y*(u+3)	polynomial,term,factor
*(u+3)	polynomial,term
(u+3)	polynomial,term
(u+3)	polynomial,term,factor
u+3)	polynomial,term,factor
u+3)	polynomial,term,factor,polynomial
…	polynomial,term,factor,polynomial,…
)	polynomial,term,factor,polynomial
)	polynomial,term,factor
	polynomial,term,factor
	polynomial,term
	polynomial

For statements, we use keyword to decide what we are about to parse:

“if” X | “while” Y

⇒ if (lex.current=IF) parseX else if (lex.current=WHILE) parseY

Transforming Grammar: Left Factoring

Consider a variant of grammar with this definition:

polynomial ::= term | term "+" polynomial

How to write parsePolynomial procedure for this grammar?

'term' can be arbitrarily complex
which alternative to choose?

Solution: apply left-factoring:

polynomial ::= term ("" | "+" polynomial)

Now we can construct parser:

def parsePolynomial = {
  parseTerm
  if (lex.current==PLUS) {
    lex.next
    parsePolynomial
  }
}

We obtained tail-recursive version equivalent to a while loop.

Transforming Grammar: Left Recursion Makes Trouble

Instead of

polynomial ::= term ("" | "+" polynomial)

consider grammar defining the same language:

polynomial ::= ("" | polynomial "+") term

we need to parse polynomial, so we check which alternative to parse, so we check whether the current symbol is parsed by polynomial, so

we need to parse polynomial, so we check which alternative to parse, so we check whether the current symbol is parsed by polynomial, so …

This does not work, because it contains left-recursive alternative

polynomial ::= polynomial "+" term

For recursive descent parsing, we need right-recursive rules, which work, like

polynomial ::= term "+" polynomial

Simple forms of right recursion can be expressed using repetition (and become while loops), which work.

Recursive Descent Parser for While Language

If you understand how we do this, you should be able to do it for MiniJava in homework as well.

The parser: WhileParser.scala

simplified version of Concrete Syntax of While
can connect to Hand-Written Scanner for While Language i.e. Lexer.scala

When Exactly Does Recursive Descent Work?

When can we be sure that recursive descent parser will parse grammar correctly?

it will accept without error exactly when string can be derived

Consider grammar without repetition construct * (eliminate it using right recursion).

Given rules

X ::= p
X ::= q

where p,q are sequences of terminals and non-terminals, we need to decide which one to use when parsing X, based on the first character of possible string given by p and q.

first(p) - first characters of strings that p can generate
first(q) - first characters of strings that q can generate
requirement: first(p) and first(q) are disjoint

How to choose alternative: check whether current token belongs to first(p) or first(q)

Computing 'first' in Simple Case

Assume for now

no non-terminal derives empty string, that is:

For every terminal X, if X =⇒* w and w is a string of terminals, then w is non-empty

We then have

first(X …) = first(X)
first(“a” …) = {a}

We compute first(p) set of terminals for

every right-hand side alternative p, and
every non-terminal X

Example grammar:

S ::= X | Y
X ::= "b" | S Y
Y ::= "a" X "b" | Y "b"

Equations:

first(S) = first(X|Y) = first(X) $\cup$ first(Y)
first(X) = first(“b” | S Y) = first(“b”) $\cup$ first(S Y) = {b} $\cup$ first(S)
first(Y) = first(“a” X “b”|Y “b”) = first(“a” X “b”) $\cup$ first(Y “b”) = {a} $\cup$ first(Y)

How to solve equations for first?

Expansion: first(S) = first(X) $\cup$ first(Y) = {b} $\cup$ first(S) $\cup$ {a} $\cup$ first(Y)

could keep expanding forever
does further expansion make difference?
is there a solution?
is there unique solution?

Bottom up computation, while there is change:

initially all sets are empty
if right hand side is bigger, add different to left-hand side

Solving equations

first(S) = first(X) $\cup$ first(Y)
first(X) = {b} $\cup$ first(S)
first(Y) = {a} $\cup$ first(Y)

bottom up

first(S)	first(X)	first(Y)
{}	{}	{}
{}	{b}	{a}
{a,b}	{b}	{a}
{a,b}	{a,b}	{a}
{a,b}	{a,b}	{a}

Does this process terminate?

all sets are increasing
a finite number of symbols in grammar

There is a unique least solution

this is what we want to compute
the above bottom up algorithm computes it

General Remark:

this is an example of a 'fixed point' computation algorithm
also be useful for semantic analysis, later

General Case: Nullable Non-terminals

In general, a non-terminal can expand to empty string

example: statement sequence in while language grammar

first(Y Z) = first(Y)? what if Y can derive empty string?

A sequence of non-terminals is nullable if it can derive an empty string

this is case iff each non-terminal is nullable

Computing nullable non-terminals:

empty string is nullable
if one right-hand side of non-terminal is nullable, so is the non-terminal

Algorithm:

nullable = {}
changed = true
while (changed) {
  if X is not nullable and either
  1) grammar contains rule
    X ::= "" | ...
  or
  2) grammar contains rule 
    X ::= Y1 ... Yn | ...
      and
    {Y1,...,Yn} is contained in nullable
  then
    nullable = nullable union {X}
    changed = true
}

Computing First Given Nullable

Computing first(X), given rule X = $Y_1$ … $Y_i$ … $Y_k$

if $Y_1$ ,…, $Y_{i-1}$ are all nullable, then add first( $Y_i$ ) to first(X)

Then repeat until no change, as before.

The Need for Follow

What if we have

X = Y Z | U

and U is nullable? When can we choose a nullable alternative (U)?

if current token is either in first(U) or it could follow non-terminal X

t is in follow(X), if there exists a derivation containing substring X t

Example of language with 'named blocks':

statements ::= "" | statement statements
statement ::= assign | block
assign ::= ID "=" (ID|INT) ";"
block ::= "begin" ID statements ID "end"

Try to parse

begin myPrettyCode
  x = 3;
  y = x;
myPrettyCode end

Problem parsing 'statements':

identifier could start alternative 'statement statements'
identifier could follow 'statements', so we may wish to parse “”

Computing follow( $Y_i$ ), given rule X = $Y_1$ … $Y_i$ … $Y_j$ … $Y_k$

add first( $Y_j$ ), if $Y_{i+1}$ ,…, $Y_{j-1}$ are all nullable
add follow( $X$ ), if $Y_{i+1}$ , …, $Y_k$ are all nullable

Possible computation order:

nullable
first
follow

Example: compute these values for grammar above

follow = {}
first
  statements {ID, "begin"}
  statement  {ID, "begin"}
  assign     {ID}
  block      {"begin"}
follow
  statements {ID}
  statement  {ID, "begin"}
  assign     {ID, "begin"}
  block      {ID, "begin"}

The grammar cannot be parsed because we have

statements ::= "" | statement statements

where

statements $\in$ nullable
first(statements) $\cap$ follow(statements) = {ID} $\neq \emptyset$

If the parser sees ID, it does not know if it should

finish parsing 'statements' or
parse another 'statement'

Recursive Descent Parsing

First, compute nullable, first, follow

Then, make parsing table which stores the alternative, given

non-terminal being parsed (in which procedure we are)
current token

Given (X ::= $p_1$ | … | $p_n$ ) we insert alternative j into table iff

t $\in$ first(p_j), or
nullable( $p_j$ ) and t $\in$ follow(X)

If in parsing table we have two or more alternatives for same token and non-terminal:

we have conflict
we cannot parse grammar using recursive descent

Otherwise, we say that the grammar is LL(1)

left-to-right parse (the way input is examined)
leftmost derivation (expand leftmost non-terminal first–recursion in descent does this)
one token lookahead (current token)

What about empty entries?

they indicate errors
report that we expect one of tokens in
- first(X), if X $\notin$ nullable
- first(X) $\cup$ follow(X), if X $\in$ nullable

Error Recovery

According to some opinions error recover is not worth it

one error tends to trigger others
report error and stop
put cursor on the error
restart when user fixes it

Approaches to error recovery in recursive descent parsing:

insert some tokens (hard to guarantee termination)
skip some tokens: when parsing X, skip until follow(X) or EOF

Summary

Recursive descent is decent because

it is efficient
it is simple enough to write by hand
when we write by hand, we are not limited to following only grammar rules
- if e.g. multiple alternative begin with same ID, we can lookup previous declaration of ID and decide which alternative to follow

Exercise

Compute parsing table for WhileParser.scala.