Earley Parser

What is Earley Parser

A parser for context-free grammars

works for arbitrary (even ambiguous) context-free grammars
driven by the input stream and start symbol
uses the well-known technique of dynamic programming (avoids unnecessary computation)
runs in O(n^3) in general
for unambiguous grammars it runs in O(n^2)
for grammars handled by LR parsers (we will see those later), it runs in O(n)

Items in Earley Parser

Earley parser uses grammar rules extended with exactly one 'dot' inserted, indicating the point reached by parser so far.

Call this a 'dotted production'.

Consider an example grammar rule X::=UVW. There are 4 dotted productions for this rule:

X::=.UVW
X::=U.VW
X::=UV.W
X::=UVW.

Another example: for the grammar production

assignment ::= ID ":=" expression ";"

we have the follownig dotted productions:

assignment ::= . ID ":=" expression ";"
assignment ::= ID . ":=" expression ";"
assignment ::= ID ":=" . expression ";"
assignment ::= ID ":=" expression . ";"
assignment ::= ID ":=" expression ";" .

For each of the positions j in the input string, Earley parser stores items (called states by Earley), which are pairs of

dotted right-hand side (e.g. X::=UV.W)
starting position i

If e.g. the item (X::=UV.W, i) is in the set S(j), then it should be possible to parse the substring of input from position i to position j using the sequence of symbols UV.

Earley Parsing Idea

Input string: $w(0) ... w(N-1)$ EOF

Parser examines input left to right, computes $S(i)$ for i from 0 to N-1

If D is start symbol in the original grammar, we add extra production

D'::=D EOF

and make D' start state, so parser can detect end of input.

Let $\Gamma^*$ denote sequences of terminals and non-terminals of grammar.

Goal: $S(j)$ should contain all items that could appear in parsing of strings beginning with $w(0),...,w(j-1)$

More precisely: $S(j)$ contains those items $(X::=r.s, i)$ for $i \le j$ , such that for some $p,q \in \Gamma^*$

$D' \Rightarrow^{*} p X q$
$X::=rs$ is a grammar rule, $r,s \in G^*$
$p \Rightarrow^{*} w(0)...w(i-1)$
$r \Rightarrow^{*} w(i)...w(j-1)$

We sketch these conditions as follows:

w(0) ... w(i-1) w(i) ... w(j-1) w(j)     ...  w(N-1)
  |         |                    ^
  |    p    |         r          |   s?  |       |
            |                   parser position
            |              X?            |    q? |

For $s,X,q$ the parser did not check yet whether they agree with the input string.

Steps of Early Parsing Algorithm

Initially, let $S(0) = \{(D'::=.D\ \mbox{\small EOF},\ 0)\}$

When scanning input at position j, parser does the following operations ( $p,q,r$ are sequences of terminals and non-terminals):

Prediction

If $(X::=p.Yq, i) \in S(j)$ and $Y::=r$ is a grammar rule, then

$\begin{equation*} S(j) = S(j) \cup \{ (Y::=.r,j) \} \end{equation*}$

Scanning

If $w(j)=a$ and $(X::=p.aq, i) \in S(j)$ then (we can skip a):

$\begin{equation*} S(j+1) = S(j+1) \cup \{(X::=pa.q, i)\} \end{equation*}$

Completion

If $(X::=p., i) \in S(j)$ and $(Y::=q.Xr, k) \in S(i)$ then

$\begin{equation*} S(j) = S(j) \cup \{(Y::=qX.r, k)\} \end{equation*}$

sketch of completion:

w(0) ... w(k)   ...   w(i)   ...   w(j)
           |     q      |     p      |
                    Y::=q.Xr       X::=p.
                                   Y::=qX.r

After finishing all steps that are possible for $j$ , the parser moves to $j+1$ .

Remarks on Complexity

We measure parsing time as a function of input length N, for a fixed grammar.

fixed number of dotted rules

In S(j) there can be at most j copies of each dotted rule (for all possible start positions)

prediction and scanning take O(N)
completion can take O(N^2) (but size of result still O(N))

Summing up work over all S(j), we get O(N^3)

For un-ambiguous grammar, completion is proportional to its result, so it is O(N)

overall parsing time is O(N^2)
becomes O(N) if |S(j)| is independent of N (bounded-state grammars)
- if number of previous points in input to examine is bounded by constant

LARA