Earley Parser

What is Earley Parser

A parser for context-free grammars

works for arbitrary, even ambiguous, grammars
runs in O(n^3) in general
unlike CYK Parsing Algorithm, it does not require Transforming to Chomsky Normal Form
for unambiguous grammars it runs in O(n^2)
for grammars handled by LR parsers, it runs in O(n)

Can be thought of as using dynamic programming, like CYK

but driven by the input stream and start symbol (avoids unnecessary computation)
like LR parsers uses notion of items

Items in Earley Parser

For right-hand sides of the form X::= UVW, a CYK Parsing Algorithm first uses Transforming to Chomsky Normal Form to introduce fresh non-terminals for partial right-hand sides, such as Z::=UV.

Earley parser instead uses grammar rules extended with exactly one 'dot' inserted, indicating the point reached by parser so far

X ::= UV.W

Call this a 'dotted production'.

We think of this as denoting non-terminal Z standing for UV.

CYK Parsing Algorithm uses a matrix d(i)(j) storing all non-terminals that can derive substring from i to j.

instead of having e.g. d(i)(j) = {Z,P}, d(i')(j)={Q},
use array d' storing non-terminals with their starting position: S(j)={(Z,i),(P,i),(Q,i')}

Earley parser stores at position j items (called states by Earley), which are pairs (X::=UV.W, i) of

dotted right-hand side X::=UV.W
starting position i

Note: for X::=UVW there are 4 dotted productions:

X::=.UVW
X::=U.VW
X::=UV.W
X::=UVW.

Earley Parsing Idea

Input string: w(0) … w(N-1) EOF

Parser examines input left to right, computes S(i) for i from 0 to N-1

If D is start symbol in the original grammar, we add extra production

D'::=D EOF

and make D' start state, so parser can detect end of input.

Let $G^*$ denote sequences of terminals and non-terminals of grammar.

Goal: S(j) should contain all items that could appear in parsing of strings beginning with w(0),…,w(j-1)

More precisely: S(j) contains those items $(X::=r.s, i)$ for $i \le j$ such that for some $p,q \in G^*$

$D' \Rightarrow^{*} p X q$
$X::=rs$ is a grammar rule, $r,s \in G^*$
$p \Rightarrow^{*} w(0)...w(i-1)$
$r \Rightarrow^{*} w(i)...w(j-1)$

We sketch these conditions as follows:

w(0) ... w(i-1) w(i) ... w(j-1) w(j)     ...  w(N-1)
  |         |                    ^
  |    p    |         r          |   s?  |       |
            |                   parser position
            |              X?            |    q? |

For $s,X,q$ the parser did not check yet whether they agree with the input string.

Steps of Early Parsing Algorithm

Initially, let S(0) = {(D'::=.D EOF,0)}

When scanning input at position j, parser does the following operations (p,q,r are sequences of terminals and non-terminals):

Prediction

If $(X::=p.Yq, i) \in S(j)$ and $Y::=r$ is a grammar rule, then

$\begin{equation*} S(j) = S(j) \cup \{ (Y::=.r,j) \} \end{equation*}$

Scanning

If $w(j)=a$ and $(X::=p.aq, i) \in S(j)$ then (we can skip a):

$\begin{equation*} S(j+1) = S(j+1) \cup \{(X::=pa.q, i)\} \end{equation*}$

Completion

If $(X::=p., i) \in S(j)$ and $(Y::=q.Xr, k) \in S(i)$ then

$\begin{equation*} S(j) = S(j) \cup \{(Y::=qX.r, k)\} \end{equation*}$

sketch of completion:

w(0) ... w(k)   ...   w(i)   ...   w(j)
           |     q      |     p      |
                    Y::=q.Xr       X::=p.
                                   Y::=qX.r

Remarks on Complexity

We measure parsing time as a function of input length N, for a fixed grammar.

fixed number of dotted rules

In S(j) there can be at most j copies of each dotted rule (for all possible start positions)

prediction and scanning take O(N)
completion can take O(N^2) (but size of result still O(N))

Summing up work over all S(j), we get O(N^3)

For ambiguous grammar, completion is proportional to its result, so it is O(N)

overall parsing time is O(N^2)
becomes O(N) if |S(j)| is independent of N (bounded-state grammars)
- if number of previous points in input to examine is bounded by constant

LARA