LARA

Earley Parser

What is Earley Parser

A parser for context-free grammars

Can be thought of as using dynamic programming, like CYK

  • but driven by the input stream and start symbol (avoids unnecessary computation)
  • like LR parsers uses notion of items

Items in Earley Parser

For right-hand sides of the form X::= UVW, a CYK Parsing Algorithm first uses Transforming to Chomsky Normal Form to introduce fresh non-terminals for partial right-hand sides, such as Z::=UV.

Earley parser instead uses grammar rules extended with exactly one 'dot' inserted, indicating the point reached by parser so far

X ::= UV.W

Call this a 'dotted production'.

We think of this as denoting non-terminal Z standing for UV.

CYK Parsing Algorithm uses a matrix d(i)(j) storing all non-terminals that can derive substring from i to j.

  • instead of having e.g. d(i)(j) = {Z,P}, d(i')(j)={Q},
  • use array d' storing non-terminals with their starting position: S(j)={(Z,i),(P,i),(Q,i')}

Earley parser stores at position j items (called states by Earley), which are pairs (X::=UV.W, i) of

  • dotted right-hand side X::=UV.W
  • starting position i

Note: for X::=UVW there are 4 dotted productions:

X::=.UVW
X::=U.VW
X::=UV.W
X::=UVW.

Earley Parsing Idea

Input string: w(0) … w(N-1) EOF

Parser examines input left to right, computes S(i) for i from 0 to N-1

If D is start symbol in the original grammar, we add extra production

D'::=D EOF

and make D' start state, so parser can detect end of input.

Let $G^*$ denote sequences of terminals and non-terminals of grammar.

Goal: S(j) should contain all items that could appear in parsing of strings beginning with w(0),…,w(j-1)

More precisely: S(j) contains those items $(X::=r.s, i)$ for $i \le j$ such that for some $p,q \in G^*$

  • $D' \Rightarrow^{*} p X q$
  • $X::=rs$ is a grammar rule, $r,s \in G^*$
  • $p \Rightarrow^{*} w(0)...w(i-1)$
  • $r \Rightarrow^{*} w(i)...w(j-1)$

We sketch these conditions as follows:

w(0) ... w(i-1) w(i) ... w(j-1) w(j)     ...  w(N-1)
  |         |                    ^
  |    p    |         r          |   s?  |       |
            |                   parser position
            |              X?            |    q? |

For $s,X,q$ the parser did not check yet whether they agree with the input string.

Steps of Early Parsing Algorithm

Initially, let S(0) = {(D'::=.D EOF,0)}

When scanning input at position j, parser does the following operations (p,q,r are sequences of terminals and non-terminals):

Prediction

If $(X::=p.Yq, i) \in S(j)$ and $Y::=r$ is a grammar rule, then

\begin{equation*}
   S(j) = S(j) \cup \{ (Y::=.r,j) \}
\end{equation*}

Scanning

If $w(j)=a$ and $(X::=p.aq, i) \in S(j)$ then (we can skip a):

\begin{equation*}
   S(j+1) = S(j+1) \cup \{(X::=pa.q, i)\}
\end{equation*}

Completion

If $(X::=p., i) \in S(j)$ and $(Y::=q.Xr, k) \in S(i)$ then

\begin{equation*}
   S(j) = S(j) \cup \{(Y::=qX.r, k)\}
\end{equation*}

sketch of completion:

w(0) ... w(k)   ...   w(i)   ...   w(j)
           |     q      |     p      |
                    Y::=q.Xr       X::=p.
                                   Y::=qX.r

Remarks on Complexity

We measure parsing time as a function of input length N, for a fixed grammar.

  • fixed number of dotted rules

In S(j) there can be at most j copies of each dotted rule (for all possible start positions)

  • prediction and scanning take O(N)
  • completion can take O(N^2) (but size of result still O(N))

Summing up work over all S(j), we get O(N^3)

For ambiguous grammar, completion is proportional to its result, so it is O(N)

  • overall parsing time is O(N^2)
  • becomes O(N) if |S(j)| is independent of N (bounded-state grammars)
    • if number of previous points in input to examine is bounded by constant

Further Information