# Earley Parser

## What is Earley Parser

A parser for context-free grammars

Can be thought of as using dynamic programming, like CYK

• but driven by the input stream and start symbol (avoids unnecessary computation)
• like LR parsers uses notion of items

## Items in Earley Parser

For right-hand sides of the form X::= UVW, a CYK Parsing Algorithm first uses Transforming to Chomsky Normal Form to introduce fresh non-terminals for partial right-hand sides, such as Z::=UV.

Earley parser instead uses grammar rules extended with exactly one 'dot' inserted, indicating the point reached by parser so far

X ::= UV.W

Call this a 'dotted production'.

We think of this as denoting non-terminal Z standing for UV.

CYK Parsing Algorithm uses a matrix d(i)(j) storing all non-terminals that can derive substring from i to j.

• instead of having e.g. d(i)(j) = {Z,P}, d(i')(j)={Q},
• use array d' storing non-terminals with their starting position: S(j)={(Z,i),(P,i),(Q,i')}

Earley parser stores at position j items (called states by Earley), which are pairs (X::=UV.W, i) of

• dotted right-hand side X::=UV.W
• starting position i

Note: for X::=UVW there are 4 dotted productions:

X::=.UVW
X::=U.VW
X::=UV.W
X::=UVW.

## Earley Parsing Idea

Input string: w(0) … w(N-1) EOF

Parser examines input left to right, computes S(i) for i from 0 to N-1

If D is start symbol in the original grammar, we add extra production

D'::=D EOF

and make D' start state, so parser can detect end of input.

Let denote sequences of terminals and non-terminals of grammar.

Goal: S(j) should contain all items that could appear in parsing of strings beginning with w(0),…,w(j-1)

More precisely: S(j) contains those items for such that for some • • is a grammar rule, • • We sketch these conditions as follows:

w(0) ... w(i-1) w(i) ... w(j-1) w(j)     ...  w(N-1)
|         |                    ^
|    p    |         r          |   s?  |       |
|                   parser position
|              X?            |    q? |

For the parser did not check yet whether they agree with the input string.

## Steps of Early Parsing Algorithm

Initially, let S(0) = {(D'::=.D EOF,0)}

When scanning input at position j, parser does the following operations (p,q,r are sequences of terminals and non-terminals):

#### Prediction

If and is a grammar rule, then #### Scanning

If and then (we can skip a): #### Completion

If and then sketch of completion:

w(0) ... w(k)   ...   w(i)   ...   w(j)
|     q      |     p      |
Y::=q.Xr       X::=p.
Y::=qX.r

## Remarks on Complexity

We measure parsing time as a function of input length N, for a fixed grammar.

• fixed number of dotted rules

In S(j) there can be at most j copies of each dotted rule (for all possible start positions)

• prediction and scanning take O(N)
• completion can take O(N^2) (but size of result still O(N))

Summing up work over all S(j), we get O(N^3)

For ambiguous grammar, completion is proportional to its result, so it is O(N)

• overall parsing time is O(N^2)
• becomes O(N) if |S(j)| is independent of N (bounded-state grammars)
• if number of previous points in input to examine is bounded by constant