Earley Parser
What is Earley Parser
A parser for context-free grammars
- works for arbitrary, even ambiguous, grammars
- runs in O(n^3) in general
- unlike CYK Parsing Algorithm, it does not require Transforming to Chomsky Normal Form
- for unambiguous grammars it runs in O(n^2)
- for grammars handled by LR parsers, it runs in O(n)
Can be thought of as using dynamic programming, like CYK
- but driven by the input stream and start symbol (avoids unnecessary computation)
- like LR parsers uses notion of items
Items in Earley Parser
For right-hand sides of the form X::= UVW, a CYK Parsing Algorithm first uses Transforming to Chomsky Normal Form to introduce fresh non-terminals for partial right-hand sides, such as Z::=UV.
Earley parser instead uses grammar rules extended with exactly one 'dot' inserted, indicating the point reached by parser so far
X ::= UV.W
Call this a 'dotted production'.
We think of this as denoting non-terminal Z standing for UV.
CYK Parsing Algorithm uses a matrix d(i)(j) storing all non-terminals that can derive substring from i to j.
- instead of having e.g. d(i)(j) = {Z,P}, d(i')(j)={Q},
- use array d' storing non-terminals with their starting position: S(j)={(Z,i),(P,i),(Q,i')}
Earley parser stores at position j items (called states by Earley), which are pairs (X::=UV.W, i) of
- dotted right-hand side X::=UV.W
- starting position i
Note: for X::=UVW there are 4 dotted productions:
X::=.UVW X::=U.VW X::=UV.W X::=UVW.
Earley Parsing Idea
Input string: w(0) … w(N-1) EOF
Parser examines input left to right, computes S(i) for i from 0 to N-1
If D is start symbol in the original grammar, we add extra production
D'::=D EOF
and make D' start state, so parser can detect end of input.
Let denote sequences of terminals and non-terminals of grammar.
Goal: S(j) should contain all items that could appear in parsing of strings beginning with w(0),…,w(j-1)
More precisely: S(j) contains those items for such that for some
- is a grammar rule,
We sketch these conditions as follows:
w(0) ... w(i-1) w(i) ... w(j-1) w(j) ... w(N-1) | | ^ | p | r | s? | | | parser position | X? | q? |
For the parser did not check yet whether they agree with the input string.
Steps of Early Parsing Algorithm
Initially, let S(0) = {(D'::=.D EOF,0)}
When scanning input at position j, parser does the following operations (p,q,r are sequences of terminals and non-terminals):
Prediction
If and is a grammar rule, then
Scanning
If and then (we can skip a):
Completion
If and then
sketch of completion:
w(0) ... w(k) ... w(i) ... w(j) | q | p | Y::=q.Xr X::=p. Y::=qX.r
Remarks on Complexity
We measure parsing time as a function of input length N, for a fixed grammar.
- fixed number of dotted rules
In S(j) there can be at most j copies of each dotted rule (for all possible start positions)
- prediction and scanning take O(N)
- completion can take O(N^2) (but size of result still O(N))
Summing up work over all S(j), we get O(N^3)
For ambiguous grammar, completion is proportional to its result, so it is O(N)
- overall parsing time is O(N^2)
- becomes O(N) if |S(j)| is independent of N (bounded-state grammars)
- if number of previous points in input to examine is bounded by constant