What is Earley Parser
A parser for context-free grammars
- works for arbitrary (even ambiguous) context-free grammars
- driven by the input stream and start symbol
- uses the well-known technique of dynamic programming (avoids unnecessary computation)
- runs in O(n^3) in general
- for unambiguous grammars it runs in O(n^2)
- for grammars handled by LR parsers (we will see those later), it runs in O(n)
Items in Earley Parser
Earley parser uses grammar rules extended with exactly one 'dot' inserted, indicating the point reached by parser so far.
Call this a 'dotted production'.
Consider an example grammar rule X::=UVW. There are 4 dotted productions for this rule:
X::=.UVW X::=U.VW X::=UV.W X::=UVW.
Another example: for the grammar production
assignment ::= ID ":=" expression ";"
we have the follownig dotted productions:
assignment ::= . ID ":=" expression ";" assignment ::= ID . ":=" expression ";" assignment ::= ID ":=" . expression ";" assignment ::= ID ":=" expression . ";" assignment ::= ID ":=" expression ";" .
For each of the positions j in the input string, Earley parser stores items (called states by Earley), which are pairs of
- dotted right-hand side (e.g. X::=UV.W)
- starting position i
If e.g. the item (X::=UV.W, i) is in the set S(j), then it should be possible to parse the substring of input from position i to position j using the sequence of symbols UV.
Earley Parsing Idea
Input string: EOF
Parser examines input left to right, computes for i from 0 to N-1
If D is start symbol in the original grammar, we add extra production
and make D' start state, so parser can detect end of input.
Let denote sequences of terminals and non-terminals of grammar.
Goal: should contain all items that could appear in parsing of strings beginning with
More precisely: contains those items for , such that for some
- is a grammar rule,
We sketch these conditions as follows:
w(0) ... w(i-1) w(i) ... w(j-1) w(j) ... w(N-1) | | ^ | p | r | s? | | | parser position | X? | q? |
For the parser did not check yet whether they agree with the input string.
Steps of Early Parsing Algorithm
When scanning input at position j, parser does the following operations ( are sequences of terminals and non-terminals):
If and is a grammar rule, then
If and then (we can skip a):
If and then
sketch of completion:
w(0) ... w(k) ... w(i) ... w(j) | q | p | Y::=q.Xr X::=p. Y::=qX.r
After finishing all steps that are possible for , the parser moves to .
Remarks on Complexity
We measure parsing time as a function of input length N, for a fixed grammar.
- fixed number of dotted rules
In S(j) there can be at most j copies of each dotted rule (for all possible start positions)
- prediction and scanning take O(N)
- completion can take O(N^2) (but size of result still O(N))
Summing up work over all S(j), we get O(N^3)
For un-ambiguous grammar, completion is proportional to its result, so it is O(N)
- overall parsing time is O(N^2)
- becomes O(N) if |S(j)| is independent of N (bounded-state grammars)
- if number of previous points in input to examine is bounded by constant