How Parsing Works (Part 2)

LR Parsing

What is LR parsing? People usually describe it as some sort of “bottom-up” parsing technique. That true, but doesn’t build intuition.

From complexity theory, we know context-free grammars (CFGs) are “more powerful” than deterministic finite automatas (DFAs). If you have ever tried (and failed at) parsing HTML with a regex, you’d see what I mean.

All languages describable by a DFA are describable by a CFG, the inverse is not true. For example, this grammar describes balanced brackets:

S -> E
E -> 0
E -> ( E )

It turns out you can’t write a DFA that can figure out if brackets are balanced or not.

Why is this relevant to us? Well, LR parsing boils down to: “some CFGs are too hard for DFAs to parse. What if we only consider the languages that are decidable by a DFA?” (DFAs all run in by the way, which is fast enough for us). It’s the equivalent of a student saying, “teacher, it’s not my fault I can’t do the homework; it’s your fault that the homework is too hard.”

Let’s first look at a specific LR parser: LR(0).

LR(0) Parsing

First, we’ll use NFAs instead of DFAs, since they’re equivalent, except NFAs are a lot easier to work with. So suppose we know a CFG can be parsed by a NFA. How would we go about constructing the NFA? Let’s consider a really simple grammar, that only accepts the string “hi”:

S -> h i

A natural NFA might look like:

-> S -----> [h] -----> [i]
       h          i

Where S is the start state and [i] is the accepting state. This works fine, but let’s adopt a different notation:

-> [S -> . h i] -----> [S -> h . i] -----> [S -> h i .]
                  h                   i

So the period (.) represents “where we are in reading the production rule’s right-hand-side sequence”. Note we simply go left-to-right in the production rule (i.e. a transition corresponds to “making progress” in that rule).

This works fine, but grammars usually have multiple rules. Consider cases where we need to expand things:

S -> A b
A -> a a
A -> a

The language is {"ab", "aab"}. Say we’re at a state [S -> . A b]. We can’t read A directly, so we want to ε-transition to [A -> . a] and [A -> . a a] to read a (equivalent to going down a parse tree). We ε-transition because we don’t know which rule is actually relevant, so we keep both options open.

But after we read a rule i.e. we’re at [A -> a .], we want to go back to the production rule we came from (equivalent to going up the parse tree), so we ε-transition to [S -> A . b] (we go past A because we just read it!).

That’s LR(0) parsing in a nutshell. The transitions and ε-transitions I described are sufficient to draw a NFA that decides the CFG.

rss facebook twitter github youtube mail spotify instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora