CS 444 (Compilers) is boring. Honestly, who finds learning rules like “
java.lang.Object is an implicit supertype for all classes and interfaces that do not extend any other supertype” interesting?
99% of the course focuses on learning the quirks of Java and figuring out how to turn
(new Foo()).bar() into x86 code. We will explore the only theoretically interesting part of the course - parsing.
What is parsing? To understand parsing, we have to first understand what a grammar is. A grammar describes some language. A language is a set of token sequences. Tokens are basically strings or characters with some useful meaning.
For example, a set of tokens may be , which are some numbers and operators. A language for this may be “all valid math expressions”. So we want
0 + 1 and
1 + 1 * 0 to be in the language, but not
1 + +. That would be silly.
That’s not a very precise definition of a language, right? A grammar makes it more precise by specifying production rules, or describes how we can turn
1 + 1 * 0 into a tree. A production rule allows us to turn a symbol into a (possibly empty) sequence of symbols + tokens.
For example, a grammar for the above may be:
S -> E E -> 0 E -> 1 E -> E + E E -> E * E
In English: a “valid math expression” is either a number (specifically
1), or 2 expressions added/multiplied together. We use symbols like
E (i.e. capital letters) to describe “intermediary” results, with
S being the starting symbol.
So what’s parsing? A parsing algorithm, or parser, takes a grammar and some sequence of tokens, and generates a tree that matches the sequence. And we can use parse trees in a variety of ways, like calculating the value of an expression.
So hopefully, a parser can take the above grammar and a sequence like
1 + 1 * 0 into a tree like:
S | E / | \ E + E | | 1 E * E | | 1 0
1 + (1 * 0). So generating a tree is basically like drawing “brackets” inside the sequence. Of course, we might get this instead, assuming our parser didn’t take grade-school math:
S | E / | \ E * E | | E + E 0 | | 1 1
(1 + 1) * 0. Drawing trees is hard.
Let’s learn how some parsers work.
It turns out the given grammar is “too hard” for a class of parsers called “LL parsers” (LL means “left-to-right, left derivation”). Mostly because of the ambiguity shown above, and we want LL parsers to run in i.e. a one-time pass through the input.
So let’s try an “easier” grammar:
(1) S -> E (2) E -> a F (3) F -> + a (4) F -> ε ("nothing", or empty sequence)
This grammar describes the language .
What is LL Parsing? The main intuition is as follows: we know every valid expression must have a root node,
S must turn into
E, which turns into
a F, so let’s start with that.
Suppose our input is
a + a. We look at the next token,
a, and say, “gee,
a matches the
a F, so what we’re really trying to find is a tree for
+ a given
Now we reduce the problem down to finding a tree for
+ a, given
F can turn into 2 different things,
+ a and
ε, so we have to look at the next token, which is
+. At this point, we say, “aha, we must apply the rule
F -> + a, because we’re smart”.
So our steps looks like:
S -> E [apply rule 1] -> a F [rule 2] -> a (+ a) [rule 3]
That’s basically how LL parsing works! Except we glossed over the “we’re smart” part. Let’s look at it more carefully.
Suppose our grammar is like:
(1) S -> A B (2) A -> a (3) A -> ε (4) B -> b
So before reading the input, we know that
S must turn into
[A B], which we call our “stack” (which is really more of a queue). Now suppose our input is
a b. How can we know to apply rule 2 instead of 3?
Well, suppose our stack is
[(some token) ...]. Then the next token we see must match our “some token”. E.g. if our stack is
[a A B], the next token must be
So we really only get confused if our stack starts with a symbol instead of a token, like
[A B]. And suppose the next token is
The key insight: either we use
A must derive a bunch of stuff that lead to a sequence
a ..., or we discard
A, i.e. derive itself to
In our case, we “use”
A by applying the rule
A -> a directly, but it’s not hard to imagine a world where we need to take extra steps like
A -> C -> D -> ... -> a ....
Likewise, suppose our input was
b instead. Then we want to apply the rule
A -> ε so
A is gone, then apply
B -> b. Likewise, it’s not hard to imagine we’d have to do something like
A -> C -> D -> ... -> ε. But
A is still gone in the end.
We typically precompute a table of “all the symbol x input possibilities”, so that if we have an
A and an input
a, we know what production rule to apply next.
But that doesn’t explain how we precompute the table. The key question is, “in general, what rule should we apply to
A if we see an input
We answer this by figuring out (recursively) the set of tokens that can be first in any sequence of derivations of
A. i.e. all
a such that
A -> B ... -> C ... -> ... -> a .... We also remember what rule we need to apply to reach this derivation (e.g.
A -> a leads to
a). This is called the
How do we know when to discard
A instead? Well, if the input
a is not in
A must be “nullable”, i.e.
A -> B -> ... -> ε for there to be a valid derivation. Convince yourself why this is true.
In that case, we “hope” that
a is the the
follow(A) set, i.e. all tokens that can appear in some derivation of
A B ... such that when
A is gone,
a can be matched successfully. This is tricky, because we can have rules like
S -> A B C with input
a, which require
A -> ... -> ε and
B -> ... -> ε and
C -> ... -> a.
follow(A) is tricky but do-able. There are 2 cases:
K -> ... A stuff, in which
first(stuff) ⊆ follow(A).
K -> ... A stuff and
stuff is “nullable”, so
follow(K) ⊆ follow(A).
Note that figuring out
follow(A) requires “contextual knowledge”, i.e. figuring out rules of form
K -> ... A ..., rather than starting with
A itself (as in computing
Armed with this intuition of LL Parsing, you can probably write an algorithm for making LL parsing tables if given enough time. Next, we will explore LR parsing and co.