In my last post I gave a brief introduction to Happy and parser generators. In this post I will continue the story with Happy's counterpart Alex.
We will write an MVP parser for an Untyped Lambda Calculus. In a subsequent post we will improve the design with proper error handling and spans. Like last time, all code is available in the git repo.
Alex, like Happy, is a DSL used to construct Haskell modules. From the Alex docs:
Alex takes a description of tokens based on regular expressions and generates a Haskell module containing code for scanning text efficiently.
At its simplest, Alex takes in a set of token definitions and outputs a scanner function String -> [Token], where Token is some provided lexeme type.
Here is a complete Alex file taken from the documentation:
{
module Main (main) where
}

%wrapper "basic"

$digit = 0-9       -- digits
$alpha = [a-zA-Z]  -- alphabetic characters

tokens :-

  $white+                        ;
  "--".*                         ;
  let                            { \s -> Let }
  in                             { \s -> In }
  $digit+                        { \s -> Int (read s) }
  [\=\+\-\*\/\(\)]               { \s -> Sym (head s) }
  $alpha [$alpha $digit \_ \']*  { \s -> Var s }

{
-- Each action has type :: String -> Token

-- The token type:
data Token =
    Let
  | In
  | Sym Char
  | Var String
  | Int Int
  deriving (Eq, Show)

main = do
  s <- getContents
  print (alexScanTokens s)
}
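If we compile this file with alex and build the resulting module, the generated program scans stdin and prints the token list. Assuming the executable is named Main (a hypothetical name for this sketch), a run looks something like:

$ echo "let x = 5 in x + 2" | ./Main
[Let,Var "x",Sym '=',Int 5,In,Var "x",Sym '+',Int 2]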
The structure of an Alex file is (in order): a Haskell code block, a wrapper declaration, macro definitions, rules, and a final Haskell code block. Rules are the only required element, but in practice you will use a little bit of everything.
Wrappers are predefined bits of code that extend the functionality of Alex. For example, there is a wrapper to add position tracking and another to add monadic effects. The docs give an overview of wrappers, but it is best to just look at the source code.
In this post we will use the basic wrapper. It gives you alexScanTokens :: String -> [token], which we will use as the entry point to our lexer.
Rules are the only required element in your Alex file and are the meat and potatoes of the lexer. A rules block is declared with id :- (in our example, tokens :-) followed by a series of rule statements.
A rule statement consists of a regex pattern (or macro definition) and either a chunk of Haskell code wrapped in curly brackets or a semicolon:
$digit+ { \s -> Int (read s) }
Rules are translated into Haskell functions whose type is determined by our wrapper implementation. To be more precise, the rule function's type is the a parameter in the AlexReturn a type. Our basic wrapper implements alexScanTokens, which recursively consumes the String (or ByteString) input and constructs a list of tokens.
If we were to forgo the basic wrapper and write our own custom scanner, then we would be able to dictate the type signature of our rule functions.
In the example above, and in the arithmetic language parser from the last post, the signature was String -> Token. We can see the function definitions in our arithmetic lexer from last time.
Regex matching respects the longest match rule: when several rules could apply, the match that consumes the greatest number of characters wins. For example, with the rules above, the input lettuce lexes as Var "lettuce" rather than Let followed by Var "tuce", because the identifier match is longer.
When a match is found, an AlexReturn value is constructed. If the rule contains a chunk of Haskell code, then that chunk is returned inside the AlexToken data constructor. If the rule has a semicolon rather than a chunk of Haskell code, then an AlexReturn value with the AlexSkip data constructor is used.
The AlexReturn term is then consumed by the alexScanTokens function, in the case of the basic wrapper, or by whatever scanner function we choose to define.
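For reference, the AlexReturn type generated by Alex looks roughly like this (paraphrased from Alex's generated code; strictness annotations and minor details vary between versions):

data AlexReturn a
  = AlexEOF
  | AlexError AlexInput
  | AlexSkip  AlexInput Int
  | AlexToken AlexInput Int a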
Contexts
Rules can also have a left and right context, which are added before or after your regex respectively. Contexts allow you to match on the characters before and after your lexeme.
Left context allows you to match on the beginning of a line via the ^ character:
^ $alpha [$alpha]* { \s -> Identifier s }
This rule will only match a string of alpha characters immediately following a newline char.
Right context is more powerful and has three forms:
$ : The rule will only match if the token immediately precedes a newline char.
/ regex : This rule will only match if its regex match is immediately followed by the additional provided regex.
/ { … } : This rule applies a Haskell predicate function on the rule to determine if it matches.
The predicate function's type must be:
{ ... } :: user      -- predicate state
        -> AlexInput -- input stream before the token
        -> Int       -- length of the token
        -> AlexInput -- input stream after the token
        -> Bool      -- True <=> accept the token
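As a sketch, here is a hypothetical rule (not part of our final lexer) that uses a predicate to accept only integer literals of at most three digits; the lambda ignores the predicate state and the surrounding input and just checks the token length:

-- Hypothetical: only lex integer literals up to 3 digits long
$digit+ / { \_user _before len _after -> len <= 3 } { \s -> Int (read s) }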
Start Codes
Another powerful tool for writing Alex rules is Start Codes. They allow you to add state to your lexer. I am going to hold off discussing them until a later post, but they are really useful for things like string templating.
For a minimal Untyped Lambda Calculus using the traditional syntax we will need the following lexemes:
data Token
  = Identifier String
  | Lambda
  | Dot
  | OpenParen
  | CloseParen
Our initial lexer is very straightforward:
{
module Lexer where
}

%wrapper "basic"

$digit = 0-9
$alpha = [a-zA-Z]
$alphanum = [a-zA-Z0-9]

tokens :-

  -- Whitespace insensitive
  $white+ ;

  -- Comments
  "#".* ;

  -- Syntax
  \\  { \_ -> Lambda }
  \.  { \_ -> Dot }
  \(  { \_ -> OpenParen }
  \)  { \_ -> CloseParen }
  $alpha [$alpha $digit \_ \-]*  { \s -> Identifier s }

{
data Token
  = Identifier String
  | Lambda
  | Dot
  | OpenParen
  | CloseParen
  deriving Show

lexer :: String -> [Token]
lexer = alexScanTokens
}
alexScanTokens is supplied by the basic wrapper and we bind it to lexer for convenience.
There isn't a whole lot to say about this lexer. It does pretty much what you expect:
> lexer "\\x. y x"
[Lambda,Identifier "x",Dot,Identifier "y",Identifier "x"]
> lexer "\\x. y x # this is a comment"
[Lambda,Identifier "x",Dot,Identifier "y",Identifier "x"]
Note: we had to escape the \ because we are in a Haskell repl.
The main issues are that it provides no safe error handling and no spans:
> lexer "[\\x.x]"
*** Exception: lexical error
CallStack (from HasCallStack):
error, called at templates/wrappers.hs:336:32 in main:Lexer
If you look at the source for alexScanTokens, you can see that it simply calls error for any lex errors.
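The loop in question looks roughly like this (paraphrased from Alex's templates/wrappers.hs; exact details vary between Alex versions):

alexScanTokens :: String -> [token]
alexScanTokens str = go ('\n', [], str)
  where
    -- The basic wrapper's AlexInput is (previous char, pending bytes, remaining input)
    go inp@(_, _bs, s) =
      case alexScan inp 0 of
        AlexEOF                -> []
        AlexError _            -> error "lexical error"  -- no position, no recovery
        AlexSkip  inp' _len    -> go inp'
        AlexToken inp' len act -> act (take len s) : go inp'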
We will fix these shortcomings soon, but let's move on to the parser for now.
Now that we have a String -> [Token] lexer function, we can implement our Happy parser. In my last post I described the basic syntax of a Happy file. Here I will include a complete Happy file and go over some of the specific syntax not mentioned in the last post.
{
module Parser where

import qualified Lexer as L
}

%name parser expr
%tokentype { L.Token }
%error { parseError }

%token

ident  { L.Identifier $$ }
lambda { L.Lambda }
'.'    { L.Dot }
'('    { L.OpenParen }
')'    { L.CloseParen }
%%
expr: lambda ident '.' ap { Abs $2 $4 }
| ap { $1 }
ap: ap atom { Ap $1 $2 }
| atom { $1 }
atom: '(' expr ')' { $2 }
| ident { Var $1 }
{
parseError :: [L.Token] -> a
parseError []      = error "ParseError: Empty token stream."
parseError (tok:_) = error $ "ParseError: Unexpected token '" <> show tok <> "'."

data Term =
    Var String
  | Abs String Term
  | Ap Term Term
  deriving Show
}
Lines beginning with % are known as directives. These are input to the Happy parser generator. In this file we use:
%name : Declares the name of the final parser function and the initial production rule to kick off the parsing process. You can actually have multiple name directives, yielding several different parsers from a single grammar (see the sketch after this list).

%tokentype : Declares the type of the lexeme token used as input to the parser.

%error : The error handling function. This function is applied to the remaining token stream when a parse error is encountered.

%token : This kicks off the terminal symbols section of our grammar. Lexemes matched in curly brackets are replaced by the identifiers on their left in the following production rules. You can think of this section as a case statement on the lexeme token type.

%% : This kicks off the production rules section of the grammar. Production rules are described in my previous post, so I will omit them here.

In addition to our directives, the Happy file can be bookended with Haskell code like the Alex file.
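As a quick sketch of the multiple-%name trick mentioned above (hypothetical directives, not part of our final parser), you could expose entry points for several non-terminals and Happy would generate a parser function for each:

%name parseExpr expr
%name parseAtom atom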
We can now execute our parser in the repl and build an AST:
> parser $ lexer "\\x. y x"
Abs "x" (Ap (Var "y") (Var "x"))
> parser $ lexer "\\x. y x # this is a comment"
Abs "x" (Ap (Var "y") (Var "x"))
Here we have built a minimally complete lexer and parser for Untyped Lambda Calculus. The implementation is available here along with all other lexers and parsers for this series.
In my next post we will rewrite this parser without the basic wrapper while adding spans and error handling. Then in a subsequent post I will move on to a more complex language requiring monadic state.