Welcome! If you are reading this, you likely have decided to take on the journey of learning parser combinators. We hope this article will make your adventure smoother and hopefully give you a strong foundation for writing your grammars.
This article is composed of three major parts. In the first part, we will implement a small parser combinator library from scratch, which should hopefully help to give a feeling of how industrial-strength parsing combinators work. In the second part, we will learn how to use the Megaparsec library to implement a parser for S-expressions. Finally, as a bonus, we will use the power of Template Haskell to implement a quasi-quoter for our parser.
What is a parser combinator and why use them?
In 2001, Daan Leijen and Erik Meijer published a paper titled Parsec: Direct Style Monadic Parser Combinators For The Real World, describing the parsec
library, whose design consequently influenced various others, such as megaparsec
, attoparsec
, trifecta
, and even libraries outside the Haskell ecosystem, such as NimbleParsec
for Elixir, parsec
for Python, FParsec
for F#, among others.
Parser combinators are known to be simple to use without requiring external tools or too many concepts to learn. That is, they are ordinary Haskell constructors that can easily be combined and returned with other parsers because of their nature as functions. This, in turn, makes them idiomatic to use, and being a popular choice among Haskellers, their ecosystem is pretty developed. You should have no trouble finding tutorials and documentation on how to use them.
It’s important to notice that parser combinators also have their flaws. Parser combinators may be implemented as either LL(1) or LL(∞) parsers, which, elaborating on the points described in Kirill Andreev’s How to Implement an LR(1) Parser:
- Don’t support left recursion. That is to say, if there is a rule
foo = foo *> bar
, you might need to refactor your grammar otherwise it will loop forever. - Don’t resolve conflicts, which we will see as an example later in this article. This means the ordering of rules can influence the output, and you need to be careful when defining rules that can consume the same input.
- In an LL(1) implementation, it will pick the first alternative that maches the input.
- In an LL(∞) implementation, it will try the longest possible match, which may be very inneficient.
- Many parser combinator libraries will backtrack, which will also be exemplified later, meaning that you might need to be careful to avoid performance penalties.
- The backtracking mechanism may, in turn, lead to exponential time complexities.
This shouldn’t, however, discourage you from their use, as their benefits outweigh the drawbacks. Ultimately, the parsing library of your choice should be chosen based on your needs, and parser combinators are a good fit for most use cases.
Implementing a simple parser combinator from scratch
Before we delve into megaparsec
, it may be interesting to implement our parser combinator from scratch. We will make a simple one that should hopefully help you understand how parser combinators work.
If you instead prefer to jump right into the action, then skip to the Megaparsec tutorial.
Implementation
We will begin by importing the appropriate definitions in a new file called Parser.hs
:
module Parser where
import Control.Applicative (Alternative (..))
import Data.List (nub)
We will create two data structures that represent our parser.
The first of them represents some error that may happen while parsing. For now, let’s have three errors: one that is raised when we are expecting input but there is nothing to consume (EndOfInput
), one when we find an unexpected character (Unexpected
), and one for any error messages that the user might want to raise (CustomError
).
Our second structure is the actual parser. It’s represented as a function that takes some input and either returns a list of errors, if Left
, or the result of parsing the input together with the rest of the input (that was not parsed), if Right
.
data Error i e
= EndOfInput -- Expected more input, but there is nothing
| Unexpected i -- We didn't expect to find this element
| CustomError e -- Extra errors the user may want to create
| Empty -- Used in `Alternative` implementation of `empty`
deriving (Eq, Show)
newtype Parser i e a = Parser
{ runParser :: [i] -> Either [Error i e] (a, [i])
}
This is the meaning of each of the type variables:
i
: The input stream. For most cases, we will useChar
, as[Char]
is the same asString
. It’s interesting to notice that keeping this as a type variable allows us to parse things that are not strings as well.e
: The type of custom error messages. If we don’t have those, we may useVoid
instead.a
: The result of our parsing function. It represents the structure parsed from our consumed input.
Let’s begin by creating the most primitive function: satisfy
. It tests the current character with a predicate, and if it succeeds, it will advance the parser and return the consumed character.
We must take care that we are not at the end of the input, however, in which case we should fail.
satisfy :: (i -> Bool) -> Parser i e i
satisfy predicate = Parser $ \input ->
case input of
[] -> Left [EndOfInput]
hd : rest
| predicate hd -> Right (hd, rest)
| otherwise -> Left [Unexpected hd]
It’s also common to define a function that only succeeds if the input strictly matches some given element.
char :: Eq i => i -> Parser i e i
char i = satisfy (== i)
If you load this in GHCi, you should now be able to test it:
>>> :l Parser
>>> runParser (char 'h') "hello"
Right ('h', "ello")
>>> runParser (char 'h') "greetings"
Left [Unexpected 'g']
>>> runParser (char 'h') ""
Left [EndOfInput]
Let’s now create Monad
and Alternative
instances for our parser. For educative purposes, we will also explicitly implement Functor
and Applicative
instead of fmap = liftM
and (<*>) = ap
.
We start with Functor
. Since we map over the a
type variable obtained in the case of a successful parse, we can only apply f
in case of Right
.
instance Functor (Parser i e) where
fmap f (Parser p) = Parser $ \input ->
case p input of
Left err -> Left err
Right (output, rest) -> Right (f output, rest)
Since Either
is a monad, we can use do-notation. Here’s an implementation that’s a little shorter:
instance Functor (Parser i e) where
fmap f (Parser p) = Parser $ \input -> do
(output, rest) <- p input
pure (f output, rest)
Next up is Applicative
. A good intuition behind Applicative
s is that pure
represents an identity element, while <*>
distributes one argument over the other. Think of this like a multiplication, where 0 * a = a * 0 = 0
, while 1 * a = a * 1 = a
.
To illustrate this, observe this table:
Left parser failed | Left parser succeeded | |
---|---|---|
Right parser failed | Error from left parser | Error from right parser |
Right parser succeeded | Error from left parser | Success |
Note that even if both parsers fail, we only want the error from the first one, as we execute the second one only if the first one succeeded.
This is due to the Either
part of the parser, where Left
is like 0
and Right
is like 1. pure
represents our 1
, so it should create a parser that always succeeds without consuming any input.
And similarly to Either
, we want <*>
to fail in case any of the operands result in a failed parse given some input. This implementation tries to run the left operand, which consumes some input and produces a function as well as the unconsumed input. The rest of the input is given to the right operand, and in case of success, it applies the function to the argument.
instance Applicative (Parser i e) where
pure a = Parser $ \input -> Right (a, input)
Parser f <*> Parser p = Parser $ \input ->
case f input of
Left err -> Left err
Right (f', rest) ->
case p rest of
Left err -> Left err
Right (output, rest') -> Right (f' output, rest')
Or with Either
’s monad instance:
instance Applicative (Parser i e) where
pure a = Parser $ \input -> Right (a, input)
Parser f <*> Parser p = Parser $ \input -> do
(f', rest) <- f input
(output, rest') <- p rest
pure (f' output, rest')
In our Monad
instance, we set return = pure
as usual. Our bind is more interesting: we try to run the given parser with the input, and if it succeeds, we give the produced output (of type a
) to our continuation k
, which produces a parser that is executed with the remainder of the output.
instance Monad (Parser i e) where
return = pure
Parser p >>= k = Parser $ \input ->
case p input of
Left err -> Left err
Right (output, rest) ->
let
Parser p' = k output
in
p' rest
Again, using the fact that Either
is a monad, we can simplify this implementation:
instance Monad (Parser i e) where
return = pure
Parser p >>= k = Parser $ \input -> do
(output, rest) <- p input
runParser (k output) rest
With our Applicative
and Monad
instances, we are now able to parse sequentially. For example, we defined char
above to let us parse a single character, but what if we wanted to parse many of them?
string :: Eq i => [i] -> Parser i e [i]
string = traverse char
If traverse
is new to you, it’s also equivalent to these definitions:
string :: Eq i => [i] -> Parser i e [i]
string [] = pure []
string (x : xs) = (:) <$> char x <*> string xs
Or even in terms of Monad
:
string [] = return []
string (x : xs) = do
y <- char x
ys <- string xs
return (y : ys)
Let’s try it in GHCi:
>>> runParser (string "Haskell") "Haskell"
Right ("Haskell","")
>>> runParser (string "Haskell") "Halloween"
Left [Unexpected 'l']
Finally, our last instance: Alternative
. This may be new for you, so I will explain. First, take a look at its definition, which can be found in Control.Applicative
:
class Applicative f => Alternative f where
empty :: f a
(<|>) :: f a -> f a -> f a
I mentioned that pure
acts as 1
, which is a parser that always succeeds without consuming input. empty
works like the converse (or 0
), that is a parser that always fails without consuming input.
For (<|>)
, we want to implement it such that the first parser to succeed is used. That is, we first try the left operand, and failing that, we try the second one. The table below shall guide you:
Left parser failed | Left parser succeeded | |
---|---|---|
Right parser failed | Errors from both parsers | Left parser |
Right parser succeeded | Right parser | Left parser |
Here is an implementation of Alternative
. We constrain i
and e
with Eq
so we can use nub
to remove duplicated errors. Note that this is not the best or most efficient way to merge errors, but for our purposes, it’s good enough.
instance (Eq i, Eq e) => Alternative (Parser i e) where
empty = Parser $ \_ -> Left [Empty]
Parser l <|> Parser r = Parser $ \input ->
case l input of
Left err ->
case r input of
Left err' -> Left $ nub $ err <> err'
Right (output, rest) -> Right (output, rest)
Right (output, rest) -> Right (output, rest)
Finally, let’s test it!
>>> runParser (string "hello" <|> string "greetings") "hello, world"
Right ("hello",", world")
>>> runParser (string "hello" <|> string "greetings") "greetings, world"
Right ("greetings",", world")
>>> runParser (string "hello" <|> string "greetings") "bye, world"
Left [Unexpected 'b']
>>> runParser (empty <|> pure ()) ""
Right ()
>>> runParser (pure () <|> empty) ""
Right ()
We can see our errors composing with a more complicated expression:
>>> runParser ((string "hello" *> string ", globe") <|> string "greetings") "hello, world"
Left [Unexpected 'w',Unexpected 'h']
Hopefully, this should give you an intuition of how parser combinators work. Popular parser combinator libraries are more complex than this, as they tend to keep track of extra state for better error messages, optimization, information about line and column, etc., but the overall operations between them are similar.
The complete code for this section with solutions to the exercises below can be found here.
Exercises
- Change
char
so that it returns errors such asLeft [Expected 'h' 'g']
(“Expected ‘h’, but got ‘g’”).
Solution
Begin by creating a new error:data Error i e
= ...
| Expected i i
Next, we will refactor the code originally in satisfy
into token
, just so we avoid duplicating code. Then, rewrite satisfy
to use this new function, and rewrite char
to make use of it as well.
The important part is simply that we change the error message.
token :: (i -> Error i e) -> (i -> Bool) -> Parser i e i
token mkErr predicate = Parser $ \input ->
case input of
[] -> Left [EndOfInput]
hd : rest
| predicate hd -> Right (hd, rest)
| otherwise -> Left [mkErr hd]
satisfy :: (i -> Bool) -> Parser i e i
satisfy = token Unexpected
char :: Eq i => i -> Parser i e i
char i = token (Expected i) (== i)
In GHCi you should see:
>>> parse (char 'h') "bye"
Left [Expected 'h' 'b']
- Create an
eof :: Parser i e ()
function which succeeds only if we have arrived at the end of the input.runParser (string "hello" <* eof) "hello"
should returnRight ("hello", "")
runParser (strung "hello, world" <* eof) ""
should returnLeft [ExpectedEndOfFile ',']
(“Expected EOF, but got ‘,’”)
Solution
First, create a new error:data Error i e
= ...
| ExpectedEndOfFile i
Now create a function that returns Right
only if there is no more input, and an error otherwise.
eof :: Parser i e ()
eof = Parser $ \input ->
case input of
[] -> Right ((), [])
hd : _ -> Left [ExpectedEndOfFile hd]
Trying it out in GHCi:
>>> parse (string "bye" <* eof) "bye"
Right ("bye","")
>>> parse (string "bye" <* eof) "byebye"
Left [ExpectedEndOfFile 'b']
- Change the definition of
Parser
so it contains a[i] -> Offset -> (Either [Error i e] (Offset, a, [i]))
, wheretype Offset = Int
. The offset indicates how many positions we have parsed.- You may also change
Error
to store an offset so that you can indicate where an error occurred. runParser (string "hey") 0 "hi"
should returnLeft [Error {erOffset = 1, erError = Expected 'e' 'i'}]
(“Unexpected ‘e’, but got ‘i’, at position 1”)
- You may also change
Solution
The solution is mostly mechanically replacing definitions where errors occurred.First, we rename Error
to ErrorType
, and create a new Error
type:
data Error i e = Error
{ erOffset :: Offset
, erError :: ErrorType i e
} deriving (Eq, Show)
data ErrorType i e
= ...
If you’ve done the previous exercises, then token
may be changed now to take offset
as input to the parser’s function, and return offset + 1
on success. If an error occurred instead, we now construct Error
with the given input. Our mkErr
function should not take an ErrorType
instead of Error
.
token :: (i -> ErrorType i e) -> (i -> Bool) -> Parser i e i
token mkErr predicate = Parser $ \input offset ->
case input of
[] -> Left [Error offset EndOfInput]
hd : rest
| predicate hd -> Right (offset + 1, hd, rest)
| otherwise -> Left [Error offset $ mkErr hd]
If you haven’t done the previous exercises, then change satisfy
in a similar manner, taking offset
as input and returning offset + 1
, and error Error offset $ Unexpected hd
on the error.
For the remainder of the functions, change the parser function to also take offset
as an argument, and just return it unaffected. Take care to chain these offsets as required. For example, this is the new definition of Applicative
:
instance Applicative (Parser i e) where
pure a = Parser $ \input offset -> Right (offset, a, input)
Parser f <*> Parser p = Parser $ \input offset ->
case f input offset of
Left err -> Left err
Right (offset', f', rest) ->
case p rest offset' of
Left err -> Left err
Right (offset'', output, rest') -> Right (offset'', f' output, rest')
Finally, in parse
, use 0
as the initial offset.
Keep in mind it’s also possible to operate with line and column when you know you have a function operating in characters.
>>> parse (string "hey") "hi"
Left [Error {erOffset = 1, erError = Expected 'e' 'i'}]
The reader may also choose to instead refactor the parser type to use a custom-defined State
type that contains the input stream and offset. Many improvements are possible, but we choose to keep it simple for educational purposes.
What’s next?
In this section, we’ve learned how to make a simple parsing combinator library. There are still various improvements that can be done. To name a few of them, we can implement more instances, such as Semigroup
and Monoid
, IsString
, MonadFail
, MonadPlus
, among others. Error messages can be improved by better keeping track of location (such as line and column), supporting more types such as Text
and ByteString
, pretty-printing error messages, etc.
We will conclude this part here, and if you are interested, proceed to the next part where Megaparsec, an industrial-strength parser combinator library, will be introduced with a more practical application than our simple examples in this section.
Megaparsec tutorial
We will now take a look at a practical application of parser combinators: parsing S-expressions. Hopefully, this example will give you a solid foundation for creating parsers for your grammars.
Megaparsec is the go-to parser for industrial projects, so we will also use it in our tutorial. If you wish to see how it compares to other parser combinator libraries in Haskell, you can read its README.md
.
Keep in mind you should choose the parser library that better suits your needs. If you plan to parse machine-written data, for example, then Attoparsec may be a better choice. Its interface is similar to Megaparsec’s, so the knowledge you will learn here will also be useful.
Without any further ado, we will now present how to use Megaparsec to parse some S-expressions.
Representing S-expressions
To begin, create a file called SExp.hs
. Also, make sure to get the megaparsec
package.
We will use the following data structure to represent various S-expressions. Gradually, we will write a parser that represents each of these cases.
module SExp where
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
newtype Identifier = Identifier
{ getId :: String
} deriving (Show)
data SExp
= SSExp SExp [SExp] -- (foo 0 "hello" bar), (bar (baz 1)), (foo)
| SInteger Integer -- 42
| SString String -- "hello, world"
| SBool Bool -- false, true
| SId Identifier -- foo
deriving (Show)
We also need an alias for the type which megaparsec
uses.
type Parser = Parsec
-- The type for custom error messages. We have none, so use `Void`.
Void
-- The input stream type. Let's use `String` for now, but for
-- better performance, you might want to use `Text` or `ByteString`.
String
Booleans
Let’s start with the simplest of the parsers: SBool
. We use <|>
which tries one parser, and failing that, will try the other one.
bool :: Parser Bool
bool = False <$ string "false" <|> True <$ string "true"
We will also make a parser to represent our SExp
-specific variants. As we progress, we will add more parsers to this list.
atom :: Parser SExp
atom = choice
[ SBool <$> bool
]
Here we use choice
, which is the same as running <|>
between each element of the list. choice [a, b, c] = a <|> b <|> c
.
It’s important to notice that choice
, as well as many other useful combinators, are part of the parser-combinators
package, which Megaparsec conveniently re-exports for us. Should you fail to find a specific function in Megaparsec’s documentation, chances are that it’s defined in parser-combinators
instead.
Now, fire up GHCi and test that we can parse it! megaparsec
provides the parseTest
function which allows us to conveniently test our parsers.
>>> parseTest atom "false"
SBool False
>>> parseTest atom "hey"
1:1:
|
1 | hey
| ^^^
unexpected "hey"
expecting "false" or "true"
Integers
To parse integers, one strategy is to parse as many digits (chars whose value is between '0'
and '9'
) as possible. This will give us a list of characters, which we can then read
.
For this parser, we use the some
function, which tries to run a given parser 1 or more times. numberChar
is exported from Text.Megaparsec.Char
and parses any character that matches a digit. In regex, this is equivalent to [0-9]+
. This will return us a [Char]
(aka String
), which we can read
to make it into an Integer
.
integer :: Parser Integer
integer = read <$> some numberChar
We now change our atom
to also include integers.
atom :: Parser SExp
atom = choice
[ SBool <$> bool
, SInteger <$> integer
]
Let’s also test it in GHCi:
>>> parseTest atom "42"
SInteger 42
>>> parseTest atom "hey"
1:1:
|
1 | hey
| ^^^
unexpected "hey"
expecting "false", "true", or numeric character
Although the error message is good, wouldn’t it better if it said “integer” instead of “numeric character”? For this, we can use labels, which allows us to give a name to a parser. To do this, we can simply change our parser like this:
integer :: Parser Integer
integer = label "integer" $ read <$> some numberChar
If you prefer, it’s also possible to use <?>
, which does the same as label
.
integer :: Parser Integer
integer = read <$> some numberChar <?> "integer"
And now the error message should be slightly improved:
>>> parseTest atom "hey"
1:1:
|
1 | hey
| ^^^
unexpected "hey"
expecting "false", "true" or integer
You can also use label "boolean"
in our bool
parser to change the message to expecting boolean or integer
.
Exercise: Can you change this parser so it also accepts negative integers?
Solution
One possible solution using `<|>`:integer :: Parser Integer
integer = label "integer" $ read <$> (some numberChar <|> ((:) <$> char '-' <*> some numberChar))
However, later in the tutorial, we will see an even better way that uses Megaparsec’s own mechanisms.
Strings
To parse a string, the idea is to find everything that is between double-quotes. This approach has a weakness, though: we won’t be able to parse escaped quotes. Later in this tutorial, we will present a solution to escaped characters using Megaparsec’s lexing mechanisms.
To achieve this, we will use two new functions: between
and takeWhileP
. Using between open close x
is equivalent to open *> x <* close
, while takeWhileP
will read as many characters as possible that satisfy some predicate.
Note that takeWhileP
also accepts, optionally, a label for each character that it tries to read. For now, we will leave it as Nothing
.
str :: Parser String
str = label "string" $ between (char '"') (char '"') (takeWhileP Nothing (/= '"'))
-- Or:
--- str = label "string" $ char '"' *> takeWhileP Nothing (/= '"') <* char '"'
In atom
, include SString <$> str
.
>>> parseTest atom "\"hey\""
SString "hey"
Identifiers
An identifier will be like a variable in a programming language. We can choose any naming convention for such a thing, but here I chose the one that is frequently used in C-like languages: the first letter must be a letter or an underscore, while the remainder may be a letter, digit, or underscore.
Here we use the many
function, which is very similar to the some
function discussed before. The difference is that many
tries to run the given parser 0 or more times. In regex, our identifier
parser is equivalent to [a-zA-Z_][a-zA-Z0-9_]*
.
identifier :: Parser Identifier
identifier = label "identifier" $ do
first <- letterChar <|> char '_'
rest <- many $ alphaNumChar <|> char '_'
pure $ Identifier $ first : rest
-- Or:
-- identifier = Identifier <$> label "identifier" ((:) <$> (letterChar <|> char '_') <*> many (alphaNumChar <|> char '_'))
Now include SId <$> identifier
to atom
.
Warning: When parsing with <|>
or choice
, the order of the parsers matters. Take this as an example:
>>> parseTest (SBool <$> bool <|> SId <$> identifier) "true"
SBool True
>>> parseTest (SId <$> identifier <|> SBool <$> bool) "true"
SId (Identifier {getId = "true"})
In the first case, we parse a boolean before an identifier, which causes true
to be matched by bool
. However, in the second example, using identifier
before bool
caused true
to be recognized as an identifier. You should be careful while considering such things. In most cases, you can assume that parsers are greedy. Notice that in the second case, the bool
parser will never be executed.
In other words, make sure that you parse a boolean in atom
before you parse an identifier.
S-expressions
Finally, let’s move on to our last parser, which parses S-expressions themselves, such as (foo 42 "hey")
or (bar)
.
First, we will do it the naïve way, which may come as the first natural solution to someone just learning Megaparsec, and then we will do it in a better manner using extra tools offered by Megaparsec.
S-expressions: the awkward way
The idea here is to parse an atom, followed by 0 or more atoms. Of course, this means that we want to use many
to parse such atoms. Additionally, an S-expression may be surrounded by parenthesis, so we can use between
for such a task.
This parser, however, is not as simple as before, but let’s first do it the wrong way (which won’t work properly), analyze the problem, and come up with a solution to it.
sexp :: Parser (SExp, [SExp])
sexp = label "S-expression" $ between (char '(') (char ')') ((,) <$> atom <*> many atom)
-- Or:
-- sexp = label "S-expression" $ char '(' *> ((,) <$> atom <*> many atom) <* char ')'
Note that (,) <$> atom <*> many atom
is pretty similar to some atom
. parser-combinators
exports a non-empty some
that you can use instead, and have the parser return Parser (NonEmpty SExp)
. Megaparsec exports the ordinary list version instead, so if you want to use this one, you’ll need to import it manually and include a dependency on parser-combinators
.
After adding it to atom
with uncurry SSExp <$> sexp
, we can test it in GHCi:
>>> parseTest atom "(foo)"
SSExp (SId (Identifier {getId = "foo"})) []
Ok, looks good so far. What if we try to apply an argument to foo
?
>>> parseTest atom "(foo bar)"
1:5:
|
1 | (foo bar)
| ^
unexpected space
expecting ')', '_', S-expression, alphanumeric character, boolean, identifier, integer, or string
Uh-oh. We can’t parse spaces, as the error makes evident. Either we close parenthesis (stop parsing this S-expression), continue the definition of the identifier (_
or alphanumeric character
), put another atom (boolean, identifier, integer, or string
), or another S-expression (by opening parenthesis).
One solution that could come to one’s mind is to use the function sepBy
, which takes a parser and a separator, and use it as atom `sepBy` space
. In this case, space
is a parser defined by Megaparsec that skips zero or more spaces.
>>> parseTest (between (char '(') (char ')') $ SSExp <$> atom <*> (space *> atom `sepBy` space)) "(foo bar)"
SSExp (SId (Identifier {getId = "foo"})) [SId (Identifier {getId = "bar"})]
Of course, this may raise other questions: what if we have a space character before or after the opening and closing parenthesis?
While explicitly skipping spaces works, the megaparsec
documentation already offers a solution for this kind of problem, namely, a lexer. Indeed, we will see that some of our previous definitions could have been written more elegantly with the help of this module.
The important bit for us now is this:
Parsing of white space is an important part of any parser. We propose a convention where every lexeme parser assumes no spaces before the lexeme and consumes all spaces after the lexeme; this is what the
lexeme
combinator does, and so it’s enough to wrap everylexeme
parser with lexeme to achieve this. Note that you’ll need to callspace
manually to consume any white space before the first lexeme (i.e. at the beginning of the file).
S-expressions: the elegant way
The first thing you should do is import the appropriate lexing module. Megaparsec recommends to import this module qualified.
import qualified Text.Megaparsec.Char.Lexer as L
Now we need to define what is whitespace. Until now, with the usage of space
, we have considered that it may be a space, tab, newline, carriage return, form feed, or vertical tab. Megaparsec can go beyond that and also consider line comments (such as -- comment
) and block comments (such as {- comment -}
) as whitespace. Since we get those for free, what about using ;;
for line comments and /* */
for block comments?
Begin by defining skipSpace
. We use the space
helper function from the lexer module, which expects a function to parse 1 or more spaces, a line comment, and a block comment.
skipSpace :: Parser ()
skipSpace = L.space
-- Like `space`, but skips 1 or more space characters.
space1
-- Skip from ;; until a newline.
(L.skipLineComment ";;")
-- Skip from /* until */. There is also `skipBlockComment`, but it doesn't handle nested comments.
(L.skipBlockCommentNested "/*" "*/")
And now define a lexeme
parser, which indicates how to consume space after the given parser.
lexeme :: Parser a -> Parser a
lexeme = L.lexeme skipSpace
Now you can use lexeme
in each of the parsers, and they will automatically skip whitespace after their definitions. Note that we also use lexeme
before our first opening parenthesis, so we can skip space just after the (
.
sexp :: Parser (SExp, [SExp])
sexp = label "S-expression" $ lexeme $
between (lexeme (char '(')) (char ')') ((,) <$> atom <*> many atom)
You may add lexeme
to other parsers where it’s desirable to skip whitespace after the rule. For instance:
integer :: Parser Integer
integer = label "integer" $ lexeme $ read <$> some numberChar
Note: The reader should do the same for str
, bool
, and identifier
. Alternatively, adding lexeme
to atom
will also work.
And now we can parse S-expressions properly:
>>> parseTest atom "( foo /* 8-) */ bar ;; look mom, comments! \n )"
SSExp (SId (Identifier {getId = "foo"})) [SId (Identifier {getId = "bar"})]
Our main parser function
Finally, let’s make a function that solves the last two outstanding problems.
For the first one, as mentioned by the Megaparsed documentation, lexeme
will skip whitespace after a parser, but we also need to deal with whitespace at the beginning of the file.
For the second one, we want to parse everything until we find an eof
since our parser will currently ignore anything after a complete atom. The two defects are exemplified below:
>>> parseTest atom " 42"
1:1:
|
1 | 42
| ^^^^
unexpected " 42"
expecting S-expression, boolean, identifier, integer, or string
>>> parseTest atom "42 foo bar"
SInteger 42
We will create a function that takes our input string, and either returns an error message or the actual result. Such function “bootstraps” our parser with between skipSpace eof atom
, which will solve the two defects. It will also serve as an auxiliary function that can be called by client code, rather than parseTest
which is more useful for working in GHCi.
parseSExp :: String -> Either String SExp
parseSExp input =
let
outputE = parse
-- Skip any whitespace at the beginning, expect the end of input after the atom.
(between skipSpace eof atom)
-- Name of the source file, you can give it any name you want. I leave it blank for now.
""
-- The actual string to be parsed
input
in
-- If we get Left, it will be an `ParseErrorBundle`, let's pretty print it for now.
-- If we get Right, do nothing.
case outputE of
Left err -> Left $ errorBundlePretty err
Right output -> Right output
Now in GHCi:
>>> case parseSExp " 42" of { Left e -> putStrLn e; Right r -> print r }
SInteger 42
>>> case parseSExp "42 foo" of { Left e -> putStrLn e; Right r -> print r }
1:4:
|
1 | 42 foo
| ^
unexpected 'f'
expecting end of input
One more case: Double
Double
There is one more caveat that Megaparsec users should know about: backtracking. We show a somewhat contrived but realistic example. To illustrate this, let’s add one more case to our SExp
:
data SExp
= ...
| SDouble Double -- 42f, 3.1415f
We will parse it as such: first, some digits, optionally followed by a .
and some more digits, and an f
at the end. In regex, this is equivalent to [0-9](\.[0-9])?f
. We achieve the optional part using the optional
function, which returns a Parser (Maybe a)
.
double :: Parser Double
double = label "double" $ lexeme $ read <$> do
left <- some numberChar
rightM <- optional $ do
char '.'
some numberChar
char' 'f' -- Like char, but case-insensitive
pure $ case rightM of
Nothing -> left
Just right -> left ++ "." ++ right
Now, we add it to atom
: SDouble <$> double
. Depending whether you added double
before or after integer
, the result may be surprising:
Parser \ Position | `double` before `integer` | `double` after `integer` |
parseTest atom "42"
|
1:3: | 1 | 42 | ^ unexpected end of input expecting '.', 'F', 'f', or numeric character |
SDouble 42.0
|
parseTest atom "42f"
|
SInteger 42
|
SInteger 42
|
So it only got one case correctly. To fix it, one solution would be to use try
, which allows Megaparsec to do as much backtracking as it wants.
-- Note: Other cases omitted for brevity.
atom :: Parser SExp
atom = choice
[ SDouble <$> try double
, SInteger <$> integer
]
And it should fix the problem. However, this is not ideal in terms of performance, as it will cause Megaparsec to consume the entirety of the double
to later find that there is no f
, backtrack, and finally try again with an integer
.
In such a situation this is trivial, but care must be taken in more complex examples, for example, what if we wanted to have (foo bar)
as well as tuples such as (foo bar, baz quz)
? The S-expression could be very long and lead to exponential complexity in the worst-case scenario. Recently, a very similar problem was fixed in one of our projects.
As we described in the introduction of this tutorial, it’s important to notice Megaparsec normally operates as a LL(1) parser, meaning it tries to look at only one character ahead when executing parsers in an alternative. Using try
allows us to consume an arbitrary number of characters before backtracking, allowing it to be used as a LL(∞) parser.
In this situation, the best thing to do is to refactor the grammars to share common parts. The code above could be written without try
as such:
numeric :: Parser SExp
numeric = label "number" $ lexeme $ do
-- Left side before . or f
left <- some numberChar
-- .0f or f
rightM <- optional $ choice
[ do
char '.'
right <- some numberChar
char' 'f'
pure $ left ++ "." ++ right
, do
char' 'f'
pure left
]
-- If we had the right side, it's a double, otherwise an integer.
pure $ case rightM of
Nothing -> SInteger $ read left
Just right -> SDouble $ read right
And in atom
, we simply use numeric
instead of the two previous parsers.
Rewriting with Megaparsec’s lexer
A lot of the things could be written using the lexer module from Megaparsec. In this section, we will present the simplified versions for functions that can be improved.
Because we will use floatingOrInteger
, you should include the scientific
package and import it:
import Data.Scientific
And here is our rewritten parser. Note that our new numeric
has a quirk: 1.0
is parsed as an integer by Megaparsec. A resolution to this is left as an exercise to the reader.
-- signed takes a parser on how to skip space after the sign.
integer :: Parser Integer
integer = label "integer" $ lexeme $ L.signed skipSpace L.decimal
double :: Parser Double
double = label "double" $ lexeme $ L.signed skipSpace L.float <* char' 'f'
-- charLiteral also takes escaped characters into question.
str :: Parser String
str = label "string" $ lexeme $ char '"' *> manyTill L.charLiteral (char '"')
numeric :: Parser SExp
numeric = lexeme $ do
value <- L.signed skipSpace L.scientific
case floatingOrInteger value of
Left d -> SDouble d <$ char' 'f'
Right i -> do
f <- optional $ char' 'f'
pure $ case f of
-- Note: 1.0 is an integer, but 1.1 is a parser error.
Nothing -> SInteger i
Just _ -> SDouble $ fromIntegral i
Note the changes to str
as well. manyTill
will run L.charLiteral
taking characters, including escaped characters until the matching "
is found.
As a bonus, we got various improvements:
- No need to
read
values. - We can use
signed
to parse negative numbers. - We can parse escaped characters in strings.
The functions for integer
and double
were kept for reference, even though numeric
is used instead.
Now let’s check it in GHCi:
>>> parseTest atom "(foo -42 \"with \\\" now!\")"
SSExp (SId (Identifier {getId = "foo"})) [SInteger (-42),SString "with \" now!"]
And I hope that with this you can better understand how parser combinators work. Now get to hacking your languages!
The complete code for this part can be found here.
Bonus: quasi-quotations
This section is better understood if you have some Template Haskell (TH) knowledge. If you haven’t learned about it, then I strongly encourage you to check out our Introduction to Template Haskell before trying out this section.
Note that this section is not necessary for our knowledge of parser combinators, and it’s purely a bonus section on how to do more with our S-expressions.
For this part, you will need the template-haskell
package.
Long story short, Template Haskell is a mechanism that allows GHC to evaluate and analyze code during compile-time. One such application is using QuasiQuoter
s, which allows us to run custom parsers during compile time. Here we will use them for two purposes: to use S-expressions as Haskell expressions and to use S-expressions as Haskell patterns, both of which can be parsed during compilation.
We will create a new file called TH.hs
, whose definition is given below.
module TH where
import Control.Monad ((<=<))
import Language.Haskell.TH.Quote (QuasiQuoter (..))
import Language.Haskell.TH.Syntax (Q, dataToPatQ, liftData)
import SExp (SExp, parseSExp)
sexp :: QuasiQuoter
sexp = QuasiQuoter
{ quoteExp = liftData <=< parseOrThrow
, quotePat = dataToPatQ (const Nothing) <=< parseOrThrow
, quoteType = const $ fail "No S-expression parser for type"
, quoteDec = const $ fail "No S-expression parser for declarations"
}
where
parseOrThrow :: String -> Q SExp
parseOrThrow = either fail pure . parseSExp
Additionally, liftData
and dataToPatQ
assume that our parsed SExp
implement the Data
type class. We will change our SExp
module to automatically derive this datatype for us. First, enable the DeriveDataTypeable
language extension, so GHC can automatically implement Data
for us.
You also need to import Data
from Data.Data
.
{-# LANGUAGE DeriveDataTypeable #-}
module SExp where
import Data.Data (Data)
-- The rest of the file goes here...
And now change Identifier
and SExp
to derive Data
:
newtype Identifier = Identifier
{ getId :: String
} deriving (Data, Show)
data SExp
= SSExp SExp [SExp] -- (foo 0 "hello" bar), (bar (baz 1)), (foo)
| SInteger Integer -- 42
| SString String -- "hello, world"
| SBool Bool -- false, true
| SId Identifier -- foo
| SDouble Double -- 42f, 3.1415f
deriving (Data, Show)
The basic contents of the TH
module are summarized as follows:
- Each field of
QuasiQuoter
provides aString
parser for a different Haskell syntax. quoteExp
creates a Haskell expression.- This will allow us to write expressions such as
let mySExp = [sexp|(foo 42)|]
. - We use our standard Megaparsec parser,
parseSExp
, but fail if it can’t parse (halting compilation).liftData
then transforms ourSExp
into anExp
. - If this is your first time seeing
<=<
, it’s just another way to compose monadic functions.g <=< f
is the same as\x -> g =<< f x
. - If you’ve read our previous TH tutorial, it creates an
Exp
just like[e||]
.
- This will allow us to write expressions such as
quotePat
creates a Haskell pattern.- Example:
case mySExp of {[sexp|(foo 42)|] -> True; _ -> False}
. - Similarly to
quoteExp
, we usedataToPatQ (const Nothing)
to transform ourSExp
into aPat
.- Confusingly,
liftData
is defined asdataToExpQ (const Nothing)
. - The
const Nothing
can be replaced to handle type-specific cases. We are fine with the default behavior for now but we’ll change it later.
- Confusingly,
- Creates a
Pat
, like[p||]
.
- Example:
quoteType
creates a Haskell type.- Since we have no
SExp
,SInteger
etc types matching our constructors, we will fail when someone tries to use our parser as types. - Creates a
Type
, like[t||]
.
- Since we have no
quoteDec
creates a Haskell declaration.- Similarly to
quoteType
, we won’t define this one. - Creates a
Dec
, like[d||]
.
- Similarly to
Now enable the QuasiQuotes
extension in GHCi and you should be able to test it. Notice how failing to parse instead causes GHCi to fail to interpret the given S-expression (compilation error).
>>> :set -XQuasiQuotes
>>> [sexp|(foo "bar" 42)|]
SSExp (SId (Identifier {getId = "foo"})) [SString "bar",SInteger 42]
>>> [sexp|no parenthesis|]
Out [19]:
<interactive>:1135:7: error:
• 1:4:
|
1 | no parenthesis
| ^
unexpected 'p'
expecting end of input
• In the quasi-quotation: [sexp|no parenthesis|]
There are still two improvements we could do, however. The first one is that Megaparsec could use the actual Haskell location:
>>> let SInteger x = [sexp|.|] in x
(omitted for brevity, error in 1:1)
As for the second one, it would be much better if we could bind S-expression variables into Haskell ones, such as the following:
foo [sexp|(foo $x)|] = Just x
foo _ = Nothing
Using Haskell locations
Let’s change our parser runner a bit so it starts with the current line and column. We will create a new auxiliary parseSExpWithPos
, which takes a SourcePos
and uses it for the initial state. This function was created from our old parseSExp
.
Additionally, we create a new parseSExp
with the default position.
We now call runParser'
instead of directly calling parse
, as it allows us to give it an initial state and build such a state with the default settings. Keep in mind that we’re getting into the internal behavior of Megaparsec here, so changes in the package can break this code.
parseSExp :: String -> Either String SExp
parseSExp = parseSExpWithPos (initialPos "")
parseSExpWithPos :: SourcePos -> String -> Either String SExp
parseSExpWithPos pos@(SourcePos file _line _column) input =
let
initState = State
{ stateInput = input
, stateOffset = 0
, statePosState = PosState
{ pstateInput = input
, pstateOffset = 0
, pstateSourcePos = pos
, pstateTabWidth = defaultTabWidth
, pstateLinePrefix = ""
}
, stateParseErrors = []
}
outputE = snd $ runParser'
-- Skip any whitespace at the beginning, expect the end of input after the atom.
(between skipSpace eof atom)
-- The input state with the string to be parsed.
initState
in
-- If we get Left, it will be a `ParseErrorBundle`, let's pretty print it.
-- If we get Right, do nothing.
case outputE of
Left err -> Left $ errorBundlePretty err
Right output -> Right output
Now we can change our TH
modules. We will make new imports:
import Language.Haskell.TH.Syntax (Loc (..), Q, dataToPatQ, liftData, location)
import Text.Megaparsec (SourcePos (..), mkPos)
import SExp (SExp, parseSExpWithPos)
And we will change parseOrThrow
as follows:
parseOrThrow :: String -> Q SExp
parseOrThrow input = do
posTH <- location
let pos = SourcePos
(loc_filename posTH)
(mkPos $ fst $ loc_start posTH)
(mkPos $ snd $ loc_start posTH)
either fail pure $ parseSExpWithPos pos input
With it, we now use location
to query where the splice occurred and put it in the format that Megaparsec
expects. Trying it in GHCi gives me:
>>> let SInteger x = [sexp|foo 42|] in x
<interactive>:1730:24: error:
• <interactive>:1730:28:
|
1730 | foo 42
|
unexpected '4'
expecting end of input
• In the quasi-quotation: [sexp|foo 42|]
For visualization, it may be better if we create a small test file:
module Test where
import TH (sexp)
testQQ = [sexp|foo 42|]
Compiling it results in:
Test.hs:5:16: error:
• Test.hs:5:20:
|
5 | foo 42
|
unexpected '4'
expecting end of input
• In the quasi-quotation: [sexp|foo 42|]
|
5 | testQQ = [sexp|foo 42|]
| ^^^^^^^^
You may freely change how errors are printed by changing errorBundlePretty
with a custom error printing function, but for this tutorial, this is good enough for us.
Binding variables in patterns
As mentioned, it would be nice if we could bring S-expression bindings into the Haskell scope. For this, let’s add a new term into our SExp
:
data SExp
-- other constructors omitted for brevity
| STH String -- $meta
We may now add a parser to extract it. Let’s reuse our identifier
for this:
th :: Parser String
th = label "TH variable" $ lexeme $ char '$' *> fmap getId identifier
Don’t forget to add it to atom
with STH <$> th
.
We are now ready to add it to our TH
module. First, we will need some new imports:
import Data.Typeable (Typeable, cast)
import Language.Haskell.TH.Lib (varP)
import Language.Haskell.TH.Syntax (Loc (..), Name, Q, dataToPatQ, liftData, location, mkName) -- no Loc and location if you've skipped the previous item
import SExp (SExp (STH), parseSExpWithPos) -- or parseSExp if you've skipped the previous item
Typeable
is a type class that allows us to try to cast
some type into another. Here, we attempt a cast into SExp
.
We will add a new function to aid us:
sexp = -- ommited for brevity
where
antiQuoteVar :: Typeable a => a -> Maybe (Q Pat)
antiQuoteVar var = case cast var of
Nothing -> Nothing
Just var' -> case var' of
STH var'' -> Just $ varP $ mkName var''
_other -> Nothing
We can now replace const Nothing
with antiQuoteVar
in our quasi-quoter:
, quotePat = dataToPatQ antiQuoteVar <=< parseOrThrow
Now check whether it works:
>>> case [sexp|(foo 42)|] of
... [sexp|(foo $answer)|] -> print answer
... _ -> putStrLn "oops"
SInteger 42
It’s also possible to give a meaning to $answer
at the expression level by having it capture a Haskell variable instead. Let’s change TH
a bit to support it:
antiQuoteVar :: Typeable a => (Name -> Q b) -> a -> Maybe (Q b)
antiQuoteVar mkVar var = case cast var of
Nothing -> Nothing
Just var' -> case var' of
STH var'' -> Just $ mkVar $ mkName var''
_other -> Nothing
And our QuasiQuoter
now has:
{ quoteExp = dataToExpQ (antiQuoteVar varE) <=< parseOrThrow
, quotePat = dataToPatQ (antiQuoteVar varP) <=< parseOrThrow
Don’t forget to import dataToExpQ
, which now replaces liftData
as well as varE
. We can now test our quasi-quoter again:
>>> let x = SInteger 42 in case [sexp|(foo $x)|] of
[sexp|(foo $answer)|] -> print answer
_ -> putStrLn "oops"
SInteger 42
And that’s it! You now support quasi-quotations.
The complete code for the parser and quasi-quoter may be found here.
Conclusion
In this post, we’ve introduced the theory and a practical application of parser combinators. We’ve discussed what are they as well as their pros and cons, implemented a simple parser combinator from scratch, learned how to write an S-expression parser with Megaparsec, and created a simple quasi-quotation to consume input at compile-time.
Some improvements can still be made and they are left as an exercise to the reader. For example, the S-expression parser operates on String
s, but Megaparsec also supports Text
and ByteString
as the input stream which should have better performance. On top of that, if the reader is familiar with type families (otherwise, check out our article on Type Families), they may try to improve our parser from scratch to also operate on more data types.
Appendix
Common parsing functions
Below we present some other useful functions and combinators you may want to use. We show their names, types, and comment about them.
Name | Type | Documentation |
---|---|---|
many |
Alternative f => f a -> f [a] |
Parse 0 or more occurrences of the given parser. |
some |
Alternative f => f a -> f [a] |
Parse 1 or more occurrences of the given parser. |
option |
Alternative m => a -> m a -> m a |
Parse 0 or 1 occurence of the given parser, with a fallback a if it fails. |
optional |
Alternative f => f a -> f (Maybe a) |
Like option , but returns a Maybe instead. |
sepBy , sepBy1 |
Alternative m => m a -> m sep -> m [a] |
Parses various occurences of the given parser separated by the given separator. Functions prefixed with 1 parse 1 or more, while the ones without parse 0 or more. |
sepEndBy , sepEndBy1 |
Alternative m => m a -> m sep -> m [a] |
Like the sepBy variants, but additionally allow the separator to appear optionally at the end of the sequence. |
endBy , endBy1 |
Alternative m => m a -> m sep -> m [a] |
Like the sepEndBy variants, but the separator at the end is not optional. |
char |
(MonadParsec e s m, Token s ~ Char) => Token s -> m (Token s) |
Parses a single character matching the given char. |
char' |
(MonadParsec e s m, Token s ~ Char) => Token s -> m (Token s) |
Like char , but case-invariant. |
string |
MonadParsec e s m => Tokens s -> m (Tokens s) |
Parses many characters matching the given string. |
string' |
(MonadParsec e s m, FoldCase (Tokens s)) => Tokens s -> m (Tokens s) |
Like string , but case-invariant. |
makeExprParser |
MonadPlus m => m a -> [[Operator m a]] -> m a |
Allows parsing generic expressions, supporting infix, prefix, and postfix operators with precedences. |
Further reading
- Parser Combinators in Elixir
- How to Implement an LR(1) Parser
- Parsing with Haskell
- Parsec: “try a <|> b” considered harmful
- Quasiquotation 101
For more Haskell tutorials, you can check out our Haskell articles or follow us on Twitter or Dev. Also, visit our service page to find out about Haskell software development we offer.