Research Programming Artificial Intelligence Interviews Other

Parser Combinators in Haskell

Article by Heitor Toledo Lassarote de Paula

December 20th, 2021

41 min read

Welcome! If you are reading this, you likely have decided to take on the journey of learning parser combinators. We hope this article will make your adventure smoother and hopefully give you a strong foundation for writing your grammars.

This article is composed of three major parts. In the first part, we will implement a small parser combinator library from scratch, which should hopefully help to give a feeling of how industrial-strength parsing combinators work. In the second part, we will learn how to use the Megaparsec library to implement a parser for S-expressions. Finally, as a bonus, we will use the power of Template Haskell to implement a quasi-quoter for our parser.

What is a parser combinator and why use them?

In 2001, Daan Leijen and Erik Meijer published a paper titled Parsec: Direct Style Monadic Parser Combinators For The Real World, describing the parsec library, whose design consequently influenced various others, such as megaparsec, attoparsec, trifecta, and even libraries outside the Haskell ecosystem, such as NimbleParsec for Elixir, parsec for Python, FParsec for F#, among others.

Parser combinators are known to be simple to use without requiring external tools or too many concepts to learn. That is, they are ordinary Haskell constructors that can easily be combined and returned with other parsers because of their nature as functions. This, in turn, makes them idiomatic to use, and being a popular choice among Haskellers, their ecosystem is pretty developed. You should have no trouble finding tutorials and documentation on how to use them.

It’s important to notice that parser combinators also have their flaws. Parser combinators may be implemented as either LL(1) or LL(∞) parsers, which, elaborating on the points described in Kirill Andreev’s How to Implement an LR(1) Parser:

Don’t support left recursion. That is to say, if there is a rule foo = foo *> bar, you might need to refactor your grammar otherwise it will loop forever.
Don’t resolve conflicts, which we will see as an example later in this article. This means the ordering of rules can influence the output, and you need to be careful when defining rules that can consume the same input.
- In an LL(1) implementation, it will pick the first alternative that maches the input.
- In an LL(∞) implementation, it will try the longest possible match, which may be very inneficient.
Many parser combinator libraries will backtrack, which will also be exemplified later, meaning that you might need to be careful to avoid performance penalties.
The backtracking mechanism may, in turn, lead to exponential time complexities.

This shouldn’t, however, discourage you from their use, as their benefits outweigh the drawbacks. Ultimately, the parsing library of your choice should be chosen based on your needs, and parser combinators are a good fit for most use cases.

Implementing a simple parser combinator from scratch

Before we delve into megaparsec, it may be interesting to implement our parser combinator from scratch. We will make a simple one that should hopefully help you understand how parser combinators work.

If you instead prefer to jump right into the action, then skip to the Megaparsec tutorial.

Implementation

We will begin by importing the appropriate definitions in a new file called Parser.hs:

module Parser where

import Control.Applicative (Alternative (..))
import Data.List (nub)

We will create two data structures that represent our parser.

The first of them represents some error that may happen while parsing. For now, let’s have three errors: one that is raised when we are expecting input but there is nothing to consume (EndOfInput), one when we find an unexpected character (Unexpected), and one for any error messages that the user might want to raise (CustomError).

Our second structure is the actual parser. It’s represented as a function that takes some input and either returns a list of errors, if Left, or the result of parsing the input together with the rest of the input (that was not parsed), if Right.

data Error i e
  = EndOfInput  -- Expected more input, but there is nothing
  | Unexpected i  -- We didn't expect to find this element
  | CustomError e  -- Extra errors the user may want to create
  | Empty  -- Used in `Alternative` implementation of `empty`
  deriving (Eq, Show)

newtype Parser i e a = Parser
  { runParser :: [i] -> Either [Error i e] (a, [i])
  }

This is the meaning of each of the type variables:

i: The input stream. For most cases, we will use Char, as [Char] is the same as String. It’s interesting to notice that keeping this as a type variable allows us to parse things that are not strings as well.
e: The type of custom error messages. If we don’t have those, we may use Void instead.
a: The result of our parsing function. It represents the structure parsed from our consumed input.

Let’s begin by creating the most primitive function: satisfy. It tests the current character with a predicate, and if it succeeds, it will advance the parser and return the consumed character.

We must take care that we are not at the end of the input, however, in which case we should fail.

satisfy :: (i -> Bool) -> Parser i e i
satisfy predicate = Parser $ \input ->
  case input of
    [] -> Left [EndOfInput]
    hd : rest
      | predicate hd -> Right (hd, rest)
      | otherwise    -> Left [Unexpected hd]

It’s also common to define a function that only succeeds if the input strictly matches some given element.

char :: Eq i => i -> Parser i e i
char i = satisfy (== i)

If you load this in GHCi, you should now be able to test it:

>>> :l Parser

>>> runParser (char 'h') "hello"
Right ('h', "ello")

>>> runParser (char 'h') "greetings"
Left [Unexpected 'g']

>>> runParser (char 'h') ""
Left [EndOfInput]

Let’s now create Monad and Alternative instances for our parser. For educative purposes, we will also explicitly implement Functor and Applicative instead of fmap = liftM and (<*>) = ap.

We start with Functor. Since we map over the a type variable obtained in the case of a successful parse, we can only apply f in case of Right.

instance Functor (Parser i e) where
  fmap f (Parser p) = Parser $ \input ->
    case p input of
      Left err -> Left err
      Right (output, rest) -> Right (f output, rest)

Since Either is a monad, we can use do-notation. Here’s an implementation that’s a little shorter:

instance Functor (Parser i e) where
  fmap f (Parser p) = Parser $ \input -> do
    (output, rest) <- p input
    pure (f output, rest)

Next up is Applicative. A good intuition behind Applicatives is that pure represents an identity element, while <*> distributes one argument over the other. Think of this like a multiplication, where 0 * a = a * 0 = 0, while 1 * a = a * 1 = a.

To illustrate this, observe this table:

	Left parser failed	Left parser succeeded
Right parser failed	Error from left parser	Error from right parser
Right parser succeeded	Error from left parser	Success

Note that even if both parsers fail, we only want the error from the first one, as we execute the second one only if the first one succeeded.

This is due to the Either part of the parser, where Left is like 0 and Right is like 1. pure represents our 1, so it should create a parser that always succeeds without consuming any input.

And similarly to Either, we want <*> to fail in case any of the operands result in a failed parse given some input. This implementation tries to run the left operand, which consumes some input and produces a function as well as the unconsumed input. The rest of the input is given to the right operand, and in case of success, it applies the function to the argument.

instance Applicative (Parser i e) where
  pure a = Parser $ \input -> Right (a, input)

  Parser f <*> Parser p = Parser $ \input ->
    case f input of
      Left err -> Left err
      Right (f', rest) ->
        case p rest of
          Left err -> Left err
          Right (output, rest') -> Right (f' output, rest')

Or with Either’s monad instance:

instance Applicative (Parser i e) where
  pure a = Parser $ \input -> Right (a, input)

  Parser f <*> Parser p = Parser $ \input -> do
    (f', rest) <- f input
    (output, rest') <- p rest
    pure (f' output, rest')

In our Monad instance, we set return = pure as usual. Our bind is more interesting: we try to run the given parser with the input, and if it succeeds, we give the produced output (of type a) to our continuation k, which produces a parser that is executed with the remainder of the output.

instance Monad (Parser i e) where
  return = pure

  Parser p >>= k = Parser $ \input ->
    case p input of
      Left err -> Left err
      Right (output, rest) ->
        let
          Parser p' = k output
        in
        p' rest

Again, using the fact that Either is a monad, we can simplify this implementation:

instance Monad (Parser i e) where
  return = pure

  Parser p >>= k = Parser $ \input -> do
    (output, rest) <- p input
    runParser (k output) rest

With our Applicative and Monad instances, we are now able to parse sequentially. For example, we defined char above to let us parse a single character, but what if we wanted to parse many of them?

string :: Eq i => [i] -> Parser i e [i]
string = traverse char

If traverse is new to you, it’s also equivalent to these definitions:

string :: Eq i => [i] -> Parser i e [i]
string [] = pure []
string (x : xs) = (:) <$> char x <*> string xs

Or even in terms of Monad:

string [] = return []
string (x : xs) = do
  y <- char x
  ys <- string xs
  return (y : ys)

Let’s try it in GHCi:

>>> runParser (string "Haskell") "Haskell"
Right ("Haskell","")

>>> runParser (string "Haskell") "Halloween"
Left [Unexpected 'l']

Finally, our last instance: Alternative. This may be new for you, so I will explain. First, take a look at its definition, which can be found in Control.Applicative:

class Applicative f => Alternative f where
  empty :: f a
  (<|>) :: f a -> f a -> f a

I mentioned that pure acts as 1, which is a parser that always succeeds without consuming input. empty works like the converse (or 0), that is a parser that always fails without consuming input.

For (<|>), we want to implement it such that the first parser to succeed is used. That is, we first try the left operand, and failing that, we try the second one. The table below shall guide you:

	Left parser failed	Left parser succeeded
Right parser failed	Errors from both parsers	Left parser
Right parser succeeded	Right parser	Left parser

Here is an implementation of Alternative. We constrain i and e with Eq so we can use nub to remove duplicated errors. Note that this is not the best or most efficient way to merge errors, but for our purposes, it’s good enough.

instance (Eq i, Eq e) => Alternative (Parser i e) where
  empty = Parser $ \_ -> Left [Empty]

  Parser l <|> Parser r = Parser $ \input ->
    case l input of
      Left err ->
        case r input of
          Left err' -> Left $ nub $ err <> err'
          Right (output, rest) -> Right (output, rest)
      Right (output, rest) -> Right (output, rest)

Finally, let’s test it!

>>> runParser (string "hello" <|> string "greetings") "hello, world"
Right ("hello",", world")

>>> runParser (string "hello" <|> string "greetings") "greetings, world"
Right ("greetings",", world")

>>> runParser (string "hello" <|> string "greetings") "bye, world"
Left [Unexpected 'b']

>>> runParser (empty <|> pure ()) ""
Right ()

>>> runParser (pure () <|> empty) ""
Right ()

We can see our errors composing with a more complicated expression:

>>> runParser ((string "hello" *> string ", globe") <|> string "greetings") "hello, world"
Left [Unexpected 'w',Unexpected 'h']

Hopefully, this should give you an intuition of how parser combinators work. Popular parser combinator libraries are more complex than this, as they tend to keep track of extra state for better error messages, optimization, information about line and column, etc., but the overall operations between them are similar.

The complete code for this section with solutions to the exercises below can be found here.

Exercises

Change char so that it returns errors such as Left [Expected 'h' 'g'] (“Expected ‘h’, but got ‘g’”).

Solution

Begin by creating a new error:

data Error i e
  = ...
  | Expected i i

Next, we will refactor the code originally in satisfy into token, just so we avoid duplicating code. Then, rewrite satisfy to use this new function, and rewrite char to make use of it as well.

The important part is simply that we change the error message.

token :: (i -> Error i e) -> (i -> Bool) -> Parser i e i
token mkErr predicate = Parser $ \input ->
  case input of
    [] -> Left [EndOfInput]
    hd : rest
      | predicate hd -> Right (hd, rest)
      | otherwise    -> Left [mkErr hd]

satisfy :: (i -> Bool) -> Parser i e i
satisfy = token Unexpected

char :: Eq i => i -> Parser i e i
char i = token (Expected i) (== i)

In GHCi you should see:

>>> parse (char 'h') "bye"
Left [Expected 'h' 'b']

Create an eof :: Parser i e () function which succeeds only if we have arrived at the end of the input.
- runParser (string "hello" <* eof) "hello" should return Right ("hello", "")
- runParser (strung "hello, world" <* eof) "" should return Left [ExpectedEndOfFile ','] (“Expected EOF, but got ‘,’”)

Solution

First, create a new error:

data Error i e
  = ...
  | ExpectedEndOfFile i

Now create a function that returns Right only if there is no more input, and an error otherwise.

eof :: Parser i e ()
eof = Parser $ \input ->
  case input of
    []     -> Right ((), [])
    hd : _ -> Left [ExpectedEndOfFile hd]

Trying it out in GHCi:

>>> parse (string "bye" <* eof) "bye"
Right ("bye","")

>>> parse (string "bye" <* eof) "byebye"
Left [ExpectedEndOfFile 'b']

Change the definition of Parser so it contains a [i] -> Offset -> (Either [Error i e] (Offset, a, [i])), where type Offset = Int. The offset indicates how many positions we have parsed.
- You may also change Error to store an offset so that you can indicate where an error occurred.
- runParser (string "hey") 0 "hi" should return Left [Error {erOffset = 1, erError = Expected 'e' 'i'}] (“Unexpected ‘e’, but got ‘i’, at position 1”)

Solution

The solution is mostly mechanically replacing definitions where errors occurred.

First, we rename Error to ErrorType, and create a new Error type:

data Error i e = Error
  { erOffset :: Offset
  , erError :: ErrorType i e
  } deriving (Eq, Show)

data ErrorType i e
  = ...

If you’ve done the previous exercises, then token may be changed now to take offset as input to the parser’s function, and return offset + 1 on success. If an error occurred instead, we now construct Error with the given input. Our mkErr function should not take an ErrorType instead of Error.

token :: (i -> ErrorType i e) -> (i -> Bool) -> Parser i e i
token mkErr predicate = Parser $ \input offset ->
  case input of
    [] -> Left [Error offset EndOfInput]
    hd : rest
      | predicate hd -> Right (offset + 1, hd, rest)
      | otherwise    -> Left [Error offset $ mkErr hd]

If you haven’t done the previous exercises, then change satisfy in a similar manner, taking offset as input and returning offset + 1, and error Error offset $ Unexpected hd on the error.

For the remainder of the functions, change the parser function to also take offset as an argument, and just return it unaffected. Take care to chain these offsets as required. For example, this is the new definition of Applicative:

instance Applicative (Parser i e) where
  pure a = Parser $ \input offset -> Right (offset, a, input)

  Parser f <*> Parser p = Parser $ \input offset ->
    case f input offset of
      Left err -> Left err
      Right (offset', f', rest) ->
        case p rest offset' of
          Left err -> Left err
          Right (offset'', output, rest') -> Right (offset'', f' output, rest')

Finally, in parse, use 0 as the initial offset.

Keep in mind it’s also possible to operate with line and column when you know you have a function operating in characters.

>>> parse (string "hey") "hi"
Left [Error {erOffset = 1, erError = Expected 'e' 'i'}]

The reader may also choose to instead refactor the parser type to use a custom-defined State type that contains the input stream and offset. Many improvements are possible, but we choose to keep it simple for educational purposes.

What’s next?

In this section, we’ve learned how to make a simple parsing combinator library. There are still various improvements that can be done. To name a few of them, we can implement more instances, such as Semigroup and Monoid, IsString, MonadFail, MonadPlus, among others. Error messages can be improved by better keeping track of location (such as line and column), supporting more types such as Text and ByteString, pretty-printing error messages, etc.

We will conclude this part here, and if you are interested, proceed to the next part where Megaparsec, an industrial-strength parser combinator library, will be introduced with a more practical application than our simple examples in this section.

Megaparsec tutorial

We will now take a look at a practical application of parser combinators: parsing S-expressions. Hopefully, this example will give you a solid foundation for creating parsers for your grammars.

Megaparsec is the go-to parser for industrial projects, so we will also use it in our tutorial. If you wish to see how it compares to other parser combinator libraries in Haskell, you can read its README.md.

Keep in mind you should choose the parser library that better suits your needs. If you plan to parse machine-written data, for example, then Attoparsec may be a better choice. Its interface is similar to Megaparsec’s, so the knowledge you will learn here will also be useful.

Without any further ado, we will now present how to use Megaparsec to parse some S-expressions.

Representing S-expressions

To begin, create a file called SExp.hs. Also, make sure to get the megaparsec package.

We will use the following data structure to represent various S-expressions. Gradually, we will write a parser that represents each of these cases.

module SExp where

import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

newtype Identifier = Identifier
  { getId :: String
  } deriving (Show)

data SExp
  = SSExp    SExp [SExp]  -- (foo 0 "hello" bar), (bar (baz 1)), (foo)
  | SInteger Integer  -- 42
  | SString  String  -- "hello, world"
  | SBool    Bool  -- false, true
  | SId      Identifier  -- foo
  deriving (Show)

We also need an alias for the type which megaparsec uses.

type Parser = Parsec
  -- The type for custom error messages. We have none, so use `Void`.
  Void
  -- The input stream type. Let's use `String` for now, but for
  -- better performance, you might want to use `Text` or `ByteString`.
  String

Booleans

Let’s start with the simplest of the parsers: SBool. We use <|> which tries one parser, and failing that, will try the other one.

bool :: Parser Bool
bool = False <$ string "false" <|> True <$ string "true"

We will also make a parser to represent our SExp-specific variants. As we progress, we will add more parsers to this list.

atom :: Parser SExp
atom = choice
  [ SBool <$> bool
  ]

Here we use choice, which is the same as running <|> between each element of the list. choice [a, b, c] = a <|> b <|> c.

It’s important to notice that choice, as well as many other useful combinators, are part of the parser-combinators package, which Megaparsec conveniently re-exports for us. Should you fail to find a specific function in Megaparsec’s documentation, chances are that it’s defined in parser-combinators instead.

Now, fire up GHCi and test that we can parse it! megaparsec provides the parseTest function which allows us to conveniently test our parsers.

>>> parseTest atom "false"
SBool False

>>> parseTest atom "hey"
1:1:
  |
1 | hey
  | ^^^
unexpected "hey"
expecting "false" or "true"

Integers

To parse integers, one strategy is to parse as many digits (chars whose value is between '0' and '9') as possible. This will give us a list of characters, which we can then read.

For this parser, we use the some function, which tries to run a given parser 1 or more times. numberChar is exported from Text.Megaparsec.Char and parses any character that matches a digit. In regex, this is equivalent to [0-9]+. This will return us a [Char] (aka String), which we can read to make it into an Integer.

integer :: Parser Integer
integer = read <$> some numberChar

We now change our atom to also include integers.

atom :: Parser SExp
atom = choice
  [ SBool <$> bool
  , SInteger <$> integer
  ]

Let’s also test it in GHCi:

>>> parseTest atom "42"
SInteger 42

>>> parseTest atom "hey"
1:1:
  |
1 | hey
  | ^^^
unexpected "hey"
expecting "false", "true", or numeric character

Although the error message is good, wouldn’t it better if it said “integer” instead of “numeric character”? For this, we can use labels, which allows us to give a name to a parser. To do this, we can simply change our parser like this:

integer :: Parser Integer
integer = label "integer" $ read <$> some numberChar

If you prefer, it’s also possible to use <?>, which does the same as label.

integer :: Parser Integer
integer = read <$> some numberChar <?> "integer"

And now the error message should be slightly improved:

>>> parseTest atom "hey"
1:1:
  |
1 | hey
  | ^^^
unexpected "hey"
expecting "false", "true" or integer

You can also use label "boolean" in our bool parser to change the message to expecting boolean or integer.

Exercise: Can you change this parser so it also accepts negative integers?

Solution

One possible solution using `<|>`:

integer :: Parser Integer
integer = label "integer" $ read <$> (some numberChar <|> ((:) <$> char '-' <*> some numberChar))

However, later in the tutorial, we will see an even better way that uses Megaparsec’s own mechanisms.

Strings

To parse a string, the idea is to find everything that is between double-quotes. This approach has a weakness, though: we won’t be able to parse escaped quotes. Later in this tutorial, we will present a solution to escaped characters using Megaparsec’s lexing mechanisms.

To achieve this, we will use two new functions: between and takeWhileP. Using between open close x is equivalent to open *> x <* close, while takeWhileP will read as many characters as possible that satisfy some predicate.

Note that takeWhileP also accepts, optionally, a label for each character that it tries to read. For now, we will leave it as Nothing.

str :: Parser String
str = label "string" $ between (char '"') (char '"') (takeWhileP Nothing (/= '"'))
-- Or:
--- str = label "string" $ char '"' *> takeWhileP Nothing (/= '"') <* char '"'

In atom, include SString <$> str.

>>> parseTest atom "\"hey\""
SString "hey"

Identifiers

An identifier will be like a variable in a programming language. We can choose any naming convention for such a thing, but here I chose the one that is frequently used in C-like languages: the first letter must be a letter or an underscore, while the remainder may be a letter, digit, or underscore.

Here we use the many function, which is very similar to the some function discussed before. The difference is that many tries to run the given parser 0 or more times. In regex, our identifier parser is equivalent to [a-zA-Z_][a-zA-Z0-9_]*.

identifier :: Parser Identifier
identifier = label "identifier" $ do
  first <- letterChar <|> char '_'
  rest <- many $ alphaNumChar <|> char '_'
  pure $ Identifier $ first : rest
-- Or:
-- identifier = Identifier <$> label "identifier" ((:) <$> (letterChar <|> char '_') <*> many (alphaNumChar <|> char '_'))

Now include SId <$> identifier to atom.

Warning: When parsing with <|> or choice, the order of the parsers matters. Take this as an example:

>>> parseTest (SBool <$> bool <|> SId <$> identifier) "true"
SBool True

>>> parseTest (SId <$> identifier <|> SBool <$> bool) "true"
SId (Identifier {getId = "true"})

In the first case, we parse a boolean before an identifier, which causes true to be matched by bool. However, in the second example, using identifier before bool caused true to be recognized as an identifier. You should be careful while considering such things. In most cases, you can assume that parsers are greedy. Notice that in the second case, the bool parser will never be executed.

In other words, make sure that you parse a boolean in atom before you parse an identifier.

S-expressions

Finally, let’s move on to our last parser, which parses S-expressions themselves, such as (foo 42 "hey") or (bar).

First, we will do it the naïve way, which may come as the first natural solution to someone just learning Megaparsec, and then we will do it in a better manner using extra tools offered by Megaparsec.

S-expressions: the awkward way

The idea here is to parse an atom, followed by 0 or more atoms. Of course, this means that we want to use many to parse such atoms. Additionally, an S-expression may be surrounded by parenthesis, so we can use between for such a task.

This parser, however, is not as simple as before, but let’s first do it the wrong way (which won’t work properly), analyze the problem, and come up with a solution to it.

sexp :: Parser (SExp, [SExp])
sexp = label "S-expression" $ between (char '(') (char ')') ((,) <$> atom <*> many atom)
-- Or:
-- sexp = label "S-expression" $ char '(' *> ((,) <$> atom <*> many atom) <* char ')'

Note that (,) <$> atom <*> many atom is pretty similar to some atom. parser-combinators exports a non-empty some that you can use instead, and have the parser return Parser (NonEmpty SExp). Megaparsec exports the ordinary list version instead, so if you want to use this one, you’ll need to import it manually and include a dependency on parser-combinators.

After adding it to atom with uncurry SSExp <$> sexp, we can test it in GHCi:

>>> parseTest atom "(foo)"
SSExp (SId (Identifier {getId = "foo"})) []

Ok, looks good so far. What if we try to apply an argument to foo?

>>> parseTest atom "(foo bar)"
1:5:
  |
1 | (foo bar)
  |     ^
unexpected space
expecting ')', '_', S-expression, alphanumeric character, boolean, identifier, integer, or string

Uh-oh. We can’t parse spaces, as the error makes evident. Either we close parenthesis (stop parsing this S-expression), continue the definition of the identifier (_ or alphanumeric character), put another atom (boolean, identifier, integer, or string), or another S-expression (by opening parenthesis).

One solution that could come to one’s mind is to use the function sepBy, which takes a parser and a separator, and use it as atom `sepBy` space. In this case, space is a parser defined by Megaparsec that skips zero or more spaces.

>>> parseTest (between (char '(') (char ')') $ SSExp <$> atom <*> (space *> atom `sepBy` space)) "(foo bar)"
SSExp (SId (Identifier {getId = "foo"})) [SId (Identifier {getId = "bar"})]

Of course, this may raise other questions: what if we have a space character before or after the opening and closing parenthesis?

While explicitly skipping spaces works, the megaparsec documentation already offers a solution for this kind of problem, namely, a lexer. Indeed, we will see that some of our previous definitions could have been written more elegantly with the help of this module.

The important bit for us now is this:

Parsing of white space is an important part of any parser. We propose a convention where every lexeme parser assumes no spaces before the lexeme and consumes all spaces after the lexeme; this is what the lexeme combinator does, and so it’s enough to wrap every lexeme parser with lexeme to achieve this. Note that you’ll need to call space manually to consume any white space before the first lexeme (i.e. at the beginning of the file).

S-expressions: the elegant way

The first thing you should do is import the appropriate lexing module. Megaparsec recommends to import this module qualified.

import qualified Text.Megaparsec.Char.Lexer as L

Now we need to define what is whitespace. Until now, with the usage of space, we have considered that it may be a space, tab, newline, carriage return, form feed, or vertical tab. Megaparsec can go beyond that and also consider line comments (such as -- comment) and block comments (such as {- comment -}) as whitespace. Since we get those for free, what about using ;; for line comments and /* */ for block comments?

Begin by defining skipSpace. We use the space helper function from the lexer module, which expects a function to parse 1 or more spaces, a line comment, and a block comment.

skipSpace :: Parser ()
skipSpace = L.space
  -- Like `space`, but skips 1 or more space characters.
  space1
  -- Skip from ;; until a newline.
  (L.skipLineComment ";;")
  -- Skip from /* until */. There is also `skipBlockComment`, but it doesn't handle nested comments.
  (L.skipBlockCommentNested "/*" "*/")

And now define a lexeme parser, which indicates how to consume space after the given parser.

lexeme :: Parser a -> Parser a
lexeme = L.lexeme skipSpace

Now you can use lexeme in each of the parsers, and they will automatically skip whitespace after their definitions. Note that we also use lexeme before our first opening parenthesis, so we can skip space just after the (.

sexp :: Parser (SExp, [SExp])
sexp = label "S-expression" $ lexeme $
  between (lexeme (char '(')) (char ')') ((,) <$> atom <*> many atom)

You may add lexeme to other parsers where it’s desirable to skip whitespace after the rule. For instance:

integer :: Parser Integer
integer = label "integer" $ lexeme $ read <$> some numberChar

Note: The reader should do the same for str, bool, and identifier. Alternatively, adding lexeme to atom will also work.

And now we can parse S-expressions properly:

>>> parseTest atom "(  foo  /* 8-) */  bar  ;; look mom, comments! \n  )"
SSExp (SId (Identifier {getId = "foo"})) [SId (Identifier {getId = "bar"})]

Our main parser function

Finally, let’s make a function that solves the last two outstanding problems.

For the first one, as mentioned by the Megaparsed documentation, lexeme will skip whitespace after a parser, but we also need to deal with whitespace at the beginning of the file.

For the second one, we want to parse everything until we find an eof since our parser will currently ignore anything after a complete atom. The two defects are exemplified below:

>>> parseTest atom "  42"
1:1:
  |
1 |   42
  | ^^^^
unexpected "  42"
expecting S-expression, boolean, identifier, integer, or string

>>> parseTest atom "42 foo bar"
SInteger 42

We will create a function that takes our input string, and either returns an error message or the actual result. Such function “bootstraps” our parser with between skipSpace eof atom, which will solve the two defects. It will also serve as an auxiliary function that can be called by client code, rather than parseTest which is more useful for working in GHCi.

parseSExp :: String -> Either String SExp
parseSExp input =
  let
    outputE = parse
      -- Skip any whitespace at the beginning, expect the end of input after the atom.
      (between skipSpace eof atom)
      -- Name of the source file, you can give it any name you want. I leave it blank for now.
      ""
      -- The actual string to be parsed
      input
  in
  -- If we get Left, it will be an `ParseErrorBundle`, let's pretty print it for now.
  -- If we get Right, do nothing.
  case outputE of
    Left err -> Left $ errorBundlePretty err
    Right output -> Right output

Now in GHCi:

>>> case parseSExp "   42" of { Left e -> putStrLn e; Right r -> print r }
SInteger 42

>>> case parseSExp "42 foo" of { Left e -> putStrLn e; Right r -> print r }
1:4:
  |
1 | 42 foo
  |    ^
unexpected 'f'
expecting end of input

One more case: `Double`

There is one more caveat that Megaparsec users should know about: backtracking. We show a somewhat contrived but realistic example. To illustrate this, let’s add one more case to our SExp:

data SExp
  = ...
  | SDouble  Double  -- 42f, 3.1415f

We will parse it as such: first, some digits, optionally followed by a . and some more digits, and an f at the end. In regex, this is equivalent to [0-9](\.[0-9])?f. We achieve the optional part using the optional function, which returns a Parser (Maybe a).

double :: Parser Double
double = label "double" $ lexeme $ read <$> do
  left <- some numberChar
  rightM <- optional $ do
    char '.'
    some numberChar
  char' 'f'  -- Like char, but case-insensitive
  pure $ case rightM of
    Nothing -> left
    Just right -> left ++ "." ++ right

Now, we add it to atom: SDouble <$> double. Depending whether you added double before or after integer, the result may be surprising:

Parser \ Position	`double` before `integer`	`double` after `integer`
parseTest atom "42"	1:3: \| 1 \| 42 \| ^ unexpected end of input expecting '.', 'F', 'f', or numeric character	SDouble 42.0
parseTest atom "42f"	SInteger 42	SInteger 42

So it only got one case correctly. To fix it, one solution would be to use try, which allows Megaparsec to do as much backtracking as it wants.

-- Note: Other cases omitted for brevity.
atom :: Parser SExp
atom = choice
  [ SDouble <$> try double
  , SInteger <$> integer
  ]

And it should fix the problem. However, this is not ideal in terms of performance, as it will cause Megaparsec to consume the entirety of the double to later find that there is no f, backtrack, and finally try again with an integer.

In such a situation this is trivial, but care must be taken in more complex examples, for example, what if we wanted to have (foo bar) as well as tuples such as (foo bar, baz quz)? The S-expression could be very long and lead to exponential complexity in the worst-case scenario. Recently, a very similar problem was fixed in one of our projects.

As we described in the introduction of this tutorial, it’s important to notice Megaparsec normally operates as a LL(1) parser, meaning it tries to look at only one character ahead when executing parsers in an alternative. Using try allows us to consume an arbitrary number of characters before backtracking, allowing it to be used as a LL(∞) parser.

In this situation, the best thing to do is to refactor the grammars to share common parts. The code above could be written without try as such:

numeric :: Parser SExp
numeric = label "number" $ lexeme $ do
  -- Left side before . or f
  left <- some numberChar

  -- .0f or f
  rightM <- optional $ choice
    [ do
        char '.'
        right <- some numberChar
        char' 'f'
        pure $ left ++ "." ++ right
    , do
        char' 'f'
        pure left
    ]

  -- If we had the right side, it's a double, otherwise an integer.
  pure $ case rightM of
    Nothing -> SInteger $ read left
    Just right -> SDouble $ read right

And in atom, we simply use numeric instead of the two previous parsers.

Rewriting with Megaparsec’s lexer

A lot of the things could be written using the lexer module from Megaparsec. In this section, we will present the simplified versions for functions that can be improved.

Because we will use floatingOrInteger, you should include the scientific package and import it:

import Data.Scientific

And here is our rewritten parser. Note that our new numeric has a quirk: 1.0 is parsed as an integer by Megaparsec. A resolution to this is left as an exercise to the reader.

-- signed takes a parser on how to skip space after the sign.
integer :: Parser Integer
integer = label "integer" $ lexeme $ L.signed skipSpace L.decimal

double :: Parser Double
double = label "double" $ lexeme $ L.signed skipSpace L.float <* char' 'f'

-- charLiteral also takes escaped characters into question.
str :: Parser String
str = label "string" $ lexeme $ char '"' *> manyTill L.charLiteral (char '"')

numeric :: Parser SExp
numeric = lexeme $ do
  value <- L.signed skipSpace L.scientific
  case floatingOrInteger value of
    Left d -> SDouble d <$ char' 'f'
    Right i -> do
      f <- optional $ char' 'f'
      pure $ case f of
        -- Note: 1.0 is an integer, but 1.1 is a parser error.
        Nothing -> SInteger i
        Just _ -> SDouble $ fromIntegral i

Note the changes to str as well. manyTill will run L.charLiteral taking characters, including escaped characters until the matching " is found.

As a bonus, we got various improvements:

No need to read values.
We can use signed to parse negative numbers.
We can parse escaped characters in strings.

The functions for integer and double were kept for reference, even though numeric is used instead.

Now let’s check it in GHCi:

>>> parseTest atom "(foo -42 \"with \\\" now!\")"
SSExp (SId (Identifier {getId = "foo"})) [SInteger (-42),SString "with \" now!"]

And I hope that with this you can better understand how parser combinators work. Now get to hacking your languages!

The complete code for this part can be found here.

Bonus: quasi-quotations

This section is better understood if you have some Template Haskell (TH) knowledge. If you haven’t learned about it, then I strongly encourage you to check out our Introduction to Template Haskell before trying out this section.

Note that this section is not necessary for our knowledge of parser combinators, and it’s purely a bonus section on how to do more with our S-expressions.

For this part, you will need the template-haskell package.

Long story short, Template Haskell is a mechanism that allows GHC to evaluate and analyze code during compile-time. One such application is using QuasiQuoters, which allows us to run custom parsers during compile time. Here we will use them for two purposes: to use S-expressions as Haskell expressions and to use S-expressions as Haskell patterns, both of which can be parsed during compilation.

We will create a new file called TH.hs, whose definition is given below.

module TH where

import Control.Monad ((<=<))
import Language.Haskell.TH.Quote (QuasiQuoter (..))
import Language.Haskell.TH.Syntax (Q, dataToPatQ, liftData)

import SExp (SExp, parseSExp)

sexp :: QuasiQuoter
sexp = QuasiQuoter
  { quoteExp  = liftData <=< parseOrThrow
  , quotePat  = dataToPatQ (const Nothing) <=< parseOrThrow
  , quoteType = const $ fail "No S-expression parser for type"
  , quoteDec  = const $ fail "No S-expression parser for declarations"
  }
  where
    parseOrThrow :: String -> Q SExp
    parseOrThrow = either fail pure . parseSExp

Additionally, liftData and dataToPatQ assume that our parsed SExp implement the Data type class. We will change our SExp module to automatically derive this datatype for us. First, enable the DeriveDataTypeable language extension, so GHC can automatically implement Data for us.

You also need to import Data from Data.Data.

{-# LANGUAGE DeriveDataTypeable #-}

module SExp where

import Data.Data (Data)
-- The rest of the file goes here...

And now change Identifier and SExp to derive Data:

newtype Identifier = Identifier
  { getId :: String
  } deriving (Data, Show)

data SExp
  = SSExp    SExp [SExp]  -- (foo 0 "hello" bar), (bar (baz 1)), (foo)
  | SInteger Integer  -- 42
  | SString  String  -- "hello, world"
  | SBool    Bool  -- false, true
  | SId      Identifier  -- foo
  | SDouble  Double  -- 42f, 3.1415f
  deriving (Data, Show)

The basic contents of the TH module are summarized as follows:

Each field of QuasiQuoter provides a String parser for a different Haskell syntax.
quoteExp creates a Haskell expression.
- This will allow us to write expressions such as let mySExp = [sexp|(foo 42)|].
- We use our standard Megaparsec parser, parseSExp, but fail if it can’t parse (halting compilation). liftData then transforms our SExp into an Exp.
- If this is your first time seeing <=<, it’s just another way to compose monadic functions. g <=< f is the same as \x -> g =<< f x.
- If you’ve read our previous TH tutorial, it creates an Exp just like [e||].
quotePat creates a Haskell pattern.
- Example: case mySExp of {[sexp|(foo 42)|] -> True; _ -> False}.
- Similarly to quoteExp, we use dataToPatQ (const Nothing) to transform our SExp into a Pat.
  - Confusingly, liftData is defined as dataToExpQ (const Nothing).
  - The const Nothing can be replaced to handle type-specific cases. We are fine with the default behavior for now but we’ll change it later.
- Creates a Pat, like [p||].
quoteType creates a Haskell type.
- Since we have no SExp, SInteger etc types matching our constructors, we will fail when someone tries to use our parser as types.
- Creates a Type, like [t||].
quoteDec creates a Haskell declaration.
- Similarly to quoteType, we won’t define this one.
- Creates a Dec, like [d||].

Now enable the QuasiQuotes extension in GHCi and you should be able to test it. Notice how failing to parse instead causes GHCi to fail to interpret the given S-expression (compilation error).

>>> :set -XQuasiQuotes
>>> [sexp|(foo "bar" 42)|]
SSExp (SId (Identifier {getId = "foo"})) [SString "bar",SInteger 42]

>>> [sexp|no parenthesis|]
Out [19]:

<interactive>:1135:7: error:
    • 1:4:
  |
1 | no parenthesis
  |    ^
unexpected 'p'
expecting end of input

    • In the quasi-quotation: [sexp|no parenthesis|]

There are still two improvements we could do, however. The first one is that Megaparsec could use the actual Haskell location:

>>> let SInteger x = [sexp|.|] in x
(omitted for brevity, error in 1:1)

As for the second one, it would be much better if we could bind S-expression variables into Haskell ones, such as the following:

foo [sexp|(foo $x)|] = Just x
foo _                = Nothing

Using Haskell locations

Let’s change our parser runner a bit so it starts with the current line and column. We will create a new auxiliary parseSExpWithPos, which takes a SourcePos and uses it for the initial state. This function was created from our old parseSExp.

Additionally, we create a new parseSExp with the default position.

We now call runParser' instead of directly calling parse, as it allows us to give it an initial state and build such a state with the default settings. Keep in mind that we’re getting into the internal behavior of Megaparsec here, so changes in the package can break this code.

parseSExp :: String -> Either String SExp
parseSExp = parseSExpWithPos (initialPos "")

parseSExpWithPos :: SourcePos -> String -> Either String SExp
parseSExpWithPos pos@(SourcePos file _line _column) input =
  let
    initState = State
      { stateInput = input
      , stateOffset = 0
      , statePosState = PosState
        { pstateInput = input
        , pstateOffset = 0
        , pstateSourcePos = pos
        , pstateTabWidth = defaultTabWidth
        , pstateLinePrefix = ""
        }
      , stateParseErrors = []
      }
    outputE = snd $ runParser'
      -- Skip any whitespace at the beginning, expect the end of input after the atom.
      (between skipSpace eof atom)
      -- The input state with the string to be parsed.
      initState
  in
  -- If we get Left, it will be a `ParseErrorBundle`, let's pretty print it.
  -- If we get Right, do nothing.
  case outputE of
    Left err -> Left $ errorBundlePretty err
    Right output -> Right output

Now we can change our TH modules. We will make new imports:

import Language.Haskell.TH.Syntax (Loc (..), Q, dataToPatQ, liftData, location)
import Text.Megaparsec (SourcePos (..), mkPos)

import SExp (SExp, parseSExpWithPos)

And we will change parseOrThrow as follows:

    parseOrThrow :: String -> Q SExp
    parseOrThrow input = do
      posTH <- location
      let pos = SourcePos
            (loc_filename posTH)
            (mkPos $ fst $ loc_start posTH)
            (mkPos $ snd $ loc_start posTH)
      either fail pure $ parseSExpWithPos pos input

With it, we now use location to query where the splice occurred and put it in the format that Megaparsec expects. Trying it in GHCi gives me:

>>> let SInteger x = [sexp|foo 42|] in x
<interactive>:1730:24: error:
    • <interactive>:1730:28:
     |
1730 | foo 42
     |
unexpected '4'
expecting end of input

    • In the quasi-quotation: [sexp|foo 42|]

For visualization, it may be better if we create a small test file:

module Test where

import TH (sexp)

testQQ = [sexp|foo 42|]

Compiling it results in:

Test.hs:5:16: error:
    • Test.hs:5:20:
  |
5 | foo 42
  |
unexpected '4'
expecting end of input

    • In the quasi-quotation: [sexp|foo 42|]
  |
5 | testQQ = [sexp|foo 42|]
  |                ^^^^^^^^

You may freely change how errors are printed by changing errorBundlePretty with a custom error printing function, but for this tutorial, this is good enough for us.

Binding variables in patterns

As mentioned, it would be nice if we could bring S-expression bindings into the Haskell scope. For this, let’s add a new term into our SExp:

data SExp
  -- other constructors omitted for brevity
  | STH String  -- $meta

We may now add a parser to extract it. Let’s reuse our identifier for this:

th :: Parser String
th = label "TH variable" $ lexeme $ char '$' *> fmap getId identifier

Don’t forget to add it to atom with STH <$> th.

We are now ready to add it to our TH module. First, we will need some new imports:

import Data.Typeable (Typeable, cast)
import Language.Haskell.TH.Lib (varP)
import Language.Haskell.TH.Syntax (Loc (..), Name, Q, dataToPatQ, liftData, location, mkName) -- no Loc and location if you've skipped the previous item

import SExp (SExp (STH), parseSExpWithPos)  -- or parseSExp if you've skipped the previous item

Typeable is a type class that allows us to try to cast some type into another. Here, we attempt a cast into SExp.

We will add a new function to aid us:

sexp = -- ommited for brevity
  where
    antiQuoteVar :: Typeable a => a -> Maybe (Q Pat)
    antiQuoteVar var = case cast var of
      Nothing -> Nothing
      Just var' -> case var' of
        STH var'' -> Just $ varP $ mkName var''
        _other    -> Nothing

We can now replace const Nothing with antiQuoteVar in our quasi-quoter:

  , quotePat  = dataToPatQ antiQuoteVar <=< parseOrThrow

Now check whether it works:

>>> case [sexp|(foo 42)|] of
...  [sexp|(foo $answer)|] -> print answer
...  _ -> putStrLn "oops"
SInteger 42

It’s also possible to give a meaning to $answer at the expression level by having it capture a Haskell variable instead. Let’s change TH a bit to support it:

    antiQuoteVar :: Typeable a => (Name -> Q b) -> a -> Maybe (Q b)
    antiQuoteVar mkVar var = case cast var of
      Nothing -> Nothing
      Just var' -> case var' of
        STH var'' -> Just $ mkVar $ mkName var''
        _other    -> Nothing

And our QuasiQuoter now has:

  { quoteExp  = dataToExpQ (antiQuoteVar varE) <=< parseOrThrow
  , quotePat  = dataToPatQ (antiQuoteVar varP) <=< parseOrThrow

Don’t forget to import dataToExpQ, which now replaces liftData as well as varE. We can now test our quasi-quoter again:

>>> let x = SInteger 42 in case [sexp|(foo $x)|] of
          [sexp|(foo $answer)|] -> print answer
          _ -> putStrLn "oops"
SInteger 42

And that’s it! You now support quasi-quotations.

The complete code for the parser and quasi-quoter may be found here.

Conclusion

In this post, we’ve introduced the theory and a practical application of parser combinators. We’ve discussed what are they as well as their pros and cons, implemented a simple parser combinator from scratch, learned how to write an S-expression parser with Megaparsec, and created a simple quasi-quotation to consume input at compile-time.

Some improvements can still be made and they are left as an exercise to the reader. For example, the S-expression parser operates on Strings, but Megaparsec also supports Text and ByteString as the input stream which should have better performance. On top of that, if the reader is familiar with type families (otherwise, check out our article on Type Families), they may try to improve our parser from scratch to also operate on more data types.

Appendix

Common parsing functions

Below we present some other useful functions and combinators you may want to use. We show their names, types, and comment about them.

Name	Type	Documentation
`many`	`Alternative f => f a -> f [a]`	Parse 0 or more occurrences of the given parser.
`some`	`Alternative f => f a -> f [a]`	Parse 1 or more occurrences of the given parser.
`option`	`Alternative m => a -> m a -> m a`	Parse 0 or 1 occurence of the given parser, with a fallback `a` if it fails.
`optional`	`Alternative f => f a -> f (Maybe a)`	Like `option`, but returns a `Maybe` instead.
`sepBy`, `sepBy1`	`Alternative m => m a -> m sep -> m [a]`	Parses various occurences of the given parser separated by the given separator. Functions prefixed with `1` parse 1 or more, while the ones without parse `0` or more.
`sepEndBy`, `sepEndBy1`	`Alternative m => m a -> m sep -> m [a]`	Like the `sepBy` variants, but additionally allow the separator to appear optionally at the end of the sequence.
`endBy`, `endBy1`	`Alternative m => m a -> m sep -> m [a]`	Like the `sepEndBy` variants, but the separator at the end is not optional.
`char`	`(MonadParsec e s m, Token s ~ Char) => Token s -> m (Token s)`	Parses a single character matching the given char.
`char'`	`(MonadParsec e s m, Token s ~ Char) => Token s -> m (Token s)`	Like `char`, but case-invariant.
`string`	`MonadParsec e s m => Tokens s -> m (Tokens s)`	Parses many characters matching the given string.
`string'`	`(MonadParsec e s m, FoldCase (Tokens s)) => Tokens s -> m (Tokens s)`	Like `string`, but case-invariant.
`makeExprParser`	`MonadPlus m => m a -> [[Operator m a]] -> m a`	Allows parsing generic expressions, supporting infix, prefix, and postfix operators with precedences.