ENGINEERING

# Writing a Parser Combinator from Scratch in TypeScript

Ben Wiklund

Software Engineer, Sigma

Writing parsers can be challenging. The problem is simple to describe: convert an input string into a syntax tree. The same language can be parsed by many different algorithms, each with different tradeoffs in speed, memory usage, readability, and maintainability.

In this blog post, we’ll explore the approach we take to parsing formulas in Sigma, using parser combinators. Parser combinators allow us to compose many simple functions together to define our entire grammar. The underlying algorithm is recursive descent with backtracking, using techniques available to us in languages with first-class functions, like JavaScript.

## Background

At Sigma, we have our own small language that lets users write formulas in worksheets, very similar to what users are familiar with in Excel or Google Sheets. We chose to write our own parser so we could have full control over error handling and annotations, which lets us provide code completion even on malformed strings, like a formula that’s only halfway written.

Our parser uses recursive descent with backtracking, built using combinators. The code here isn’t the Sigma parser, but a similar one written for a simpler grammar to show the basic concepts. In this article we’ll walk through building a parser for a simple language that supports function calls and number literals, like `Foo(Bar(1,2,3))`.

## The goal

Our high level goal is to write a function that takes an input string, and returns a syntax tree. Our language is going to look like this:
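For example (borrowing the call syntax from above), expressions like these would all be valid programs:

```
42
Foo(1,2,3)
Foo(Bar(1,2,3))
```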

We could describe this language in (loosely) EBNF like this:

```
program = expr
expr    = call | number
call    = ident '(' [ argList ] ')'
argList = expr (',' expr)*
number  = ... whatever number format we decide to support
ident   = ... idk, how about /[a-zA-Z][a-zA-Z0-9]*/
```

Our parser will use recursive descent, which means we’re going to start from the highest level structure in the language (in this case, an expression), and branch out into the rest of the language definition based on the rules we’ve defined.

The leaves of this tree (number, ident, '(' and ')') are called terminal symbols, and are the only parts of the parser that directly consume characters from the input string.

The rest of the parser is made up of non-terminals, which represent combinations of other symbols. (In parsing terminology, a symbol is just a part of the language; each line in the EBNF above represents one symbol.)

Parsers commonly have a lexer in front of the actual parser. In short, a lexer preprocesses an input string into a flat list of tokens. We can skip this step entirely because the code for our terminal symbols will consume the string directly.

## Some types

We’ll be writing this in a pure functional style, which means we want every function to return an immutable result, with no side effects. We want to traverse the input string from the top, all the way down to the leaf nodes (the terminals), and then return the parts of the tree from the bottom up.

Functional purity is desirable here because it makes it easier to reason about the behaviour of a single parser and its output, instead of wondering what other non-obvious things could be happening in the parsing process outside of the code you’re looking at.

For each part of the grammar we’re creating, we want to:

• Take an input Context (the string containing our code, and the position we’re currently at)
• On success, return a Success containing a value and a new Context with the next position in the string to continue parsing from
• On failure, return a Failure with the position and reason for the failure

By creating these parsers, and writing a variety of functions that compose parsers together, we will build our complete parser for this language.
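Concretely, the types might look something like this (a minimal sketch; the exact names, Context, Success, Failure, and Parser, are one reasonable choice rather than the only one):

```typescript
// Where we are in the input. Immutable: parsers never mutate a Context,
// they return a new one.
type Context = Readonly<{
  text: string;  // the full input string
  index: number; // our current position within it
}>;

// A parser either succeeds with a value and a new Context...
type Success<T> = Readonly<{ success: true; value: T; ctx: Context }>;
// ...or fails, reporting what it expected and where.
type Failure = Readonly<{ success: false; expected: string; ctx: Context }>;
type Result<T> = Success<T> | Failure;

// Every parser is just a function from Context to Result.
type Parser<T> = (ctx: Context) => Result<T>;

// Small helpers for building results.
const success = <T>(ctx: Context, value: T): Success<T> =>
  ({ success: true, value, ctx });
const failure = (ctx: Context, expected: string): Failure =>
  ({ success: false, expected, ctx });
```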

Whoa, that’s a bunch of types upfront, so let’s give a concrete example of how they’d work:

How would we implement `parseCow`?
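A minimal sketch, assuming Context/Success/Failure types shaped like the contract described above (exact names are illustrative):

```typescript
type Context = Readonly<{ text: string; index: number }>;
type Success<T> = Readonly<{ success: true; value: T; ctx: Context }>;
type Failure = Readonly<{ success: false; expected: string; ctx: Context }>;
type Result<T> = Success<T> | Failure;
type Parser<T> = (ctx: Context) => Result<T>;
const success = <T>(ctx: Context, value: T): Success<T> => ({ success: true, value, ctx });
const failure = (ctx: Context, expected: string): Failure => ({ success: false, expected, ctx });

// A hand-written parser for the literal word "cow".
const parseCow: Parser<string> = (ctx) => {
  const match = "cow";
  const endIdx = ctx.index + match.length;
  if (ctx.text.substring(ctx.index, endIdx) === match) {
    // Success: consume the word and advance the index.
    return success({ ...ctx, index: endIdx }, match);
  }
  // Failure: report what we expected, without advancing.
  return failure(ctx, match);
};
```

Calling `parseCow({ text: "cow says moo", index: 0 })` succeeds with the value `"cow"` and a new context at index 3.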

Okay, but that’s a lot of code just to parse the word cow. Luckily, JavaScript supports first-class functions. By writing functions that return functions, we can create parsers in a declarative way. These are the “combinators” this article is about.
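For instance, a combinator could generalize `parseCow` to any literal string. A sketch (the name `str` is an assumption):

```typescript
type Context = Readonly<{ text: string; index: number }>;
type Success<T> = Readonly<{ success: true; value: T; ctx: Context }>;
type Failure = Readonly<{ success: false; expected: string; ctx: Context }>;
type Result<T> = Success<T> | Failure;
type Parser<T> = (ctx: Context) => Result<T>;
const success = <T>(ctx: Context, value: T): Success<T> => ({ success: true, value, ctx });
const failure = (ctx: Context, expected: string): Failure => ({ success: false, expected, ctx });

// str takes a literal string and *returns a parser* for that string.
function str(match: string): Parser<string> {
  return (ctx) => {
    const endIdx = ctx.index + match.length;
    if (ctx.text.substring(ctx.index, endIdx) === match) {
      return success({ ...ctx, index: endIdx }, match);
    }
    return failure(ctx, match);
  };
}

// Our earlier parser is now a one-liner.
const parseCow = str("cow");
```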

To take a detour and parse this (meaningless) “cow says moo” sentence, we could write something like the following, which has some immediately obvious issues:
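A sketch of what that manual chaining looks like (assuming a `str` helper that parses a literal string; all names here are illustrative):

```typescript
type Context = Readonly<{ text: string; index: number }>;
type Success<T> = Readonly<{ success: true; value: T; ctx: Context }>;
type Failure = Readonly<{ success: false; expected: string; ctx: Context }>;
type Result<T> = Success<T> | Failure;
type Parser<T> = (ctx: Context) => Result<T>;
const success = <T>(ctx: Context, value: T): Success<T> => ({ success: true, value, ctx });
const failure = (ctx: Context, expected: string): Failure => ({ success: false, expected, ctx });

// str parses a literal string at the current position.
const str = (match: string): Parser<string> => (ctx) =>
  ctx.text.startsWith(match, ctx.index)
    ? success({ ...ctx, index: ctx.index + match.length }, match)
    : failure(ctx, match);

// Parse "cow says moo" by manually threading each Context into the next parser.
const cowSaysMoo: Parser<string> = (ctx) => {
  const cowRes = str("cow")(ctx);
  if (!cowRes.success) return cowRes;
  const saysRes = str(" says ")(cowRes.ctx); // easy to pass the wrong ctx here...
  if (!saysRes.success) return saysRes;
  const mooRes = str("moo")(saysRes.ctx); // ...and here
  if (!mooRes.success) return mooRes;
  return success(mooRes.ctx, "cowSaysMoo");
};
```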

Yikes! This is not fun. It’s really easy to pass the wrong context forward to the next parser. Let’s write a function that abstracts this away for us.
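One way to sketch it (the name `sequence` and its exact typing are assumptions; as noted later, typing the mixed result list strictly is tricky, so `any` is used here):

```typescript
type Context = Readonly<{ text: string; index: number }>;
type Success<T> = Readonly<{ success: true; value: T; ctx: Context }>;
type Failure = Readonly<{ success: false; expected: string; ctx: Context }>;
type Result<T> = Success<T> | Failure;
type Parser<T> = (ctx: Context) => Result<T>;
const success = <T>(ctx: Context, value: T): Success<T> => ({ success: true, value, ctx });
const failure = (ctx: Context, expected: string): Failure => ({ success: false, expected, ctx });

// str parses a literal string at the current position.
const str = (match: string): Parser<string> => (ctx) =>
  ctx.text.startsWith(match, ctx.index)
    ? success({ ...ctx, index: ctx.index + match.length }, match)
    : failure(ctx, match);

// sequence runs parsers one after another, threading the Context for us.
// It fails as soon as any child parser fails.
function sequence<T>(parsers: Parser<any>[]): Parser<T[]> {
  return (ctx) => {
    const values: any[] = [];
    let nextCtx = ctx;
    for (const parser of parsers) {
      const res = parser(nextCtx);
      if (!res.success) return res;
      values.push(res.value);
      nextCtx = res.ctx; // the one place the Context gets threaded forward
    }
    return success(nextCtx, values as T[]);
  };
}
```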

Using this combinator, we can express the previous parser in a much more terse manner:
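With a sequencing combinator like the one just described, the whole sentence collapses to a single declaration. A sketch (all helper names are assumptions, repeated here so the snippet stands alone):

```typescript
type Context = Readonly<{ text: string; index: number }>;
type Success<T> = Readonly<{ success: true; value: T; ctx: Context }>;
type Failure = Readonly<{ success: false; expected: string; ctx: Context }>;
type Result<T> = Success<T> | Failure;
type Parser<T> = (ctx: Context) => Result<T>;
const success = <T>(ctx: Context, value: T): Success<T> => ({ success: true, value, ctx });
const failure = (ctx: Context, expected: string): Failure => ({ success: false, expected, ctx });

const str = (match: string): Parser<string> => (ctx) =>
  ctx.text.startsWith(match, ctx.index)
    ? success({ ...ctx, index: ctx.index + match.length }, match)
    : failure(ctx, match);

const sequence = <T>(parsers: Parser<any>[]): Parser<T[]> => (ctx) => {
  const values: any[] = [];
  let nextCtx = ctx;
  for (const parser of parsers) {
    const res = parser(nextCtx);
    if (!res.success) return res;
    values.push(res.value);
    nextCtx = res.ctx;
  }
  return success(nextCtx, values as T[]);
};

// The manual threading from before, now declarative.
const cowSaysMoo = sequence<string>([str("cow"), str(" says "), str("moo")]);
```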

To write our little language, we’re going to need a handful more combinators that we’ll compose together to create a parser that supports our grammar.

Here’s one called `any` that takes an array of parsers, and tries them all until one succeeds.

Here’s another called `many` that takes a single parser and applies it repeatedly, gathering as many matches as it can.
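A sketch (again with assumed names; note that zero matches still counts as a success with an empty list):

```typescript
type Context = Readonly<{ text: string; index: number }>;
type Success<T> = Readonly<{ success: true; value: T; ctx: Context }>;
type Failure = Readonly<{ success: false; expected: string; ctx: Context }>;
type Result<T> = Success<T> | Failure;
type Parser<T> = (ctx: Context) => Result<T>;
const success = <T>(ctx: Context, value: T): Success<T> => ({ success: true, value, ctx });
const failure = (ctx: Context, expected: string): Failure => ({ success: false, expected, ctx });

const str = (match: string): Parser<string> => (ctx) =>
  ctx.text.startsWith(match, ctx.index)
    ? success({ ...ctx, index: ctx.index + match.length }, match)
    : failure(ctx, match);

// many applies a parser until it fails, collecting every value.
// It never fails itself: zero repetitions is an empty success.
const many = <T>(parser: Parser<T>): Parser<T[]> => (ctx) => {
  const values: T[] = [];
  let nextCtx = ctx;
  for (;;) {
    const res = parser(nextCtx);
    if (!res.success) return success(nextCtx, values);
    values.push(res.value);
    nextCtx = res.ctx;
  }
};
```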

Below are all the combinators we’ll need for our language. The full implementation of each function is included at the end of the article, but understanding the intent of each should be enough to move forward.

## Implementing the grammar

Here’s our basic plan for the language again.

```
program = expr
expr    = call | number
call    = ident '(' [ argList ] ')'
argList = expr (',' expr)*
number  = ... whatever number format we decide to support
ident   = ... idk, how about /[a-zA-Z][a-zA-Z0-9]*/
```

Let’s turn it into a working parser using the combinators at our disposal.

Finally, running this code:

The result is an implementation of our grammar that is very terse (about 60 lines with comments and spacing), and is backed up by about 120 lines of types and supporting combinator code, with no external dependencies. And because we wrote it by hand, we can easily extend it to support arbitrary extra functionality like annotations, error recovery, etc.

## Final thoughts

All parsing approaches have pros and cons, and so do combinators:

PROS

• Simple contract
• Expressive, declarative
• Functional, immutable, composable
• Actual language grammar implementation can be terse, maps closely to EBNF for the language
• Doesn’t rely on exception handling
• Backtracking is trivially easy, doesn’t require managing a stack

CONS

• Introduces a layer of indirection / abstraction
• Requires a language with first class functions
• The type of the `sequence` combinator may be difficult to express strictly enough. It takes an array of parsers with mixed return types and returns a list of results mapping directly to the outputs of those parsers, which is tricky to enforce in a type system. This can be overcome in TypeScript with function overloading, but it’s outside the scope of this article, so I opted to use `any` and trust the ordering of parsers.

The full source code for this blog post, including all the combinators used, is available on GitHub.