Parsing Lookup Expressions using Chevrotain

It turns out that writing a parser is... surprisingly easy with the right tool!

Properties

AuthorBraden MacDonald (neolace lead developer)
Date2022-08-11
CategoriesDevelopment Blog, Things we ❤️
What we loveChevrotain

Background

With Neolace, our goal is to put the power of graph-based insights into everyone's hands. And to do that, we're creating a lot of easy to use visual tools for working with connected data, such as our graph visualization tools. But sometimes users need a more powerful approach. Our take on that is inspired by Excel. In Excel, you can write formulas like "= B4 + C4" to compute data in tables, and see instantly updated results whenever the source data changes.

Likewise, Neolace has lookups which will look up data from the Neolace knowledge graph and display the current value. For example, say I want to write about when this blog post ("Parsing Lookup Expressions using Chevrotain") was published: on 2022-08-11.

To write something like the above paragraph, I shouldn't write out the name of the blog post, nor the date; after all, that might change by the time I'm done writing the post (or even afterward), and I don't want to have to edit the post in that case. Instead, I would write the above sentence like this:

I want to write about when this blog post ("{ this.name }") was published: on { this.get(prop=Date) }.

The parts in bold are lookups. As you can see, when the blog post is shown to users, those lookups get evaluated and replaced with the current data. In this way, lookups help us keep content consistent by always referencing one "source of truth", instead of duplicating information in several places.[1] Neolace also uses lookups to implement property inference, search, autocompletions, and much more - which are topics for future posts.

The original "fake" parser

In order to use lookups, first Neolace has to parse them - to convert them from their string representation to the underlying implementation classes that know how to retrieve data from the knowledge graph.

I had always assumed that writing a full, "correct" parser for lookups would be a difficult task. So at first I just wrote what I called the "simple/fake parser", which basically used regular expression matching, recursion, and a lot of exception catching to more or less be able to parse any basic lookup expression.

To give you a feel for how this works, The job of the parser is to take an expression like

"this.andAncestors().andRelated(depth=3).graph()"

and turn it into a TypeScript object representing the expression to be evaluated, like this:

new Graph(
    new AndRelated(
        new AndAncestors(
            new This()
        ),
        {
            depth: new LiteralExpression( new IntegerValue(3) )
        },
    )
);

As you'll notice, the order of the expressions becomes inverted when parsed, because this / new This() has to be evaluated first, and in TypeScript that means that it should be the innermost expression[2].

For those who are curious, here is the code of the original parser (with some repetitive parts omitted). Skip this unless you're interested; the details are not important here.

export function parseLookupString(lookup: string): LookupExpression {

    lookup = lookup.trim();

    if (lookup === "null") return new LiteralExpression(new NullValue());
    if (lookup === "true") return new LiteralExpression(new BooleanValue(true));
    if (lookup === "false") return new LiteralExpression(new BooleanValue(false));
    if (lookup === "this") return new This();

    const otherTemplates: [RegExp, (match: RegExpMatchArray) => LookupExpression][] = [
        // "foo" (String literal)
        [/^"(.*)"$/, (_m) => new LiteralExpression(new StringValue(JSON.parse(lookup)))],
        // 123 (Integer literal, supports bigints)
        [/^\d+$/, (_m) => new LiteralExpression(new IntegerValue(BigInt(lookup)))],

        // something.function()
        [
            /^(.+)\.([A-za-z0-9_]+)\([ \t]*\)$/,
            (m) => {
                const something = m[1];
                const functionName = m[2];
                const matchedFunction = allFunctions.find((fn) => fn.functionName === functionName);
                if (matchedFunction) {
                    return matchedFunction.constructWithArgs(parseLookupString(something));
                } else {
                    throw new LookupParseError(`Unknown function: ${functionName}()`);
                }
            },
        ],

        // something.function(arg1=arg1Value)
        [
            /^(.+)\.([A-za-z0-9_]+)\([ \t]*([A-za-z0-9_]+)[ \t]*=(.*)\)$/,
            (m) => {
                const something = m[1];
                const functionName = m[2];
                const arg1 = m[3];
                const arg1Value = m[4];
                const matchedFunction = allFunctions.find((fn) => fn.functionName === functionName);
                if (matchedFunction) {
                    return matchedFunction.constructWithArgs(parseLookupString(something), {
                        [arg1]: parseLookupString(arg1Value),
                    });
                } else {
                    throw new LookupParseError(`Unknown function: ${functionName}()`);
                }
            },
        ],
        ...
    ];

    for (const [re, fn] of otherTemplates) {
        const matchResult = lookup.match(re);
        if (matchResult) {
            try {
                return fn(matchResult);
            } catch (err) {
                if (err instanceof LookupParseError) {
                    continue;
                } else if (err instanceof SyntaxError) {
                    // This is triggered by the JSON parser that we use to parse string values;
                    continue;
                } else {
                    throw err;
                }
            }
        }
    }

    if (lookup[0] === "[" && lookup[lookup.length - 1] === "]") {
        // It's a list:
        if (lookup.length === 2) {
            return new List([]);
        }
        const parts = lookup.substring(1, lookup.length - 1).split(",").map((part) => part.trim());
        return new List(parts.map((part) => parseLookupString(part)));
    }

    throw new LookupParseError(`Unable to parse the lookup expression "${lookup}"`);
}

The problem

The nice thing about the "simple/fake" parser was its simplicity: one-shot parsing that covers most commonly wanted lookup expressions and in only ~250 lines of code. This parser actually worked better and lasted longer in the development process than I expected, as it really could parse most things we needed it to for the first year or so of Neolace development. Here are a variety of valid lookup expressions that it was able to parse correctly:

  • this
  • "string literal"
  • ["lists", 123, null, true, false]
  • this.ancestors()
  • this.andAncestors().andRelated(depth=3).graph()
  • entry("img-technotes-logo").image(format="logo", link="https://www.technotes.org/", maxWidth=60)
  • [ entry("tc-ec-cell-li"), entry("tc-ec-cell") ].map(apply=( e -> e.ancestors().count() ))

However, as the lookup language got more and more features, I started needing to use complex lookup expressions that this type of parser simply couldn't handle.

In particular, a certain type of expression would break this parser; anything formed like this:

this.annotate(blah=this.annotate(foo="bar"))

The problem with this type of expression is that it's supposed to be matched by a regular expression which looks for this pattern:

[something].[functionName]([arg]=[value])

So that parser should match it to that regex like this:

something: this, functionName: annotate, arg: blah, value: this.annotate(foo="bar")

But instead it matches:

something: this.annotate(something=this, functionName: annotate, arg: foo, value: "bar")

And then of course when it recursively tries to parse this.annotate(something=this and/or "bar"), it fails because those fragments aren't valid expressions.

Now, it may have be possible to salvage this regex based parser by playing around with greedy vs. lazy matches and adding more complex retry algorithms. But I decided it was time to try writing a real, stateful parser.

An additional motivating factor was the need for better error reporting during parsing. The regex-based parser works by trying a few different match patterns, ignoring any that throw errors, and using the first matched pattern that doesn't throw an error. This recursive try/catch approach meant that some exceptions must be thrown, caught, and ignored during normal parsing. A consequence of that was that the parser essentially ignores any detailed errors that come from inner parts of the expression and just throws an overall "parse failed" if none of the possible ways of parsing something work without an exception. Ideally, the parser would be smarter about what type of code it's currently parsing, and then throw meaningful exceptions for any error that is encountered.

The Solution: Chevrotain

Long ago I had researched different tools for making a parser in TypeScript, and I had already picked the one I wanted to use: Chevrotain. Chevrotain is a "Parser Building Toolkit for JavaScript" that focuses on performance, and has impressive benchmarks compared to every other alternative. Plus, it's a pure TypeScript/JavaScript library that you just use; it's not a parser generator that requires you to add additional build tooling to your project.

Although Chevrotain had always appealed to me, I had assumed it would be a lot of work to create the parser, so I'd been putting off using it until now. I decided to explore writing a Chevrotain-based parser on the upcoming weekend and see how far I could get.

Step 1: Implementing the lexer

When the weekend arrived, I started following the Chevrotain tutorial, and found that the first step would be to define a lexer. The lexer's job is to convert the input from a string to a series of annotated tokens, which then get passed to the parser. For example, the lexer would convert this string:

this.ancestors()

into a list of five tokens like this:

ThisKeyword, Dot, Identifier[3], LeftParen, RightParen

To implement this with Chevrotain, you just have to define a regular expression for each token type, as well as to give each one a name. The documentation explained this very clearly and there were lots of examples to consult.

So I started defining the lexer for lookup expressions, following the examples:

import { createToken, Lexer } from "https://esm.sh/chevrotain@10.1.2/";

/** The name of a function or variable */
export const Identifier = createToken({ name: "Identifier", pattern: /[A-Za-z_][A-Za-z0-9_]*/ });

/** The 'this' keyword */
export const This = createToken({ name: "This", pattern: /this/, longer_alt: Identifier });

export const Dot = createToken({ name: "Dot", pattern: /\./, label: "." });
export const LParen = createToken({ name: "LParen", pattern: /\(/, label: "(" });
export const RParen = createToken({ name: "RParen", pattern: /\)/, label: ")" });
// ... a bunch more symbols and keywords ...

export const lookupTokens = [
    This,
    Dot,
    LParen,
    RParen,
    ...,
    Identifier,
];

export const lookupLexer = new Lexer(lookupTokens, { ensureOptimizations: true });

For brevity, I've omitted quite a few of the other token types, but they're very similar to what's already there.

A couple things are worth pointing out about this code:

  1. The order of lookupTokens is very important. The Identifier token needs to come last, because we want a keyword like this to be matched as a This token, not as an Identifier (variable/function name) called "this". Likewise for true, false, and other keywords that are defined as their own tokens.
  2. But what if a user defines a variable named "thisVariable"? That's a legal variable name, but if we put the This token before Identifier, this will be lexed as [This, Identifier("Variable")] not [Identifier("thisVariable")]. Of course, this is a common problem and Chevrotain has a simple solution: for tokens like This that can be part of a longer identifier, we use the longer_alt: Identifier option when defining the token. This tells the Chevrotain lexer that if the Identifier token also matches and is longer, it should be used instead.

And that was it! I wrote some test cases for the lexer, like this:

test("list of various value types", () => {
    const { tokens, errors } = lookupLexer.tokenize(`[0, -1, 3, "hello", ]`);
    assertEquals(errors, []);
    assertEquals(tokens.map((t) => t.tokenType), [
        T.LSquare,
        T.IntegerLiteral, // 0
        T.Comma,
        T.IntegerLiteral, // -1
        T.Comma,
        T.IntegerLiteral, // 3
        T.Comma,
        T.StringLiteral, // "hello"
        T.Comma,
        T.RSquare,
    ]);
});

Happily, the lexer worked flawlessly pretty much right away (when does that ever happen?). I spent some time to flesh out the test suite for the lexer and then called it done. It was time for the next step.

Step 2: The Parser

Once the lexer is working, the next step is to implement a subclass of Chevrotain's CstParser. This class will do the actual parsing, by taking the tokens that the lexer produced, and converting them to a "Concrete Syntax Tree."

Let's look at the first version of the new parser setion by section.

First, we need to import Chevrotain and the lexer's token definitions:

import { CstNode, CstParser, IToken, ParserMethod } from "https://esm.sh/chevrotain@10.1.2/";
import { lookupTokens } from "./lexer.ts";
import * as T from "./lexer.ts";

Next, we declare our parser as a subclass of CstParser, and pass in the list of lexer tokens that it will be able to parse:

class LookupParser extends CstParser {

    constructor() {
        super(lookupTokens, { recoveryEnabled: true, });

Now, in the constructor of this class is where we define how the actual parsing will work, by writing a series of rules that "consume" tokens. We can think of the parser as going through the list of tokens, matching them against the various rules we've defined, and then letting the matching rule "consume" the tokens, generating the parsed Syntax Tree in the process. Once all the tokens are consumed, parsing is complete.

So the first rule I wrote is the most general rule for a lookup expression, which says that any given expression can be either a function call, a list, or a "value", which we'll define below:

        const $ = this;

        $.RULE("expression", () => {
            $.OR([
                { ALT: () => $.SUBRULE($.functionCall) },
                { ALT: () => $.SUBRULE($.list) },
                { ALT: () => $.SUBRULE($.value) },
            ]);
        });

Next, we define the rule for a basic function call, like foo(), foo(arg1), or foo(arg1, param2=arg2, param3=arg3). It starts with the function name (Identifier token), then a left parenthesis, then optional parameters, then a closing right parenthesis.

        $.RULE("functionCall", () => {
            $.CONSUME(T.Identifier); // the name of the function
            $.CONSUME(T.LParen); // (
            $.OPTION(() => {
                // The first argument to the function. Never named.
                $.SUBRULE($.expression, {LABEL: "firstParameterValue"});
                // Additional parameters - always named:
                $.OPTION2(() => {
                    $.CONSUME(T.Comma);
                    $.MANY_SEP({
                        SEP: T.Comma,
                        DEF: () => {
                            $.CONSUME2(T.Identifier, {LABEL: "otherParamName"});
                            $.CONSUME(T.Equals);
                            $.SUBRULE2($.expression, {LABEL: "otherParamValue"});
                        },
                    });
                });
            });
            $.CONSUME(T.RParen); // )
        });

Notice how we define the parameter values by recursively calling the expression rule that we defined earlier. That's because each parameter value is a completely general expression, and can in turn be a function, list, or any other value.

If you're wondering why some of the function names have a 2 at the end (CONSUME2, SUBRULE2), it's because of how Chevrotain internally keeps its state as it parses; if you call the same function like $.OPTION(() => {...}) or $.SUBRULE($.expression) twice in the same rule, the parser actually can't tell those apart and can't properly maintain its state as it parses (or something like that). So you have to add an arbitrary number to the end of each additional call to those functions that uses the same arguments. What's really great is that Chevrotain holds your hand with this, and if you forget or do anything slightly wrong, it prints very clear and helpful error messages during initialization that link to the documentation and tell you how to fix it 👏. This was wonderful and I was very grateful for the excellent error messages while I worked on this parser.

Next up: a rule for lists, like [1, 2, 3]. Here was my first attempt:

        $.RULE("list", () => {
            $.CONSUME(T.LSquare); // [
            $.MANY_SEP({
                SEP: T.Comma,
                DEF: () => {
                    $.SUBRULE($.expression);
                },
            });
            $.CONSUME(T.RSquare); // ]
        });

OK, that was straightforward. Now a rule for "value":

        $.RULE("value", () => {
            $.OR([
                { ALT: () => $.CONSUME(T.StringLiteral) },
                { ALT: () => $.CONSUME(T.IntegerLiteral) },
                { ALT: () => $.CONSUME(T.This) },
                { ALT: () => $.CONSUME(T.True) },
                { ALT: () => $.CONSUME(T.False) },
                { ALT: () => $.CONSUME(T.Null) },
                { ALT: () => $.CONSUME(T.Identifier) }, // A variable by itself
            ]);
        });

Hopefully that is self explanatory by this point.

Next, we end the constructor code with this critical line:

        // very important to call this after all the rules have been setup.
        // otherwise the parser may not work correctly as it will lack information
        // derived from the self analysis.
        this.performSelfAnalysis();
    }

Because Neolace is written in TypeScript, it was also necessary to add these type definitions within the class declaration, in order to make TypeScript aware of the rules that are being defined in the constructor at runtime:

    // Declare types for our various rules:
    expression!: ParserMethod<[], CstNode>;
    functionCall!: ParserMethod<[], CstNode>;
    list!: ParserMethod<[], CstNode>;
    value!: ParserMethod<[], CstNode>;
}

That's it for our Parser class. Next, we declare a single instance of it to use for parsing:

const parser = new LookupParser();

And we're ready to parse!

export function parse(lookupExpression: string) {
    const lexingResult = T.lookupLexer.tokenize(lookupExpression);
    if (lexingResult.errors.length > 0) {
        throw new LookupParseError(lexingResult.errors[0].message);
    }
    parser.input = lexingResult.tokens; // Assigning to this will reset the parser's internal state.
    const cst = parser.expression();
    if (parser.errors.length > 0) {
        throw new LookupParseError(parser.errors[0].message);
    }
    return cst;
}

Note that this parser doesn't yet return LookupExpression objects; it returns a "Concrete Syntax Tree" that corresponds to the rules we defined. So it will always return a CstNode for the expression rule, and then that node will have CstNode children that are from functionCall, list, or value rules, and they in turn may have CstNode children that are from the expression rule, and so on. Now you could modify the parser to return your application-specific data directly (in this case, that would mean to return LookupExpression objects). However, Chevrotain recommends instead that you separate that logic, so that the parser produces the Concrete Syntax Tree, and a separate Visitor class can convert that Concrete Syntax Tree into your application-specific objects. So that's what I did. Implementing a Visitor is quite straightforward so I'll leave the details for another time.

Once again, I took my best guess at how to write the parser, following the examples in Chevrotain, and when I tested it, the basics were working almost right away 🥳. It parsed simple expressions, emitted a correct Conrete Synatx Tree, and the Visitor class I wrote converted those to LookupExpression objects. Awesome!

Step 3: Dot Functions Edge Case and Left Recursion

Of course, it can't be quite that easy, can it?

I started expanding the parser by adding more detailed rules. In particular, I needed to add a rule for "chained functions". With Neolace lookups, every function call that accepts a first argument can be writted as either function(arg) or arg.function(), and I had only implemented the first way so far.

Naively, I wrote a rule for such "chained" functions like this: start with matching any expression, then match a dot (.), then the function name, and so on:

        // A function call using the alternative syntax, x.foo()
        // or x.foo(bar=blah, bar2=blah2) where x is the first arg
        $.RULE("dotFunctionCall", () => {
            $.SUBRULE($.expression);
            $.CONSUME(T.Dot);
            $.CONSUME(T.Identifier, { LABEL: "functionName" });
            $.CONSUME(T.LParen);
            // Optional additional parameters - always named:
            $.OPTION(() => {
                $.OPTION2(() => {
                    $.CONSUME(T.Comma);
                    $.MANY_SEP({
                        SEP: T.Comma,
                        DEF: () => {
                            $.CONSUME2(T.Identifier, { LABEL: "otherParamName" });
                            $.CONSUME(T.Equals);
                            $.SUBRULE2($.expression, { LABEL: "otherParamValue" });
                        },
                    });
                });
            });
            $.CONSUME(T.RParen);
        });

And I added this new rule to the general rule for expressions:

        $.RULE("expression", () => {
            $.OR([
                { ALT: () => $.SUBRULE($.dotFunctionCall) }, // <<<<<< NEW RULE
                { ALT: () => $.SUBRULE($.functionCall) },
                { ALT: () => $.SUBRULE($.list) },
                { ALT: () => $.SUBRULE($.value) },
            ]);
        });

But now, when I tried to use this parser, it gave an error on initialization:

Error: Parser Definition Errors detected:
 Left Recursion found in grammar.
rule: <expression> can be invoked from itself (directly or indirectly)
without consuming any Tokens. The grammar path that causes this is: 
 expression --> dotFunctionCall --> expression
 To fix this refactor your grammar to remove the left recursion.
see: https://en.wikipedia.org/wiki/LL_parser#Left_factoring.

Wow, what a great error message! Not only does it tell me exactly what's wrong, and where it occurs, but it links to the relevant Wikipedia article which explains how to fix it. Although I found that following the link on that Wikipedia article to a second article on "Removing left recursion" was the part that I really needed, with a clear explanation of the problem and how to solve it:

Left recursion often poses problems for parsers, either because it leads them into infinite recursion ... The general algorithm to remove direct left recursion follows.

Great! So it took me a while to read and understand, but then I just had to follow the recipe from the Wikipedia article in order to rewrite the rules and remove the potentially infinite recursion.

Essentially, to solve this problem, I needded to rewrite the rules so that each rule always consumes at least one token before calling subrules that can in turn call the first rule again.

The trick is to make the dotFunctionCall rule start with the ., and consuming it, and then optionally recursively calling itself to handle long chains of function calls:

$.RULE("dotFunctionCall", () => {
    // No longer starts with $.SUBRULE(expression); now starts with dot:
    $.CONSUME(T.Dot);
    // This middle part is unchanged:
    $.CONSUME(T.Identifier, { LABEL: "functionName" });
    $.CONSUME(T.LParen);
    $.OPTION(() => {
        // Additional parameters - always named:
        $.MANY_SEP({
            SEP: T.Comma,
            DEF: () => {
                $.CONSUME2(T.Identifier, { LABEL: "otherParamName" });
                $.CONSUME(T.Equals);
                $.SUBRULE2($.expression, { LABEL: "otherParamValue" });
            },
        });
    });
    $.CONSUME(T.RParen);

    // NEW: Handle the case where this .function() is immediately followed by another .function():
    $.OPTION2(() => {
        $.SUBRULE($.dotFunctionCall, { LABEL: "chainedExpression" });
    });
});

And then the overall expression rule needs to incorporate this dotFunctionCall rule a bit differently:

$.RULE("expression", () => {
    // An expression starts with one of these three things:
    $.OR([
        { ALT: () => $.SUBRULE($.functionCall) },
        { ALT: () => $.SUBRULE($.list) },
        { ALT: () => $.SUBRULE($.value) },
    ]);
    // And may optionally be followed by a chained function call:
    $.OPTION(() => $.SUBRULE($.dotFunctionCall));
});

Once I figured that out, and updated the Visitor accordingly, everything was again working well and now the parser could correctly parse expressions like

this.andAncestors().andRelated(depth=3).graph()

Step 4: Trailing Commas

One last improvement that I wanted to make with the new parser was to ignore trailing commas in lists, so that the following were equivalent:

  • [1, 2, 3]
  • [1, 2, 3, ]

The Chevrotain API for "many separated items" that I was using, MANY_SEP, does not support ignoring a trailing separator. So after some trail and error, I landed on this more detailed rule for parsing lists:

$.RULE("list", () => {
    $.CONSUME(T.LSquare); // [

    // The contents of the list are optional, because we need to support empty lists.
    $.OPTION(() => {
        $.SUBRULE1($.expression); // The first value in the list
        $.MANY(() => {
            $.CONSUME(T.Comma);
            $.SUBRULE2($.expression); // Each additional value in the list
        });
        // Optional trailing comma at the very end of the list:
        $.OPTION2(() => { $.CONSUME2(T.Comma); });
    });

    $.CONSUME(T.RSquare); // ]
});

Step 5: Profit

That was it! In a single weekend, I had completely rewritten the parser using Chevrotain. It can now handle any valid lookup expression, it has much better error reporting when users provide invalid input, its code is much cleaner and easier to maintain, and it's probably faster too. I'm very grateful to the folks who created Chevrotain - it's really a fantasic piece of software that I highly recommend.

If you want to see the new parser in action, head to the lookup page and follow the instructions. Or click here to see a specific complex example - a list of the "Things we Love" and the blog posts that mention them.

© Copyright 2022 MacDonald Thoughtstuff Inc. All documentation text and multimedia on this site are licensed under the CC BY-SA 4.0 license.