madhadron

Parser combinators for HTML

Status: Done
Confidence: Very likely

Parser combinators are an idea that gained wide use in Haskell for specifying parsers declaratively. They’re powerful enough where, while Haskell has good regex libraries, almost no one uses them since writing a real parser is easier.

Usually parser combinators are used for parsing raw text or binary streams, but much of the data we deal with in the connectivity teams is structured JSON, HTML, or XML that we want to pull information out of. So let’s try building parser combinators to declare how to pull bits of data out of HTML.

We’ll start with using cheerio to select data out of HTML. This is what we’re already using in PDaaS.

import * as cheerio from 'cheerio';

To form combinators, we need to construct an algebra that lets us compose existing functions for extracting data into new ones. I’ll invoke the design principles from page 1 of Structure and Interpretation of Computer Programs:

when we describe a language, we should pay particular attention to the means that the language provides for combining simple ideas to form more complex ideas. Every powerful language has three mechanisms for accomplishing this:

A parser function for our purposes is one that takes a hunk of HTML and returns a parsed value. With cheerio, we need to pass two arguments, the document and the root element we should work relative to, so a parser has the form

($, root) => ...

A few simple example parsers:

// Get the text representation of the contents 
// of the element referred to by root.
const text = ($, root) => root.text().trim();

// Get the number of child elements of the element
// referred to by root.
const childCount = ($, root) => $(root).childNodes.length;

// Get all ul elements underneath the element
// referred to by root and return them as an array
// where each element can be used as the root argument
// to another parser.
const ulsUnder = ($, root) => {
    const uls = $(root).find('ul');
    let results = [];
    // cheerio has a peculiar notion of collection
    // that lets you ignore the difference between
    // a single element returned and multiple elements
    // returned. The map function on cheerio objects
    // returns another of that strange collection,
    // so we have to explicitly construct the array
    // we want.
    uls.map((i, el) => {
        const el$ = $(el);
        results.push(el$);
    });
    return results;
};

To actually run these, we write a function that takes the parser and some data and does the appropriate munging and plumbing:

const parse = (parser, data: string) => {
    const $ = cheerio.load(data);
    const root = $('html');
    return parser($, root);
}

These are not particularly reusable pieces that we can compose. They’re the primitive expressions in the first bullet point from SICP. Bullet two requires that we have a way of taking one or more parsing functions and producing a new parsing function with the same basic signature. Bullet three is trivial for us, since we’re writing functions that we bind to variables.

The simplest example is that we would like to be able to apply a function to the value produced by a parser.

const apply$ = (fn, parser) => ($, el) => {
    const v = parser($, el);
    const r = fn(v);
    return r;
}

Say we have an HTML document

<!DOCTYPE html>
<p id=count>42

We can get the count with

const countText = ($, root) => $('p#count').text();
const countParser = apply$(parseInt, countText);

const html = `
    <!DOCTYPE html>
    <p id=count>42
`;

parse(countParser, html);
// 42

Writing little functions like countText over and over again is onerous, so we can generalize it

const many$ = (path: string, fn) => ($, root) => {
    const els = $(root).find(path);
    let results = [];
    els.map((i, el) => {
        const el$ = $(el);
        const result = fn($, el$);
        results.push(result);
    });
    return results;
};

If we have a ul element where each li element under it contains counts that we want to parse, we can write

const html = `
    <!DOCTYPE html>
    <ul id=counts>
        <li>12
        <li>15
        <li>21
        <li>3
    </ul>
`;

const text = ($, root) => root.text();
const countsParser = many$(
    'ul#counts > li', 
    apply$(parseInt, text)
);

let counts = parse(countsParser, html);
// [12, 15, 21, 3]

Using apply$ here is probably less clear than writing a little parser function by hand in practice, such as

const countsParser = many$(
    'ul#counts > li', 
    (_, el) => parseInt(el.text())
);

Similarly, if we want have an element like <li>$34.55 and we want to pull out the amount without the currency marker, we would probably just write

(_, el) => parseInt(el.text().slice(1))

rather than trying to fancily assemble combinators. It’s when we get past the initial, atomic bits that they start to really shine.

For example, sometimes we want to get exactly one element that matches a CSS selector, and if we get more than one it’s a parse error:

const one$ = (path: string, fn) => ($, root) => {
    const els = $(root).find(path);
    if (els.length != 1) {
        throw Error(`selectOne found ${els.length} elements matching selector '${path}'; expected one.`);
    }
    return fn($, els);
};

We could also write a function that does validation that we call the same way as apply$:

const check$ = (validator, parser) => ($, root) => {
    const v = parser($, root);
    validator(v);
    return v;
};

Say we expect exactly one count, with a non-negative value in a ul element.

const html = `
    <!DOCTYPE html>
    <ul id=counts>
        <li>45
    </ul>
`;

const assertNonNegative = (v) => {
    if (v < 0) {
        throw Error(`Count must be nonnegative, found ${v}`)
    }
};

const positiveSingleCountsParser = check$(
    assertNonNegative,
    apply$(
        parseInt, 
        one$(
            'ul#counts > li', 
            (_, el) => el.text()
        )
    )
);

parse(positiveSingleCountsParser, html);
// 45

Another handy combinator lets us concatenate results of a many$ parse:

const concat = (vs) => vs[0].concat(...vs.slice(1));

const html = `
    <!DOCTYPE html>
    <ul>
        <li>12
        <li>44
        <li>62
    </ul>
    <ul>
        <li>5
        <li>500
        <li>1024
    </ul>
`;

const allCountsParser = apply$(
    concat,
    many$(
        'ul', 
        many$(
            'li', 
            (_, el) => parseInt(el.text())
        )
    )
);

parse(allCountsParser, html);
// [12, 44, 62, 5, 500, 1024]

We also often want to combine various bits of data from an HTML document into a new JavaScript object. We can write a function struct$ that takes an object with parsers as values that will produce an object with the same keys and parsed values from those parsers.

const struct$ = (obj) => ($, root) => {
    let result = {};
    for (const key in obj) {
        const fn = obj[key];
        result[key] = fn($, root);
    }
    return result;
}

Say we have a table of names, email addresses, and phone numbers that we want to parse into an array of objects.

const html = `
    <!DOCTYPE html>
    <table id=people>
    <tr><th>Name
        <th>Address
        <th>Phone number
    <tr><td>Boris Gudenov
        <td>54 1st St
        <td>123-456-7890
    <tr><td>Edward Vabarond
        <td>The Oaks, 223 E Blumbledowns
        <td>55 123 99999
    </table>
`;

const parseRow = struct$({
    name: one$('td:nth-child(1)', text),
    address: one$('td:nth-child(2)', text),
    phoneNumber: one$('td:nth-child(3)', text),
});

const parseTable = one$(
    'table#people', 
    many$(
        'tr:not(:first-child)', 
        parseRow
    )
);

console.log(parse(parseTable, html));
// [
//   {
//     name: 'Boris Gudenov',
//     address: '54 1st St',
//     phoneNumber: '123-456-7890'
//   },
//   {
//     name: 'Edward Vabarond',
//     address: 'The Oaks, 223 E Blumbledowns',
//     phoneNumber: '55 123 99999'
//   }
// ]

Successive children like this are a common pattern, so we may want a combinator for them, too.

const row$ = (path, fields) => ($, root) => {
    const els = $(root).find(path);
    let result = {}
    els.map((i, el) => {
        const [key, parser] = fields[i];
        const el$ = $(el);
        result[key] = parser($, el$);
    });
    return result;
};

which we can use in the previous example as

const parseRow = row$('td', [
    ['name', text],
    ['address', text],
    ['phoneNumber', text],
]);

As a final summation, let’s take a sizeable document. We will take a report of library card holders, and extract their name, address, phone number (in a standard format), and all the books they have checked out, along with their due dates.

const html = `
    <!DOCTYPE html>
    <div class=patron>
        <dl>
            <dt>Name
            <dd>Boris Gudenov
            <dt>Address
            <dd>54 1st St, New York, NY 10021
            <dt>Phone number
            <dd>551-243-9102
        </dl>
        <table>
            <tr><td>Aesop's Fables
                <td>2022-11-05
            <tr><td>Pride & Prejudice
                <td>2023-01-12
        </table>
    </div>
    <div class=patron>
        <dl>
            <dt>Name
            <dd>Edward Vabarond
            <dt>Address
            <dd>The Oaks, 223 E Blumbledowns, East Surrey
            <dt>Phone number
            <dd>205-118-7773
        </dl>
        <table>
            <tr><td>The Inner Game of Tennis
                <td>2023-01-08
            <tr><td>Das Kapital
                <td>2023-01-08
        </table>
    </div>
`;

const phoneNumber$ = check$(
    (v) => {
        if (!v.match(/^\d{3}-\d{3}-\d{4}$/)) {
            throw Error(`The string '${v}' is not a valid phone number`);
        } 
    },
    text
);

const date$ = ($, el) => {
    const s = text($, el);
    return new Date(Date.parse(s));
}

const books$ = one$('table', many$(
    'tr',
    row$(
        'td',
        [['title', text], ['dueDate', date$]]
    )
));

const metadata$ = one$('dl', row$(
    'dd',
    [['name', text], ['address', text], ['phoneNumber', phoneNumber$]]
));

const patrons$ = many$('div.patron', struct$({
    'metadata': metadata$,
    'books': books$,
}));

parse(patrons$, html)
// [
//   {
//     metadata: {
//       name: 'Boris Gudenov',
//       address: '54 1st St, New York, NY 10021',
//       phoneNumber: '551-243-9102'
//     },
//     books: [
//       { 
//         title: "Aesop's Fables", 
//         dueDate: 2022-11-05T00:00:00.000Z 
//       },
//       { 
//         title: 'Pride & Prejudice', 
//         dueDate: 2023-01-12T00:00:00.000Z 
//       }
//     ]
//   },
//   {
//     metadata: {
//       name: 'Edward Vabarond',
//       address: 'The Oaks, 223 E Blumbledowns, East Surrey',
//       phoneNumber: '205-118-7773'
//     },
//     books: [
//       {
//         title: 'The Inner Game of Tennis',
//         dueDate: 2023-01-08T00:00:00.000Z
//       },
//       { 
//         title: 'Das Kapital', 
//         dueDate: 2023-01-08T00:00:00.000Z 
//       }
//     ]
//   }
// ]

Appendix: Collected utility functions

import * as cheerio from 'cheerio';

const parse = (parser, data: string) => {
    const $ = cheerio.load(data);
    const root = $('html');
    return parser($, root);
}

const one$ = (path: string, fn) => ($, root) => {
    const els = $(root).find(path);
    if (els.length != 1) {
        throw Error(`selectOne found ${els.length} elements matching selector '${path}'; expected one.`);
    }
    return fn($, els);
};

const many$ = (path: string, fn) => ($, root) => {
    const els = $(root).find(path);
    let results = [];
    els.map((i, el) => {
        const el$ = $(el);
        const result = fn($, el$);
        results.push(result);
    });
    return results;
};

const struct$ = (obj) => ($, root) => {
    let result = {};
    for (const key in obj) {
        const fn = obj[key];
        result[key] = fn($, root);
    }
    return result;
}

const apply$ = (fn, parser) => ($, el) => {
    const v = parser($, el);
    const r = fn(v);
    return r;
}

const check$ = (validator, parser) => ($, root) => {
    const v = parser($, root);
    validator(v);
    return v;
};

const text = ($, root) => root.text().trim();

const concat = (vs) => vs[0].concat(...vs.slice(1));

const row$ = (path, fields) => ($, root) => {
    const els = $(root).find(path);
    let result = {}
    els.map((i, el) => {
        const [key, parser] = fields[i];
        const el$ = $(el);
        result[key] = parser($, el$);
    });
    return result;
};