11 Aug 2014
This is the second in a series of articles on building a toy browser rendering engine:
This article is about parsing HTML source code to produce a tree of DOM nodes. Parsing is a fascinating topic, but I don’t have the time or expertise to give it the introduction it deserves. You can get a detailed introduction to parsing from any good course or book on compilers. Or get a hands-on start by going through the documentation for a parser generator that works with your chosen programming language.
HTML has its own unique parsing algorithm. Unlike parsers for most programming languages and file formats, the HTML parsing algorithm does not reject invalid input. Instead it includes specific error-handling instructions, so web browsers can agree on how to display every web page, even ones that don’t conform to the syntax rules. Web browsers have to do this to be usable: Since non-conforming HTML has been supported since the early days of the web, it is now used in a huge portion of existing web pages.
I didn’t even try to implement the standard HTML parsing algorithm. Instead I wrote a basic parser for a tiny subset of HTML syntax. My parser can handle simple pages like this:
The following syntax is allowed:
<p>...</p>
id="main"
<em>world</em>
Everything else is unsupported, including:
&
) and CDATA sections<br/>
or <br>
with no closing tag<html:body>
At each stage of this project I’m writing more or less the minimum code needed to support the later stages. But if you want to learn more about parsing theory and tools, you can be much more ambitious in your own project!
Next, let’s walk through my toy HTML parser, keeping in mind that this is just one way to do it (and probably not the best way). Its structure is based loosely on the tokenizer module from Servo’s cssparser library. It has no real error handling; in most cases, it just aborts when faced with unexpected syntax. The code is in Rust, but I hope it’s fairly readable to anyone who’s used similar-looking languages like Java, C++, or C#. It makes use of the DOM data structures from part 1.
The parser stores its input string and a current position within the string. The position is the index of the next character we haven’t processed yet.
We can use this to implement some simple methods for peeking at the next characters in the input:
Rust strings are stored as UTF-8 byte arrays. To go to the next character, we can’t just advance by one byte.
Often we will want to consume a string of consecutive characters. The
consume_while
method consumes characters that meet a given condition, and
returns them as a string. This method’s argument is a function that takes a
char
and returns a bool
.
We can use this to ignore a sequence of space characters, or to consume a string of alphanumeric characters:
Now we’re ready to start parsing HTML. To parse a single node, we look at its first character to see if it is an element or a text node.
In our simplified version of HTML, a text node can contain any character except <
.
An element is more complicated. It includes opening and closing tags, and between them any number of child nodes:
Parsing attributes is pretty easy in our simplified syntax. Until we reach
the end of the opening tag (>
) we repeatedly look for a name followed by =
and then a string enclosed in quotes.
To parse the child nodes, we recursively call parse_node
in a loop until we
reach the closing tag. This function returns a Vec
, which is Rust’s name
for a growable array.
Finally, we can put this all together to parse an entire HTML document into a DOM tree. This function will create a root node for the document if it doesn’t include one explicitly; this is similar to what a real HTML parser does.
That’s it! The entire code for the robinson HTML parser. The whole thing weighs in at just over 100 lines of code (not counting blank lines and comments). If you use a good library or parser generator, you can probably build a similar toy parser in even less space.
Here are a few alternate ways to try this out yourself. As before, you can choose one or more of them and ignore the others.
Build a parser (either “by hand” or with a library or parser generator) that takes a subset of HTML as input and produces a tree of DOM nodes.
Modify robinson’s HTML parser to add some missing features, like comments. Or replace it with a better parser, perhaps built with a library or generator.
Create an invalid HTML file that causes your parser (or mine) to fail. Modify the parser to recover from the error and produce a DOM tree for your test file.
If you want to skip parsing completely, you can build a DOM tree programmatically instead, by adding some code like this to your program (in pseudo-code; adjust it to match the DOM code you wrote in Part 1):
Or you can find an existing HTML parser and incorporate it into your program.
The next article in this series will cover CSS data structures and parsing.