Matt Brubeck

11 Aug 2014

Let's build a browser engine!

Part 2: HTML

This is the second in a series of articles on building a toy browser rendering engine:

This article is about parsing HTML source code to produce a tree of DOM nodes. Parsing is a fascinating topic, but I don’t have the time or expertise to give it the introduction it deserves. You can get a detailed introduction to parsing from any good course or book on compilers. Or get a hands-on start by going through the documentation for a parser generator that works with your chosen programming language.

HTML has its own unique parsing algorithm. Unlike parsers for most programming languages and file formats, the HTML parsing algorithm does not reject invalid input. Instead it includes specific error-handling instructions, so web browsers can agree on how to display every web page, even ones that don’t conform to the syntax rules. Web browsers have to do this to be usable: Since non-conforming HTML has been supported since the early days of the web, it is now used in a huge portion of existing web pages.

A Simple HTML Dialect

I didn’t even try to implement the standard HTML parsing algorithm. Instead I wrote a basic parser for a tiny subset of HTML syntax. My parser can handle simple pages like this:

<html>
    <body>
        <h1>Title</h1>
        <div id="main" class="test">
            <p>Hello <em>world</em>!</p>
        </div>
    </body>
</html>

The following syntax is allowed:

Everything else is unsupported, including:

At each stage of this project I’m writing more or less the minimum code needed to support the later stages. But if you want to learn more about parsing theory and tools, you can be much more ambitious in your own project!

Example Code

Next, let’s walk through my toy HTML parser, keeping in mind that this is just one way to do it (and probably not the best way). Its structure is based loosely on the tokenizer module from Servo’s cssparser library. It has no real error handling; in most cases, it just aborts when faced with unexpected syntax. The code is in Rust, but I hope it’s fairly readable to anyone who’s used similar-looking languages like Java, C++, or C#. It makes use of the DOM data structures from part 1.

The parser stores its input string and a current position within the string. The position is the index of the next character we haven’t processed yet.

struct Parser {
    pos: uint,
    input: String,
}

We can use this to implement some simple methods for peeking at the next characters in the input:

impl Parser {
    /// Read the next character without consuming it.
    fn next_char(&self) -> char {
        self.input.as_slice().char_at(self.pos)
    }

    /// Do the next characters start with the given string?
    fn starts_with(&self, s: &str) -> bool {
        self.input.as_slice().slice_from(self.pos).starts_with(s)
    }

    /// Return true if all input is consumed.
    fn eof(&self) -> bool {
        self.pos >= self.input.len()
    }

    // ...
}

Rust strings are stored as UTF-8 byte arrays. To go to the next character, we can’t just advance by one byte. Instead we use char_range_at which correctly handles multi-byte characters. (If our string used fixed-width characters, we could just increment pos.)

/// Return the current character, and advance to the next character.
fn consume_char(&mut self) -> char {
    let range = self.input.as_slice().char_range_at(self.pos);
    self.pos = range.next;
    return range.ch;
}

Often we will want to consume a string of consecutive characters. The consume_while method consumes characters that meet a given condition, and returns them as a string:

/// Consume characters until `test` returns false.
fn consume_while(&mut self, test: |char| -> bool) -> String {
    let mut result = String::new();
    while !self.eof() && test(self.next_char()) {
        result.push(self.consume_char());
    }
    return result;
}

We can use this to ignore a sequence of space characters, or to consume a string of alphanumeric characters:

/// Consume and discard zero or more whitespace characters.
fn consume_whitespace(&mut self) {
    self.consume_while(|c| c.is_whitespace());
}

/// Parse a tag or attribute name.
fn parse_tag_name(&mut self) -> String {
    self.consume_while(|c| match c {
        'a'...'z' | 'A'...'Z' | '0'...'9' => true,
        _ => false
    })
}

Now we’re ready to start parsing HTML. To parse a single node, we look at its first character to see if it is an element or a text node. In our simplified version of HTML, a text node can contain any character except <.

/// Parse a single node.
fn parse_node(&mut self) -> dom::Node {
    match self.next_char() {
        '<' => self.parse_element(),
        _   => self.parse_text()
    }
}

/// Parse a text node.
fn parse_text(&mut self) -> dom::Node {
    dom::text(self.consume_while(|c| c != '<'))
}

An element is more complicated. It includes opening and closing tags, and between them any number of child nodes:

/// Parse a single element, including its open tag, contents, and closing tag.
fn parse_element(&mut self) -> dom::Node {
    // Opening tag.
    assert!(self.consume_char() == '<');
    let tag_name = self.parse_tag_name();
    let attrs = self.parse_attributes();
    assert!(self.consume_char() == '>');

    // Contents.
    let children = self.parse_nodes();

    // Closing tag.
    assert!(self.consume_char() == '<');
    assert!(self.consume_char() == '/');
    assert!(self.parse_tag_name() == tag_name);
    assert!(self.consume_char() == '>');

    return dom::elem(tag_name, attrs, children);
}

Parsing attributes is pretty easy in our simplified syntax. Until we reach the end of the opening tag (>) we repeatedly look for a name followed by = and then a string enclosed in quotes.

/// Parse a single name="value" pair.
fn parse_attr(&mut self) -> (String, String) {
    let name = self.parse_tag_name();
    assert!(self.consume_char() == '=');
    let value = self.parse_attr_value();
    return (name, value);
}

/// Parse a quoted value.
fn parse_attr_value(&mut self) -> String {
    let open_quote = self.consume_char();
    assert!(open_quote == '"' || open_quote == '\'');
    let value = self.consume_while(|c| c != open_quote);
    assert!(self.consume_char() == open_quote);
    return value;
}

/// Parse a list of name="value" pairs, separated by whitespace.
fn parse_attributes(&mut self) -> dom::AttrMap {
    let mut attributes = HashMap::new();
    loop {
        self.consume_whitespace();
        if self.next_char() == '>' {
            break;
        }
        let (name, value) = self.parse_attr();
        attributes.insert(name, value);
    }
    return attributes;
}

To parse the child nodes, we recursively call parse_node in a loop until we reach the closing tag:

/// Parse a sequence of sibling nodes.
fn parse_nodes(&mut self) -> Vec<dom::Node> {
    let mut nodes = Vec::new();
    loop {
        self.consume_whitespace();
        if self.eof() || self.starts_with("</") {
            break;
        }
        nodes.push(self.parse_node());
    }
    return nodes;
}

Finally, we can put this all together to parse an entire HTML document into a DOM tree. This function will create a root node for the document if it doesn’t include one explicitly; this is similar to what a real HTML parser does.

/// Parse an HTML document and return the root element.
pub fn parse(source: String) -> dom::Node {
    let mut nodes = Parser { pos: 0u, input: source }.parse_nodes();

    // If the document contains a root element, just return it. Otherwise, create one.
    if nodes.len() == 1 {
        nodes.swap_remove(0).unwrap()
    } else {
        dom::elem("html".to_string(), HashMap::new(), nodes)
    }
}

That’s it! The entire code for the robinson HTML parser. The whole thing weighs in at just over 100 lines of code (not counting blank lines and comments). If you use a good library or parser generator, you can probably build a similar toy parser in even less space.

Exercises

Here are a few alternate ways to try this out yourself. As before, you can choose one or more of them and ignore the others.

  1. Build a parser (either “by hand” or with a library or parser generator) that takes a subset of HTML as input and produces a tree of DOM nodes.

  2. Modify robinson’s HTML parser to add some missing features, like comments. Or replace it with a better parser, perhaps built with a library or generator.

  3. Create an invalid HTML file that causes your parser (or mine) to fail. Modify the parser to recover from the error and produce a DOM tree for your test file.

Shortcuts

If you want to skip parsing completely, you can build a DOM tree programmatically instead, by adding some code like this to your program (in pseudo-code; adjust it to match the DOM code you wrote in Part 1):

// <html><body>Hello, world!</body></html>
let root = element("html");
let body = element("body");
root.children.push(body);
body.children.push(text("Hello, world!"));

Or you can find an existing HTML parser and incorporate it into your program.

The next article in this series will cover CSS data structures and parsing.

Comments

blog comments powered by Disqus