Parsing the HTML inside the browser!

With this article, we will start a new series on this blog. We will try to explain and simplify how the browser builds a webpage. Specifically, we will explain how the process of browser parsing works.

If we visit any webpage with the Developer Tools open and scroll all the way to the top of the request list, we will see something interesting. Most of the time, the first thing the browser receives is the HTML document, which serves as the mold on top of which the page's UI is built. (You can recognize an HTML response by its Content-Type header, which will have a value such as text/html; charset=UTF-8. The header describes the media type of the response and, optionally, its character encoding.)
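To make the two parts of that header concrete, here is a minimal Python sketch that splits a Content-Type value into its media type and character encoding. The helper name describe_content_type is our own, made up for illustration; it relies on the header parsing in Python's standard email package and makes no network request.

```python
from email.message import EmailMessage
from typing import Optional, Tuple


def describe_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header value into (media type, charset)."""
    msg = EmailMessage()
    msg["Content-Type"] = header_value
    # get_content_type() returns the lowercased "type/subtype" part,
    # get_content_charset() returns the lowercased charset parameter or None.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = describe_content_type("text/html; charset=UTF-8")
print(media_type)  # text/html
print(charset)     # utf-8
```

If the header carries no charset parameter, the second value is None, and the browser has to work out the encoding some other way.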

The HTML response in this example was received from the thedukh.com website. Keep in mind that WordPress includes a lot of additional scripts. On a simpler page, we might only get pure HTML as a response.

So what does the browser do when it receives the HTML code? It parses the code and builds the DOM as it goes. First, the parser splits the incoming characters into tokens such as start tags, end tags, attributes and text. This step is called tokenization, or lexical analysis, and it is the first stage of parsing for almost any programming language, because tokens are much easier to work with than raw characters. The tokens are then handed to a tree-construction step, which creates a node for each HTML element, such as div, p or span, and inserts it into the DOM. These nodes are the objects that JavaScript later works with.
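To get a feel for what tokens look like, here is a small sketch that uses the HTMLParser class from Python's standard library. It is not the tokenizer a browser uses, only a stand-in that reports the same kinds of events. The names TokenCollector and tokenize are our own.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Records every start tag, end tag and text run it is told about."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_decl(self, decl):
        # Called for declarations such as <!doctype html>
        self.tokens.append(("doctype", decl))

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags to keep the output short
        if data.strip():
            self.tokens.append(("text", data.strip()))


def tokenize(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


for token in tokenize("<div><h1>Example Domain</h1></div>"):
    print(token)
```

Each printed tuple is one token: two start tags, one piece of text, and two end tags, in the order they appear in the markup.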

Our little parsing example

<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <style type="text/css">
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
HTML Parsing Schematics

Look at the visualization and notice how the nodes are organized to mirror the structure of the HTML. Also notice that every element has exactly one parent. The html element sits at the top, and its own parent is the document node, the root of the tree, which has no parent at all. HTML and the DOM look quite similar, and you can think of HTML as the blueprint that the browser follows when constructing the DOM. The structure of the DOM is tree-like, with elements nested inside each other just as their tags are nested in the markup.
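The tree shape can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the tree-construction algorithm from the HTML Standard, which handles many more cases. The Node, TreeBuilder, build_tree and outline names are our own, and the list of void elements is intentionally incomplete.

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag (partial list, for illustration)
VOID_ELEMENTS = {"br", "img", "input", "link", "meta", "hr"}


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent          # exactly one parent, or None for the root
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root      # the element we are currently inside

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        if tag not in VOID_ELEMENTS:
            self.current = node       # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, then step out of it
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            Node("#text", parent=self.current)


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


root = build_tree("<html><body><div><p>Hi</p></div></body></html>")
print("\n".join(outline(root)))
```

The indentation of the printed outline follows the nesting of the tags, and every node except #document has a parent.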

The browser provides the DOM so that it can render the webpage and so that scripts can inspect and manipulate its elements. HTML parsing is also very forgiving: more often than not, if we make a mistake in the markup, the parser will not fail. Instead, it silently recovers from the error and fixes the document by itself. Of course, it is still best to write a well-formed document, since the repaired structure may not be the one we intended.

The browser can also encounter elements that point to other resources, for example non-blocking resources such as images. If the browser encounters an image, it will simply request it and continue parsing the HTML file. A script tag is different: the parser will pause when it encounters a script, unless the script has an async or defer attribute. We will stop here and continue in the next part :).
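As a rough illustration of that rule, the sketch below walks through some markup and marks, for each image or script, whether the parser would have to pause. It is a simplification under our own assumptions: it only considers classic scripts (not type="module"), it treats async and defer as effective only when a src attribute is present, and the ResourceScanner and scan names are made up for this example.

```python
from html.parser import HTMLParser


class ResourceScanner(HTMLParser):
    """Collects (tag, blocks_parser) pairs for images and scripts."""

    def __init__(self):
        super().__init__()
        self.report = []

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if tag == "img":
            # Images are requested in the background; parsing continues
            self.report.append(("img", False))
        elif tag == "script":
            external = "src" in names
            non_blocking = external and ("async" in names or "defer" in names)
            self.report.append(("script", not non_blocking))


def scan(html):
    scanner = ResourceScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.report


page = (
    '<img src="cover.png">'
    '<script src="app.js"></script>'
    '<script src="analytics.js" defer></script>'
)
for tag, blocks in scan(page):
    print(tag, "pauses the parser" if blocks else "does not pause the parser")
```

Only the plain script tag is reported as pausing the parser; the image and the deferred script are not.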

Final Words

We started talking about HTML parsing, and we mentioned two important concepts: tokenization and the DOM. In the next article, we will continue explaining how the browser works by looking at scripts.
