The Tokenization Algorithm: Hello World

The document describes a tokenization algorithm expressed as a state machine that consumes characters from an HTML input stream and updates its state. The algorithm's output is an HTML token, and its decision on the next state is influenced by both the current tokenization state and the tree construction state, meaning the same character can yield different results depending on the current overall state.

Uploaded by

Sudama Khatri
Copyright
© Attribution Non-Commercial (BY-NC)

The tokenization algorithm

The algorithm's output is an HTML token. The algorithm is expressed as a state machine.

Each state consumes one or more characters of the input stream and selects the next state according to those characters. The decision is influenced by the current tokenization state and by the tree construction state. This means the same consumed character can lead to a different next state, depending on the current state. The algorithm is too complex to describe in full, so let's look at a simple example that will help us understand the principle.

Basic example - tokenizing the following HTML:
<html> <body> Hello world </body> </html>
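
To make the state machine idea concrete, here is a minimal sketch in Python that tokenizes the example above. It is an illustration under simplifying assumptions, not the full algorithm: it models only four states (data, tag open, end tag open, tag name), ignores attributes, comments and character references, and leaves out the influence of the tree construction state. The names `State` and `tokenize` and the `(kind, value)` token tuples are hypothetical, chosen for this example.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()          # plain text between tags
    TAG_OPEN = auto()      # just consumed '<'
    END_TAG_OPEN = auto()  # just consumed '</'
    TAG_NAME = auto()      # reading the tag name until '>'


def is_ascii_letter(ch):
    return ch.isascii() and ch.isalpha()


def tokenize(html):
    """Return (kind, value) tokens for a very small subset of HTML."""
    tokens = []
    state = State.DATA
    name = ""            # tag name being accumulated
    is_end_tag = False   # whether the current tag is an end tag
    i = 0
    while i < len(html):
        ch = html[i]
        i += 1
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN
            else:
                tokens.append(("Character", ch))
        elif state is State.TAG_OPEN:
            if ch == "/":
                state = State.END_TAG_OPEN
            elif is_ascii_letter(ch):
                name, is_end_tag = ch.lower(), False
                state = State.TAG_NAME
            else:
                # Not a tag after all: emit '<' as text and
                # reconsume the current character in the data state.
                tokens.append(("Character", "<"))
                i -= 1
                state = State.DATA
        elif state is State.END_TAG_OPEN:
            if is_ascii_letter(ch):
                name, is_end_tag = ch.lower(), True
                state = State.TAG_NAME
            else:
                # Simplification: silently drop a malformed end tag.
                state = State.DATA
        elif state is State.TAG_NAME:
            if ch == ">":
                kind = "EndTag" if is_end_tag else "StartTag"
                tokens.append((kind, name))
                state = State.DATA
            else:
                # Simplification: no attribute handling, so every
                # character before '>' is treated as part of the name.
                name += ch.lower()
    return tokens


tokens = tokenize("<html> <body> Hello world </body> </html>")
for token in tokens:
    print(token)
```

Note how the same character is handled differently depending on the state: a letter seen in the data state becomes a character token, while the same letter seen in the tag open or tag name state becomes part of a tag name.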
