The Tokenization Algorithm: Hello World
Each state consumes one or more characters of the input stream and selects the next state according to those characters. The decision is influenced by the current tokenization state and by the tree construction state. This means the same consumed character can lead to a different next state, depending on the current state. The algorithm is too complex to describe fully here, so let's look at a simple example that will help us understand the principle.

A basic example is tokenizing the following HTML:
<html> <body> Hello world </body> </html>
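
To make the state-dependent behavior concrete, here is a minimal sketch of a toy tokenizer in Python. It is not the full HTML tokenization algorithm: it models only four states (data, tag open, end tag open, tag name), ignores attributes, comments, and the tree construction state, and the names `State`, `tokenize`, and the `(kind, value)` token tuples are assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()
    END_TAG_OPEN = auto()
    TAG_NAME = auto()


def tokenize(html):
    """Toy tokenizer: handles only text and attribute-less start/end tags."""
    tokens = []
    state = State.DATA
    name = ""        # tag name collected so far
    is_end = False   # True while the current tag is an end tag
    i = 0
    while i < len(html):
        ch = html[i]
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN
            else:
                tokens.append(("Character", ch))
        elif state is State.TAG_OPEN:
            if ch == "/":
                is_end = True
                state = State.END_TAG_OPEN
            elif ch.isalpha():
                is_end = False
                name = ch.lower()
                state = State.TAG_NAME
            else:
                # Not a tag after all: emit "<" as text and
                # re-process this character in the data state.
                tokens.append(("Character", "<"))
                state = State.DATA
                continue
        elif state is State.END_TAG_OPEN:
            if ch.isalpha():
                name = ch.lower()
                state = State.TAG_NAME
            else:
                # Simplification: drop the malformed "</" and this character.
                state = State.DATA
        elif state is State.TAG_NAME:
            if ch == ">":
                kind = "EndTag" if is_end else "StartTag"
                tokens.append((kind, name))
                state = State.DATA
            else:
                # Simplification: everything up to ">" is part of the name.
                name += ch.lower()
        i += 1
    return tokens
```

Calling `tokenize("<html> <body> Hello world </body> </html>")` yields a start tag token for `html`, a start tag token for `body`, one character token per character of the text (including the spaces between tags), and then the end tag tokens for `body` and `html`. Note how the letter `h` produces a character token in the data state but starts a tag name in the tag open state: the same character, a different result, depending on the current state.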