6. Neural Language Models & Tokenization

The document outlines Lecture 6 of COMP 3361 on neural language models, focusing on their overview and tokenization methods. It includes announcements about a virtual office hour and a tutorial on PyTorch for an upcoming lecture. Key topics covered in the lecture include the input/output of neural language models and various tokenization techniques such as byte-pair encoding.


COMP 3361 Natural Language Processing

Lecture 6: Neural language models: overview, tokenization

Spring 2024

Many materials from CSE447@UW (Taylor Sorensen) and COS484@Princeton with special thanks!
Announcements

• The TA will host an online office hour at 4-5 pm today!
• Mainly about Assignment 1
• Book a slot via the link on Slack
• You can also always ask questions on Slack!
• 新年快樂! Happy Chinese New Year.
• We will record a tutorial on PyTorch for next Friday's lecture and upload it to the course website.
• So you don't need to attend in person on Feb 9
• Unless most of you want to have class on the morning of New Year's Eve! :)
2 Lecture 3: Tokenization
Lecture plan

• Neural language models: overview
• Running examples of neural language models
• Byte-pair encoding (BPE) tokenization
• Other tokenization variants

Neural language models: overview

Neural language models: inputs/outputs
• Input: sequences of words (or tokens)
• Output: probability distribution over the next word (token)
$p(x \mid \text{START})$, $p(x \mid \text{START I})$, $p(x \mid \cdots \text{went})$, $p(x \mid \cdots \text{to})$, $p(x \mid \cdots \text{the})$, $p(x \mid \cdots \text{park})$, $p(x \mid \text{START I went to the park .})$

[Figure: a neural network reads "START I went to the park . STOP" and outputs a next-token distribution at every position. Example entries from each distribution:]
• After START: The 3%, When 2.5%, They 2%, …, I 1%, …, Banana 0.1%, …
• After "START I": think 11%, was 5%, went 2%, am 1%, will 1%, like 0.5%, …
• After "… went": to 35%, back 8%, into 5%, through 4%, out 3%, on 2%, …
• After "… to": the 29%, a 9%, see 5%, my 3%, bed 2%, school 1%, …
• After "… the": bathroom 3%, doctor 2%, hospital 2%, store 1.5%, …, park 0.5%, …
• After "… park": and 14%, with 9%, "," 8%, to 7%, …, "." 6%, …
• After "START I went to the park .": I 21%, It 6%, The 3%, There 3%, …, STOP 1%, …

Neural Network

START I went to the park . STOP

Natural Language Processing - CSE 517 / CSE 447 5
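
As a worked illustration of how these per-position distributions combine, the sketch below applies the chain rule to the example sentence: it multiplies the probability that the figure assigns to each actual next token, working in log space to avoid underflow. The function name `sequence_log_prob` is an assumption made for illustration, and the numbers are read off the slide's figure rather than produced by a real model.

```python
import math

# Probability the figure assigns to each actual next token, given its prefix.
next_token_probs = [
    ("I", 0.01),      # p(I | START)
    ("went", 0.02),   # p(went | START I)
    ("to", 0.35),     # p(to | ... went)
    ("the", 0.29),    # p(the | ... to)
    ("park", 0.005),  # p(park | ... the)
    (".", 0.06),      # p(. | ... park)
    ("STOP", 0.01),   # p(STOP | START I went to the park .)
]


def sequence_log_prob(steps):
    """Chain rule in log space: sum of log p(token | prefix) over all positions."""
    return sum(math.log(p) for _, p in steps)


log_p = sequence_log_prob(next_token_probs)
prob = math.exp(log_p)
print(f"log p(sentence) = {log_p:.3f}")
print(f"p(sentence)     = {prob:.3e}")
```

Even for a short, ordinary sentence the product is tiny, which is one reason language-model scores are usually reported as log-probabilities.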


Neural language models
But neural networks take in real-valued vectors, not words…
• Use one-hot or learned embeddings to map from words to vectors!
• Learned embeddings become part of the parameters $\theta$

Neural networks output vectors, not probability distributions…
• Apply the softmax to the outputs!
• What should the size of our output distribution be?
• Same size as our vocabulary $|V|$
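
A minimal sketch of both ideas on this slide, under loud assumptions: the vocabulary is just the eight tokens of the running example, the "network" is a single randomly initialised output matrix `W` standing in for a real model, and the names `embed`, `softmax`, and `next_token_distribution` are hypothetical. It shows that an embedding lookup selects one row of a matrix (the same result as multiplying a one-hot vector by that matrix) and that the softmax turns an output vector of size $|V|$ into a probability distribution.

```python
import math
import random

vocab = ["START", "I", "went", "to", "the", "park", ".", "STOP"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

random.seed(0)
d = 4  # embedding size, chosen arbitrarily for this sketch
# Parameters theta: embedding matrix E (|V| x d) and output matrix W (d x |V|).
# In a trained model these would be learned; here they are random.
E = [[random.uniform(-1, 1) for _ in range(d)] for _ in vocab]
W = [[random.uniform(-1, 1) for _ in vocab] for _ in range(d)]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(token):
    """Embedding lookup: row of E for this token (same as one_hot(token) times E)."""
    return E[token_to_id[token]]


def softmax(logits):
    """Map real-valued scores to probabilities that are positive and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(token):
    """Toy model: embed the last token, project to |V| scores, apply softmax."""
    h = embed(token)
    logits = [sum(h[k] * W[k][j] for k in range(d)) for j in range(len(vocab))]
    return softmax(logits)


probs = next_token_distribution("park")
for tok, p in zip(vocab, probs):
    print(f"{tok:>5}: {p:.3f}")
```

The output list has exactly one entry per vocabulary item, which is why the output layer must have size $|V|$. A real model would condition on the whole prefix rather than only the last token.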


Example: BERT for sentiment classification

Task: [figure]

Data: [figure]

Source for this example (run the code on Colab!): https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

BERT for sentiment classification: overview

BERT for sentiment classification: prediction

Prediction step 1: tokenization

Prediction step 2: input into BERT

Prediction step 3: run BERT to get outputs

Example overview so far

Recapping a sentence’s journey

Slicing the important part

Final BERT output features

Dataset for logistic regression

Prediction step 4: get final predictions

Train a logistic regression classifier

Run the trained logistic regression classifier
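
To make step 4 concrete, here is a rough sketch of training and then running a logistic regression classifier on fixed sentence features. Everything in it is an assumption for illustration: the 3-dimensional `train_features` are made-up stand-ins for BERT output vectors, the labels are toy sentiment labels (1 = positive, 0 = negative), and the plain gradient-descent trainer is a stand-in for whatever library classifier the referenced tutorial uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Batch gradient descent on the average logistic (cross-entropy) loss."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # derivative of the loss w.r.t. the pre-sigmoid score
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (positive) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical stand-ins for per-sentence BERT features and sentiment labels.
train_features = [[0.9, 0.1, 0.3], [0.8, 0.2, 0.1], [0.1, 0.9, 0.4], [0.2, 0.7, 0.2]]
train_labels = [1, 1, 0, 0]

# Train the classifier, then run it on the training sentences and on new ones.
w, b = train_logistic_regression(train_features, train_labels)
train_predictions = [predict(w, b, x) for x in train_features]
new_predictions = [predict(w, b, [0.85, 0.15, 0.2]), predict(w, b, [0.15, 0.8, 0.3])]
print(train_predictions, new_predictions)
```

The language model's weights are never updated in this setup; only the small classifier on top of the extracted features is trained.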


Example overview: BERT for sentiment classification

BERT is an encoder-only language model

Credits: ChatGPT


How about BERT vs. GPT-3?

Credits: ChatGPT


Usage example: GPT-3

https://jalammar.github.io/how-gpt3-works-visualizations-animations/


Neural language models: tokenization

Neural language models: inputs/outputs
• Input: sequences of words (or tokens)
• Output: probability distribution over the next word (token)
$p(x \mid \text{START})$, $p(x \mid \text{START I})$, $p(x \mid \cdots \text{went})$, $p(x \mid \cdots \text{to})$, $p(x \mid \cdots \text{the})$, $p(x \mid \cdots \text{park})$, $p(x \mid \text{START I went to the park .})$

[Figure, as in the overview: a neural network reads "START I went to the park . STOP" and outputs a next-token distribution at every position.]


Neural language models: input vectors
But neural networks take in real-valued vectors, not words…
• Use one-hot or learned embeddings to map from words to vectors!
• Learned embeddings become part of the parameters $\theta$


Tokenization to input vectors
$p(x \mid \text{START})$, $p(x \mid \text{START I})$, $p(x \mid \cdots \text{went})$, $p(x \mid \cdots \text{to})$, $p(x \mid \cdots \text{the})$, $p(x \mid \cdots \text{park})$, $p(x \mid \text{START I went to the park .})$

Neural Network

Mapping each token ID to its corresponding embedding

Tokenization:

START I went to the park . STOP

ChatGPT tokenization example

https://platform.openai.com/tokenizer


Vocabulary: word-level
• For the n-gram model, our vocabulary $V$ consisted of all of the words in a language
• Some problems with this:
• $|V|$ can be quite large: ~470,000 words in Webster’s English Dictionary (3rd edition)
• Language is changing all of the time: 690 words were added to Merriam-Webster's in September 2023 (“rizz”, “goated”, “mid”)
• Long tail of infrequent words. Many words occur just a few times
• Some words may not appear in a training set of documents
• No modeled relationship between words: e.g., “run”, “ran”, “runs”, “runner” are all separate entries despite being linked in meaning
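
The last two problems can be seen in a small sketch. It assumes a tiny made-up corpus and a hypothetical `<UNK>` entry for unknown words (a common convention, not something stated on the slide); `build_word_vocab` and `encode` are illustrative names.

```python
from collections import Counter


def build_word_vocab(corpus, min_count=1):
    """Assign one integer ID per distinct whitespace-separated word."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    vocab = {"<UNK>": 0}  # reserved ID for words never seen in training
    for word, count in counts.most_common():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map each word to its ID, falling back to <UNK> for unseen words."""
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence.split()]


corpus = ["I went to the park", "I run to the store", "she runs to the park"]
vocab = build_word_vocab(corpus)

ids_seen = encode("I went to the store", vocab)
ids_unseen = encode("the runner ran to the park", vocab)

print(ids_seen)    # every word has its own ID
print(ids_unseen)  # "runner" and "ran" both collapse to the <UNK> ID
print(vocab["run"], vocab["runs"])  # unrelated IDs despite the shared stem
```

Words outside the training set lose their identity entirely, and related forms such as "run" and "runs" share nothing in this representation.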
Character-level?
What about representing text with characters?
• $V = \{a, b, c, \ldots, z\}$
• (Maybe add capital letters, punctuation, spaces, …)
• Pros:
• Small vocabulary size ($|V| = 26$ for English)
• Complete coverage (unseen words are represented by letters)
• Cons:
• Encoding becomes very long: # chars instead of # words
• Poor inductive bias for learning
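
A quick sketch of the trade-off, using the running example sentence. The lowercase-letter vocabulary and the variable names are assumptions for illustration only.

```python
import string

text = "I went to the park"

word_tokens = text.split()  # word-level: one token per word
char_tokens = list(text)    # character-level: one token per character (spaces included)

char_vocab = set(string.ascii_lowercase)  # V = {a, b, c, ..., z}

# Coverage: a word missing from any word-level training set is still
# representable, because every one of its letters is in the character vocabulary.
unseen_word = "rizz"
covered = all(ch in char_vocab for ch in unseen_word)

print(len(char_vocab))                     # vocabulary size
print(len(word_tokens), len(char_tokens))  # sequence length: words vs. characters
print(covered)
```

The character-level sequence is several times longer than the word-level one for the same text, which is the cost paid for the small vocabulary and full coverage.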
Word Character Subword tokenization!
How can we combine the high coverage of character-level representation with the efficiency of word-level representation?

Subword tokenization! (e.g., Byte-Pair Encoding)
• Start with character-level representations
• Build up representations from there

Original BPE Paper (Sennrich et al., 2016)
https://arxiv.org/abs/1508.07909

Byte-pair encoding: ChatGPT example

https://platform.openai.com/tokenizer


Byte-pair encoding: usage
• Basically state of the art in tokenization
• Used in all modern left-to-right large language models (LLMs),
including ChatGPT
Model/Tokenizer            Vocabulary Size
GPT-3.5/GPT-4/ChatGPT      100k
GPT-2/GPT-3                50k
Llama2                     32k
Falcon                     65k

Byte-pair encoding (BPE): algorithm
Required:
• Documents $D$
• Desired vocabulary size $N$ (greater than the number of distinct characters in $D$)

Algorithm:
• Pre-tokenize $D$ by splitting into words (split before whitespace/punctuation)
• Initialize $V$ as the set of characters in $D$
• Convert $D$ into a list of tokens (characters)
• While $|V| < N$:
    • Let $n := |V| + 1$
    • Get counts of all bigrams in $D$
    • For the most frequent bigram $v_i, v_j$ (breaking ties arbitrarily):
        • Let $v_n := \text{concat}(v_i, v_j)$
        • Change all instances in $D$ of $v_i, v_j$ to $v_n$ and add $v_n$ to $V$
<latexit sha1_base64="Mp+fyy1nsKBbBTnrEZouWnT6YhE=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4KjPia1nUhcsK9gHToWTSTBuaSYbkjlCGfoYbF4q49Wvc+Tdm2llo64HA4Zx7ybknTAQ34LrfTmlldW19o7xZ2dre2d2r7h+0jUo1ZS2qhNLdkBgmuGQt4CBYN9GMxKFgnXB8m/udJ6YNV/IRJgkLYjKUPOKUgJX8XkxgRInI7qb9as2tuzPgZeIVpIYKNPvVr95A0TRmEqggxviem0CQEQ2cCjat9FLDEkLHZMh8SyWJmQmyWeQpPrHKAEdK2ycBz9TfGxmJjZnEoZ3MI5pFLxf/8/wUousg4zJJgUk6/yhKBQaF8/vxgGtGQUwsIVRzmxXTEdGEgm2pYkvwFk9eJu2zundZv3g4rzVuijrK6Agdo1PkoSvUQPeoiVqIIoWe0St6c8B5cd6dj/loySl2DtEfOJ8/eB2RZA==</latexit>
<latexit sha1_base64="8qiJFR39nNAp8DHiEq3U32WlZhY=">AAAB8HicbVDLSgNBEOyNrxhfUY9eBoPgQcKu+DoGvXiMYB6SLMvsZJKMmZldZmYDYclXePGgiFc/x5t/42yyB40WNBRV3XR3hTFn2rjul1NYWl5ZXSuulzY2t7Z3yrt7TR0litAGiXik2iHWlDNJG4YZTtuxoliEnLbC0U3mt8ZUaRbJezOJqS/wQLI+I9hY6WEcsBM0Dh5LQbniVt0Z0F/i5aQCOepB+bPbi0giqDSEY607nhsbP8XKMMLptNRNNI0xGeEB7VgqsaDaT2cHT9GRVXqoHylb0qCZ+nMixULriQhtp8BmqBe9TPzP6ySmf+WnTMaJoZLMF/UTjkyEsu9RjylKDJ9Ygoli9lZEhlhhYmxGWQje4st/SfO06l1Uz+/OKrXrPI4iHMAhHIMHl1CDW6hDAwgIeIIXeHWU8+y8Oe/z1oKTz+zDLzgf36+tj7E=</latexit>

<latexit sha1_base64="kiVMEoeDnjNUXHKvcbrow7t6lH4=">AAAB6nicbVDLSgNBEOyNrxhfUY9eBoPgKeyKr2PQi8eI5gHJEmYns8mQ2dllpjcQlnyCFw+KePWLvPk3TpI9aLSgoajqprsrSKQw6LpfTmFldW19o7hZ2tre2d0r7x80TZxqxhsslrFuB9RwKRRvoEDJ24nmNAokbwWj25nfGnNtRKwecZJwP6IDJULBKFrpYdxTvXLFrbpzkL/Ey0kFctR75c9uP2ZpxBUySY3peG6CfkY1Cib5tNRNDU8oG9EB71iqaMSNn81PnZITq/RJGGtbCslc/TmR0ciYSRTYzoji0Cx7M/E/r5NieO1nQiUpcsUWi8JUEozJ7G/SF5ozlBNLKNPC3krYkGrK0KZTsiF4yy//Jc2zqndZvbg/r9Ru8jiKcATHcAoeXEEN7qAODWAwgCd4gVdHOs/Om/O+aC04+cwh/ILz8Q1oWI3l</latexit> <latexit sha1_base64="kiVMEoeDnjNUXHKvcbrow7t6lH4=">AAAB6nicbVDLSgNBEOyNrxhfUY9eBoPgKeyKr2PQi8eI5gHJEmYns8mQ2dllpjcQlnyCFw+KePWLvPk3TpI9aLSgoajqprsrSKQw6LpfTmFldW19o7hZ2tre2d0r7x80TZxqxhsslrFuB9RwKRRvoEDJ24nmNAokbwWj25nfGnNtRKwecZJwP6IDJULBKFrpYdxTvXLFrbpzkL/Ey0kFctR75c9uP2ZpxBUySY3peG6CfkY1Cib5tNRNDU8oG9EB71iqaMSNn81PnZITq/RJGGtbCslc/TmR0ciYSRTYzoji0Cx7M/E/r5NieO1nQiUpcsUWi8JUEozJ7G/SF5ozlBNLKNPC3krYkGrK0KZTsiF4yy//Jc2zqndZvbg/r9Ru8jiKcATHcAoeXEEN7qAODWAwgCd4gVdHOs/Om/O+aC04+cwh/ILz8Q1oWI3l</latexit>


<latexit sha1_base64="HbazfJBF5dM5E53nV3qGREA3sbg=">AAAB8nicbVDLSsNAFL3xWeur6tJNsAiuSiK+lkU3LivYB6ShTKbTduhkJszcCCX0M9y4UMStX+POv3HSZqGtBwYO59zLnHuiRHCDnvftrKyurW9slrbK2zu7e/uVg8OWUammrEmVULoTEcMEl6yJHAXrJJqROBKsHY3vcr/9xLThSj7iJGFhTIaSDzglaKWgGxMcUSKy1rRXqXo1bwZ3mfgFqUKBRq/y1e0rmsZMIhXEmMD3EgwzopFTwablbmpYQuiYDFlgqSQxM2E2izx1T63SdwdK2yfRnam/NzISGzOJIzuZRzSLXi7+5wUpDm7CjMskRSbp/KNBKlxUbn6/2+eaURQTSwjV3GZ16YhoQtG2VLYl+IsnL5PWec2/ql0+XFTrt0UdJTiGEzgDH66hDvfQgCZQUPAMr/DmoPPivDsf89EVp9g5gj9wPn8Ak3eRdg==</latexit>
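The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`count_bigrams`, `merge_pair`, `bpe_train`) and the three-word toy input are assumptions made for this sketch, and ties are broken by whichever bigram `Counter` saw first, which is one arbitrary choice among many.

```python
from collections import Counter


def count_bigrams(words):
    """Count adjacent token pairs inside each word (never across words)."""
    counts = Counter()
    for word in words:
        counts.update(zip(word, word[1:]))
    return counts


def merge_pair(word, pair, merged):
    """Replace every occurrence of `pair` in `word` with the token `merged`."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def bpe_train(words, vocab, target_size):
    """Merge the most frequent bigram until |V| reaches target_size (N)."""
    vocab = list(vocab)
    words = [list(w) for w in words]  # start from character tokens
    while len(vocab) < target_size:
        counts = count_bigrams(words)
        if not counts:  # nothing left to merge
            break
        pair, _ = counts.most_common(1)[0]  # ties: first bigram seen wins
        merged = pair[0] + pair[1]  # v_n := concat(v_i, v_j)
        words = [merge_pair(w, pair, merged) for w in words]
        if merged not in vocab:  # guard against a duplicate string
            vocab.append(merged)
    return words, vocab


# Hypothetical toy input, chosen so that no ties occur.
toy_words, toy_vocab = bpe_train(["hug", "pug", "hugs"], ["g", "h", "p", "s", "u"], 7)
print(toy_vocab)  # ['g', 'h', 'p', 's', 'u', 'ug', 'hug']
print(toy_words)  # [['hug'], ['p', 'ug'], ['hug', 's']]
```

Bigrams are counted within words only, so a merged token never spans a pre-tokenization boundary.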

Byte-pair encoding: example

D = {“i hug pugs”, “hugging pugs is fun”, “i make puns”}
N = 20

After pre-tokenization:
D = {“i”, “ hug”, “ pugs”, “hugging”, “ pugs”, “ is”, “ fun”, “i”, “ make”, “ puns”}

Initial vocabulary:
V = {‘ ’, ‘a’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘k’, ‘m’, ‘n’, ‘p’, ‘s’, ‘u’}, |V| = 13

As lists of character tokens:
D = { [‘i’], [‘ ’, ‘h’, ‘u’, ‘g’], [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’],
      [‘h’, ‘u’, ‘g’, ‘g’, ‘i’, ‘n’, ‘g’], [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’],
      [‘ ’, ‘i’, ‘s’], [‘ ’, ‘f’, ‘u’, ‘n’], [‘i’],
      [‘ ’, ‘m’, ‘a’, ‘k’, ‘e’], [‘ ’, ‘p’, ‘u’, ‘n’, ‘s’] }

Required:
• Documents D
• Desired vocabulary size N (greater than the number of distinct characters in D)

Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
• Initialize V as the set of characters in D
• Convert D into a list of tokens (characters)
• While |V| < N:
  • Let n := |V| + 1
  • Get counts of all bigrams in D
  • For the most frequent bigram v_i, v_j (breaking ties arbitrarily):
    • Let v_n := concat(v_i, v_j)
    • Change all instances in D of v_i, v_j to v_n and add v_n to V

Example inspired by: https://huggingface.co/docs/transformers/tokenizer_summary
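The three setup steps (pre-tokenize, initialize V, convert D to character tokens) can be reproduced for this example with a short Python sketch. The regular expression is an assumption made here to approximate "split before whitespace/punctuation": each word keeps its leading space and each punctuation mark becomes its own piece. Real tokenizers use more elaborate rules, and the names `pre_tokenize`, `words`, `vocab`, and `tokens` are illustrative only.

```python
import re

docs = ["i hug pugs", "hugging pugs is fun", "i make puns"]


def pre_tokenize(doc):
    """Split before whitespace/punctuation; a word keeps its leading space."""
    return re.findall(r" ?\w+|[^\w\s]", doc)


# Pre-tokenized D: one flat list of words across all documents.
words = [w for doc in docs for w in pre_tokenize(doc)]

# Initial vocabulary V: every distinct character in D, sorted for readability.
vocab = sorted({ch for doc in docs for ch in doc})

# D as lists of character tokens, one list per word.
tokens = [list(w) for w in words]

print(words)       # ['i', ' hug', ' pugs', 'hugging', ' pugs', ' is', ' fun', 'i', ' make', ' puns']
print(len(vocab))  # 13
print(tokens[1])   # [' ', 'h', 'u', 'g']
```

With |V| = 13 and N = 20, the while loop would then perform seven merges.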


Byte-pair encoding: example

V = {1: ‘ ’, 2: ‘a’, 3: ‘e’, 4: ‘f’, 5: ‘g’, 6: ‘h’, 7: ‘i’,
     8: ‘k’, 9: ‘m’, 10: ‘n’, 11: ‘p’, 12: ‘s’, 13: ‘u’}

Implementation aside: We normally store D with the token indices instead of the text itself!

D = { [7], [1, 6, 13, 5], [1, 11, 13, 5, 12],
      [6, 13, 5, 5, 7, 10, 5], [1, 11, 13, 5, 12], [1, 7, 12],
      [1, 4, 13, 10], [7], [1, 9, 2, 8, 3], [1, 11, 13, 10, 12] }

For legibility of the example, we will show the text corresponding to each token.

Required:
• Documents D
• Desired vocabulary size N (greater than the number of distinct characters in D)

Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
• Initialize V as the set of characters in D
• Convert D into a list of tokens (characters)
• While |V| < N:
  • Let n := |V| + 1
  • Get counts of all bigrams in D
  • For the most frequent bigram v_i, v_j (breaking ties arbitrarily):
    • Let v_n := concat(v_i, v_j)
    • Change all instances in D of v_i, v_j to v_n and add v_n to V
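The implementation aside and the "get counts of all bigrams" step can be sketched together in Python. The 1-based ids follow the slide's numbering of V; the variable names (`token_to_id`, `encoded`, `bigram_counts`) are assumptions made for this illustration, and the pre-tokenized words are written out literally so the block runs on its own.

```python
from collections import Counter

# Pre-tokenized D from the example, written out literally.
words = ["i", " hug", " pugs", "hugging", " pugs", " is", " fun", "i", " make", " puns"]

# V as a token -> index map, numbered from 1 in sorted order as on the slide.
vocab = sorted(set("".join(words)))
token_to_id = {tok: i for i, tok in enumerate(vocab, start=1)}

# Store D as lists of token indices instead of text.
encoded = [[token_to_id[ch] for ch in w] for w in words]

# Count bigrams over the indices, within each word only.
bigram_counts = Counter()
for ids in encoded:
    bigram_counts.update(zip(ids, ids[1:]))

top_pair, top_count = bigram_counts.most_common(1)[0]

print(encoded[1])           # [1, 6, 13, 5]
print(encoded[3])           # [6, 13, 5, 5, 7, 10, 5]
print(top_pair, top_count)  # (13, 5) 4
```

The most frequent pair (13, 5) is the bigram ‘u’, ‘g’, which occurs four times, so it is the first merge candidate.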

Byte-pair encoding: example
Required:
D = { [‘i’] , [‘ ’, ‘h’, ‘u’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
<latexit sha1_base64="JYwVy/JOJ32LM/WIMvX0Fdcktek=">AAADm3icpVJda9swFFXsfXTZR9PuZTAGYmHrHkqwx9b2ZaOsg5Wxhw6WthCZVFbkWESWjXRdGox/1P7K3vZvJqVhxHE3Brugw+Hec++RxI0LKQwEwc+O59+6fefuxr3u/QcPH232trZPTV5qxocsl7k+j6nhUig+BAGSnxea0yyW/CyeHbn62SXXRuTqG8wLHmV0qkQiGAWbGm91vpOMQsqorD7W+B0m1UsieQIjAvwKqguxUxMtpilEu7hRwDu7+CJ1UDqY/lVXrOgsmBUxJqTbtEyb4gUIB+q/jJzPDR2ipWtJkt9D1Z90Ys2oPSRzQB3MHPB/fIZqXg+Tujvu9YNBsAjcJuGS9NEyTsa9H2SSszLjCpikxozCoICoohoEk7zuktLwgrIZnfKRpYpm3ETVYrdq/MJmJjjJtT0K8CK72lHRzJh5FlulWySzXnPJm2qjEpKDqBKqKIErdm2UlBJDjt2i4onQnIGcW0KZFvaumKVUUwZ2nd0nhOtPbpPT14Nwb/D265v+4Yfld2ygp+g5eoVCtI8O0TE6QUPEvCfee++Td+w/84/8z/6Xa6nXWfY8Ro3wh78AsdoXeA==</latexit>

• Documents D
<latexit sha1_base64="Mp+fyy1nsKBbBTnrEZouWnT6YhE=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4KjPia1nUhcsK9gHToWTSTBuaSYbkjlCGfoYbF4q49Wvc+Tdm2llo64HA4Zx7ybknTAQ34LrfTmlldW19o7xZ2dre2d2r7h+0jUo1ZS2qhNLdkBgmuGQt4CBYN9GMxKFgnXB8m/udJ6YNV/IRJgkLYjKUPOKUgJX8XkxgRInI7qb9as2tuzPgZeIVpIYKNPvVr95A0TRmEqggxviem0CQEQ2cCjat9FLDEkLHZMh8SyWJmQmyWeQpPrHKAEdK2ycBz9TfGxmJjZnEoZ3MI5pFLxf/8/wUousg4zJJgUk6/yhKBQaF8/vxgGtGQUwsIVRzmxXTEdGEgm2pYkvwFk9eJu2zundZv3g4rzVuijrK6Agdo1PkoSvUQPeoiVqIIoWe0St6c8B5cd6dj/loySl2DtEfOJ8/eB2RZA==</latexit>

• Desired vocabulary size N (greater than chars in D)

Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
• Initialize V as the set of characters in D
• Convert D into a list of tokens (characters)
• While |V| < N:
    • Let n := |V| + 1
    • Get counts of all bigrams in D
    • For the most frequent bigram v_i, v_j (breaking ties arbitrarily):
        • Let v_n := concat(v_i, v_j)
        • Change all instances in D of v_i, v_j to v_n and add v_n to V

D (continued):
[‘h’, ‘u’, ‘g’, ‘g’, ‘i’, ‘n’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
[‘ ’, ‘i’, ‘s’] , [‘ ’, ‘f’, ‘u’, ‘n’] , [‘i’] ,
[‘ ’, ‘m’, ‘a’, ‘k’, ‘e’] , [‘ ’, ‘p’, ‘u’, ‘n’, ‘s’]}

Bigram      Count
‘u’, ‘g’    4
‘p’, ‘u’    3
‘ ’, ‘p’    3
‘h’, ‘u’    2
…           …

v_14 := concat(‘u’, ‘g’) = ‘ug’
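
The "get counts of all bigrams in D" step can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the helper name `count_bigrams` is hypothetical, and D is written out as the character-level corpus from this example. Bigrams are counted inside each pre-token only, so a pair never spans a word boundary.

```python
from collections import Counter

# Character-level corpus D from the example; each inner list is one pre-token.
D = [
    ["i"], [" ", "h", "u", "g"], [" ", "p", "u", "g", "s"],
    ["h", "u", "g", "g", "i", "n", "g"], [" ", "p", "u", "g", "s"],
    [" ", "i", "s"], [" ", "f", "u", "n"], ["i"],
    [" ", "m", "a", "k", "e"], [" ", "p", "u", "n", "s"],
]

def count_bigrams(corpus):
    """Count adjacent token pairs; pairs never cross a pre-token boundary."""
    counts = Counter()
    for word in corpus:
        # zip(word, word[1:]) yields every adjacent pair within the pre-token.
        counts.update(zip(word, word[1:]))
    return counts

bigram_counts = count_bigrams(D)

# Print the most frequent pairs, mirroring the Bigram/Count table.
for pair, count in bigram_counts.most_common(4):
    print(pair, count)
```

The most frequent pair here is (‘u’, ‘g’), which is why the first merge creates ‘ug’.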

42 Lecture 3: Tokenization
Byte-pair encoding: example
Required:
• Documents D
• Desired vocabulary size N (greater than chars in D)

Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
• Initialize V as the set of characters in D
• Convert D into a list of tokens (characters)
• While |V| < N:
    • Let n := |V| + 1
    • Get counts of all bigrams in D
    • For the most frequent bigram v_i, v_j (breaking ties arbitrarily):
        • Let v_n := concat(v_i, v_j)
        • Change all instances in D of v_i, v_j to v_n and add v_n to V

D before the merge:
D = { [‘i’] , [‘ ’, ‘h’, ‘u’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
[‘h’, ‘u’, ‘g’, ‘g’, ‘i’, ‘n’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
[‘ ’, ‘i’, ‘s’] , [‘ ’, ‘f’, ‘u’, ‘n’] , [‘i’] ,
[‘ ’, ‘m’, ‘a’, ‘k’, ‘e’] , [‘ ’, ‘p’, ‘u’, ‘n’, ‘s’]}

v_14 := concat(‘u’, ‘g’) = ‘ug’

D after the merge:
D = { [‘i’] , [‘ ’, ‘h’, ‘ug’] , [‘ ’, ‘p’, ‘ug’, ‘s’] ,
[‘h’, ‘ug’, ‘g’, ‘i’, ‘n’, ‘g’] , [‘ ’, ‘p’, ‘ug’, ‘s’] ,
[‘ ’, ‘i’, ‘s’] , [‘ ’, ‘f’, ‘u’, ‘n’] , [‘i’] ,
[‘ ’, ‘m’, ‘a’, ‘k’, ‘e’] , [‘ ’, ‘p’, ‘u’, ‘n’, ‘s’]}

V = {‘ ’, ‘a’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘k’, ‘m’, ‘n’, ‘p’, ‘s’, ‘u’, ‘ug’}, |V| = 14
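
The "change all instances in D of v_i, v_j to v_n and add v_n to V" step might look like the sketch below. The function name `merge_pair` is a hypothetical helper, and scanning each pre-token left to right is an assumption about how overlapping matches are resolved; the slides do not specify it.

```python
def merge_pair(corpus, pair):
    """Replace each adjacent occurrence of `pair` with the concatenated token."""
    merged_token = pair[0] + pair[1]
    new_corpus = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            # Merge when the current token and the next one form the pair.
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(merged_token)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus.append(out)
    return new_corpus

# Character-level corpus D from the example.
D = [
    ["i"], [" ", "h", "u", "g"], [" ", "p", "u", "g", "s"],
    ["h", "u", "g", "g", "i", "n", "g"], [" ", "p", "u", "g", "s"],
    [" ", "i", "s"], [" ", "f", "u", "n"], ["i"],
    [" ", "m", "a", "k", "e"], [" ", "p", "u", "n", "s"],
]

# V starts as the set of characters in D (13 symbols, including the space).
V = sorted({ch for word in D for ch in word})

# First merge from the example: v_14 := concat('u', 'g') = 'ug'.
D = merge_pair(D, ("u", "g"))
V.append("ug")

print(D[3])    # the pre-token for "hugging" after the merge
print(len(V))  # vocabulary size after adding 'ug'
```

Note that only the first ‘u’, ‘g’ pair in "hugging" is merged; the second ‘g’ stays a separate token because it is not preceded by ‘u’.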


Byte-pair encoding: example
Required:
• Documents D
• Desired vocabulary size N (greater than chars in D)

Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
• Initialize V as the set of characters in D
• Convert D into a list of tokens (characters)
• While |V| < N:
    • Let n := |V| + 1
    • Get counts of all bigrams in D
    • For the most frequent bigram v_i, v_j (breaking ties arbitrarily):
        • Let v_n := concat(v_i, v_j)
        • Change all instances in D of v_i, v_j to v_n and add v_n to V

D = { [‘i’] , [‘ ’, ‘h’, ‘ug’] , [‘ ’, ‘p’, ‘ug’, ‘s’] ,
[‘h’, ‘ug’, ‘g’, ‘i’, ‘n’, ‘g’] , [‘ ’, ‘p’, ‘ug’, ‘s’] ,
[‘ ’, ‘i’, ‘s’] , [‘ ’, ‘f’, ‘u’, ‘n’] , [‘i’] ,
[‘ ’, ‘m’, ‘a’, ‘k’, ‘e’] , [‘ ’, ‘p’, ‘u’, ‘n’, ‘s’]}

Bigram       Count
‘ ’, ‘p’     3
‘p’, ‘ug’    2
‘ug’, ‘s’    2
‘u’, ‘n’     2
…            …

v_15 := concat(‘ ’, ‘p’) = ‘ p’
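
Putting the steps together, the whole training loop can be sketched as below. This is an illustrative sketch under stated assumptions, not the lecture's reference implementation: `train_bpe` and `merge_word` are hypothetical names, pre-tokenization is assumed to have been done already (the pre-tokens are passed in as strings), and ties are broken by taking the first most-frequent pair encountered, which is one arbitrary choice among many.

```python
from collections import Counter

def merge_word(word, pair, new_token):
    """Replace adjacent occurrences of `pair` in one pre-token, left to right."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def train_bpe(pretokens, vocab_size):
    """Learn merges until the vocabulary holds `vocab_size` tokens."""
    corpus = [list(p) for p in pretokens]                    # D as characters
    vocab = sorted({ch for word in corpus for ch in word})   # initial V
    merges = []
    while len(vocab) < vocab_size:
        counts = Counter()
        for word in corpus:
            counts.update(zip(word, word[1:]))
        if not counts:
            break  # nothing left to merge
        # Most frequent bigram; ties go to the first pair seen (arbitrary).
        best = max(counts, key=counts.get)
        new_token = best[0] + best[1]
        corpus = [merge_word(word, best, new_token) for word in corpus]
        if new_token not in vocab:
            vocab.append(new_token)
        merges.append(best)
    return vocab, merges, corpus

# Pre-tokens of the example corpus (leading spaces are kept with each word).
pretokens = ["i", " hug", " pugs", "hugging", " pugs",
             " is", " fun", "i", " make", " puns"]

vocab, merges, corpus = train_bpe(pretokens, vocab_size=15)
print(merges)      # learned merges, in order
print(vocab[-2:])  # the two tokens added to V
```

With N = 15 the loop performs exactly two merges on this corpus, matching v_14 = ‘ug’ and v_15 = ‘ p’ from the example. Neither step involves a tie, so the tie-breaking rule does not affect the result here.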

44 Lecture 3: Tokenization
Byte-pair encoding: example

Required:
• Documents D
• Desired vocabulary size N (greater than the number of characters in D)

Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
• Initialize V as the set of characters in D
• Convert D into a list of tokens (characters)
• While |V| < N:
  • Let n := |V| + 1
  • Get counts of all bigrams in D
  • For the most frequent bigram v_i, v_j (breaking ties arbitrarily):
    • Let v_n := concat(v_i, v_j)
    • Change all instances in D of v_i, v_j to v_n and add v_n to V

Before the merge:
D = { [‘i’], [‘ ’, ‘h’, ‘ug’], [‘ ’, ‘p’, ‘ug’, ‘s’],
      [‘h’, ‘ug’, ‘g’, ‘i’, ‘n’, ‘g’], [‘ ’, ‘p’, ‘ug’, ‘s’],
      [‘ ’, ‘i’, ‘s’], [‘ ’, ‘f’, ‘u’, ‘n’], [‘i’],
      [‘ ’, ‘m’, ‘a’, ‘k’, ‘e’], [‘ ’, ‘p’, ‘u’, ‘n’, ‘s’] }

v_15 := concat(‘ ’, ‘p’) = ‘ p’

After the merge:
D = { [‘i’], [‘ ’, ‘h’, ‘ug’], [‘ p’, ‘ug’, ‘s’],
      [‘h’, ‘ug’, ‘g’, ‘i’, ‘n’, ‘g’], [‘ p’, ‘ug’, ‘s’],
      [‘ ’, ‘i’, ‘s’], [‘ ’, ‘f’, ‘u’, ‘n’], [‘i’],
      [‘ ’, ‘m’, ‘a’, ‘k’, ‘e’], [‘ p’, ‘u’, ‘n’, ‘s’] }

V = {‘ ’, ‘a’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘k’, ‘m’,
     ‘n’, ‘p’, ‘s’, ‘u’, ‘ug’, ‘ p’}, |V| = 15

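One pass of the loop body can be sketched in Python as follows. This is an illustrative sketch rather than the lecture's own code: the names `count_bigrams` and `merge_pair` are made up here, and ties are broken by first occurrence, which is only one of many arbitrary choices.

```python
from collections import Counter

def count_bigrams(docs):
    """Count adjacent token pairs inside each pre-tokenized word."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one concatenated token."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# D as shown on the slide just before the merge ('ug' is already a token).
D = [["i"], [" ", "h", "ug"], [" ", "p", "ug", "s"],
     ["h", "ug", "g", "i", "n", "g"], [" ", "p", "ug", "s"],
     [" ", "i", "s"], [" ", "f", "u", "n"], ["i"],
     [" ", "m", "a", "k", "e"], [" ", "p", "u", "n", "s"]]

counts = count_bigrams(D)
best = max(counts, key=counts.get)  # first pair reaching the highest count
D_after = [merge_pair(tokens, best) for tokens in D]
print(best, counts[best])
```

On this corpus the pair (‘ ’, ‘p’) occurs three times and every other pair occurs at most twice, so no tie-breaking is needed for this particular step.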
Byte-pair encoding: example
Repeat until |V| = N …

D = { [‘i’], [‘ hug’], [‘ pugs’],
      [‘hug’, ‘g’, ‘i’, ‘n’, ‘g’], [‘ pugs’],
      [‘ ’, ‘i’, ‘s’], [‘ ’, ‘f’, ‘un’], [‘i’],
      [‘ ’, ‘m’, ‘a’, ‘k’, ‘e’], [‘ p’, ‘un’, ‘s’] }

V = {‘ ’, ‘a’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘k’, ‘m’, ‘n’, ‘p’, ‘s’, ‘u’,
     ‘ug’, ‘ p’, ‘hug’, ‘ pug’, ‘ pugs’, ‘un’, ‘ hug’}, |V| = 20

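The full training loop might look like the sketch below. Several things here are assumptions made for illustration: the function name `train_bpe`, the pre-tokenized word list (reconstructed from the token lists on the slide), and the tie-breaking rule (first pair encountered). With that rule, the run happens to end in the same 20-token vocabulary as the slide; a different tie-breaking rule could give a different one.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one concatenated token."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(words, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` or nothing is left to merge."""
    docs = [list(word) for word in words]               # tokens start as characters
    vocab = sorted({ch for word in words for ch in word})
    merges = []
    while len(vocab) < vocab_size:
        counts = Counter()
        for tokens in docs:
            counts.update(zip(tokens, tokens[1:]))
        if not counts:                                   # every word is a single token
            break
        best = max(counts, key=counts.get)               # ties: first pair encountered
        docs = [merge_pair(tokens, best) for tokens in docs]
        vocab.append(best[0] + best[1])
        merges.append(best)
    return vocab, merges, docs

# Pre-tokenized corpus (assumed; each word keeps its leading space).
WORDS = ["i", " hug", " pugs", "hugging", " pugs",
         " is", " fun", "i", " make", " puns"]

vocab, merges, docs = train_bpe(WORDS, vocab_size=20)
print(vocab[13:])  # the learned (multi-character) tokens
print(docs)
```

Returning `merges` alongside `vocab` matters later: encoding a new string replays the merges in the order they were learned.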
Byte-pair encoding: example

D = { [‘i’], [‘ hug’], [‘ pugs’],
      [‘hug’, ‘g’, ‘i’, ‘n’, ‘g’], [‘ pugs’],
      [‘ ’, ‘i’, ‘s’], [‘ ’, ‘f’, ‘un’], [‘i’],
      [‘ ’, ‘m’, ‘a’, ‘k’, ‘e’], [‘ p’, ‘un’, ‘s’] }

D = { [7], [20], [18],
      [16, 5, 7, 10, 5], [18],
      [1, 7, 12], [1, 4, 19], [7],
      [1, 9, 2, 8, 3], [15, 19, 12] }   (as token indices)

V = {1: ‘ ’, 2: ‘a’, 3: ‘e’, 4: ‘f’, 5: ‘g’, 6: ‘h’, 7: ‘i’,
     8: ‘k’, 9: ‘m’, 10: ‘n’, 11: ‘p’, 12: ‘s’, 13: ‘u’,
     14: ‘ug’, 15: ‘ p’, 16: ‘hug’, 17: ‘ pug’, 18: ‘ pugs’,
     19: ‘un’, 20: ‘ hug’}

Questions to think about:
• Is every token we made used in the corpus? Why or why not?
• How much memory (#tokens) have we saved for each document?
• What would happen if you kept adding vocabulary until you couldn’t anymore?

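The mapping from tokens to indices, and the token counts the questions ask about, can be checked with a few lines of Python. This is only a sketch: the names `VOCAB`, `token_to_id`, and `D_ids` are chosen here for illustration, and indices are 1-based to match the slide.

```python
# Vocabulary in the order shown on the slide (index = position, starting at 1).
VOCAB = [" ", "a", "e", "f", "g", "h", "i", "k", "m", "n", "p", "s", "u",
         "ug", " p", "hug", " pug", " pugs", "un", " hug"]
token_to_id = {token: index for index, token in enumerate(VOCAB, start=1)}

# Final tokenized corpus from the slide.
D = [["i"], [" hug"], [" pugs"], ["hug", "g", "i", "n", "g"], [" pugs"],
     [" ", "i", "s"], [" ", "f", "un"], ["i"],
     [" ", "m", "a", "k", "e"], [" p", "un", "s"]]

# Same corpus written as token indices.
D_ids = [[token_to_id[token] for token in tokens] for tokens in D]

# Sequence length before training (characters) and after (tokens).
n_chars = sum(len("".join(tokens)) for tokens in D)
n_tokens = sum(len(tokens) for tokens in D)

# Vocabulary entries that no longer appear anywhere in the tokenized corpus.
used = {token for tokens in D for token in tokens}
unused = [token for token in VOCAB if token not in used]

print(D_ids)
print(n_chars, n_tokens)
print(unused)
```

Comparing `n_chars` with `n_tokens`, and looking at `unused`, gives a concrete starting point for the first two questions.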
Byte-pair encoding: tokenization/encoding

V = {1: ‘ ’, 2: ‘a’, 3: ‘e’, 4: ‘f’, 5: ‘g’, 6: ‘h’, 7: ‘i’,
     8: ‘k’, 9: ‘m’, 10: ‘n’, 11: ‘p’, 12: ‘s’, 13: ‘u’,
     14: ‘ug’, 15: ‘ p’, 16: ‘hug’, 17: ‘ pug’, 18: ‘ pugs’,
     19: ‘un’, 20: ‘ hug’}

With this vocabulary, can you represent (or, tokenize/encode):
• “apple”?
  • No, there is no ‘l’ in the vocabulary
• “huge”?
  • Yes - [16, 3]
• “ huge”?
  • Yes - [20, 3]
• “ hugest”?
  • No, there is no ‘t’ in the vocabulary
• “unassumingness”?
  • Yes - [19, 2, 12, 12, 13, 9, 7, 10, 5, 10, 3, 12, 12]

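Whether a string can be represented at all depends only on whether each of its characters is in the vocabulary, and an index list can be checked by decoding it back to text. A minimal sketch, with the helper names `missing_chars` and `decode` invented for this example:

```python
VOCAB = [" ", "a", "e", "f", "g", "h", "i", "k", "m", "n", "p", "s", "u",
         "ug", " p", "hug", " pug", " pugs", "un", " hug"]

def missing_chars(text, vocab):
    """Characters of `text` that have no single-character token in `vocab`."""
    single_chars = {token for token in vocab if len(token) == 1}
    return sorted(set(text) - single_chars)

def decode(ids, vocab):
    """Turn 1-based token indices back into the string they spell."""
    return "".join(vocab[index - 1] for index in ids)

for text in ["apple", "huge", " huge", " hugest", "unassumingness"]:
    print(repr(text), "missing:", missing_chars(text, VOCAB))

print(repr(decode([16, 3], VOCAB)))
print(repr(decode([20, 3], VOCAB)))
```

An empty `missing` list means the string is representable, in the worst case one character at a time.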
Byte-pair encoding: tokenization/encoding

V = {1: ‘ ’, 2: ‘a’, 3: ‘e’, 4: ‘f’, 5: ‘g’, 6: ‘h’, 7: ‘i’,
     8: ‘k’, 9: ‘m’, 10: ‘n’, 11: ‘p’, 12: ‘s’, 13: ‘u’,
     14: ‘ug’, 15: ‘ p’, 16: ‘hug’, 17: ‘ pug’, 18: ‘ pugs’,
     19: ‘un’, 20: ‘ hug’}

• Sometimes, there may be more than one way to represent a word with the vocabulary…
• E.g., “ hugs” = [20, 12] = [1, 16, 12] = [1, 6, 14, 12] = [1, 6, 13, 5, 12]
• Which is the best representation? Why?

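To see every way a string can be split into vocabulary tokens, a brute-force enumeration is enough for short inputs. The sketch below is illustrative only: `segmentations` is a name made up here, and the recursion is exponential in the worst case, so it is not how a real tokenizer works.

```python
VOCAB = [" ", "a", "e", "f", "g", "h", "i", "k", "m", "n", "p", "s", "u",
         "ug", " p", "hug", " pug", " pugs", "un", " hug"]
token_to_id = {token: index for index, token in enumerate(VOCAB, start=1)}

def segmentations(text, token_to_id):
    """Return every way to write `text` as a sequence of vocabulary indices."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in token_to_id:
            for rest in segmentations(text[end:], token_to_id):
                results.append([token_to_id[piece]] + rest)
    return results

options = segmentations(" hugs", token_to_id)
for ids in sorted(options, key=len):
    print(ids)
```

Sorting by length puts the shortest representation first, which is one natural candidate for the "best" one since it uses the fewest tokens.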
Byte-pair encoding: tokenization/encoding
V = {1 : ‘ ’, 2 : ‘a’, 3 : ‘e’, 4 : ‘f’, 5 : ‘g’, 6 : ‘h’, 7 : ‘i’,
<latexit sha1_base64="lHm+oUzz/MCY2ZwNogXxdq6cAiY=">AAADTnicbZJNb9MwGMedjJdSXtaNIxeLisEBVUnZ1g4JaYILxyHRblJTFcd1WquOE/kFUUX5hFwQNz4GFw4gBLEbnG7jkSw9v78fP/7HeeKcUamC4Jvn79y4eet260777r37D3Y7e/tjmWmByQhnLBMXMZKEUU5GiipGLnJBUBozch6v3pj9849ESJrx92qdk2mKFpwmFCNVSbM9bx6lSC0xYsW4hK9gVByEL2GkyCdVfIBPy+ew7xAZfOGQGDx0mBg8crgweOxwaXDgkBqMovbB0EkrI504TA2GgWNuufGWW27MScuNO/3vhrCxqK2psDEJN122bNYlg62SWhpekqRr31jW1mO/8Qxtu6hszzrdoBfYgNeTsE66oI6zWedrNM+wTglXmCEpJ2GQq2mBhKKYkbIdaUlyhFdoQSZVylFK5LSw41DCJ5Uyh0kmqsUVtOr2iQKlUq7TuKo0/15e3TPi//YmWiXDaUF5rhXheHNRohlUGTSzBedUEKzYukoQFrTyCvESCYRVNYHmEcKrn3w9Gfd74XHv6N1h9/R1/Rwt8Ag8Bs9ACAbgFLwFZ2AEsPfZ++799H75X/wf/m//z6bU9+ozD8Gl2Gn9BePSAiA=</latexit>

8 : ‘k’, 9 : ‘m’, 10 : ‘n’, 11 : ‘p’, 12 : ‘s’, 13 : ‘u’,


14 : ‘ug’, 15 : ‘ p’, 16 : ‘hug’, 17 : ‘ pug’, 18 : ‘ pugs’,
19 : ‘un’, 20 : ‘ hug’}
Encoding algorithm
Given string s and (ordered) vocab V:
• Pretokenize D in the same way as before
• Tokenize D into characters
• Perform merge rules in the same order as in training until no more merges may be done

Byte-pair encoding: tokenization/encoding
Encoding examples (using the vocabulary V and the encoding algorithm above):
• Encode(“ hugs”) = [20, 12]
• Encode(“misshapenness”) = [9, 7, 12, 12, 6, 2, 11, 3, 10, 10, 3, 12, 12]
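
The encoding steps can be sketched in a few lines of Python. This is an illustrative sketch rather than the lecture's reference implementation. The merge list `MERGES` is an assumption inferred from the order of tokens 14 to 20 in V (each merged token is taken to come from one merge of two earlier tokens), and pre-tokenization is skipped because each example is already a single word. `apply_merge` and `encode` are hypothetical helper names.

```python
# Toy vocabulary from the slide (token string -> token id).
VOCAB = {
    " ": 1, "a": 2, "e": 3, "f": 4, "g": 5, "h": 6, "i": 7,
    "k": 8, "m": 9, "n": 10, "p": 11, "s": 12, "u": 13,
    "ug": 14, " p": 15, "hug": 16, " pug": 17, " pugs": 18,
    "un": 19, " hug": 20,
}

# ASSUMPTION: merge rules in training order, inferred from tokens 14-20 of V.
MERGES = [
    ("u", "g"), (" ", "p"), ("h", "ug"), (" p", "ug"),
    (" pug", "s"), ("u", "n"), (" ", "hug"),
]


def apply_merge(tokens, left, right):
    """Replace each adjacent (left, right) pair with the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def encode(word):
    """Encode one already pre-tokenized word into token ids."""
    tokens = list(word)  # start from single characters
    for left, right in MERGES:  # one pass, in training order
        tokens = apply_merge(tokens, left, right)
    return [VOCAB[token] for token in tokens]


print(encode(" hugs"))
print(encode("misshapenness"))
```

With the assumed merge list, both calls reproduce the slide's examples: “ hugs” collapses to two tokens, while no merge rule ever fires for “misshapenness”, so it stays at one token per character.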
Byte-pair encoding: decoding
Decoding algorithm
Given list of tokens T:
• Initialize string s := “”
• Keep popping off tokens from the front of T and appending the corresponding string to s
Decoding examples (using the vocabulary V above):
• Decode([20, 12]) = “ hugs”
• Decode([9, 7, 12, 12, 6, 2, 11, 3, 10, 10, 3, 12, 12]) = “misshapenness”
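
Decoding is a table lookup followed by concatenation. A minimal sketch, assuming the toy vocabulary V from the slides; `ID_TO_TOKEN` and `decode` are hypothetical names used only here.

```python
# Toy vocabulary from the slide (token id -> token string).
ID_TO_TOKEN = {
    1: " ", 2: "a", 3: "e", 4: "f", 5: "g", 6: "h", 7: "i",
    8: "k", 9: "m", 10: "n", 11: "p", 12: "s", 13: "u",
    14: "ug", 15: " p", 16: "hug", 17: " pug", 18: " pugs",
    19: "un", 20: " hug",
}


def decode(token_ids):
    """Pop token ids off the front of the list and append their strings."""
    remaining = list(token_ids)  # copy so the caller's list is untouched
    s = ""
    while remaining:
        s += ID_TO_TOKEN[remaining.pop(0)]
    return s


print(repr(decode([20, 12])))
print(repr(decode([9, 7, 12, 12, 6, 2, 11, 3, 10, 10, 3, 12, 12])))
```

Because every token id maps to exactly one string, decoding is unambiguous even though several different id sequences can decode to the same text.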

Byte-pair encoding: properties
• Efficient to run (greedy vs. global optimization)
• Lossless compression
• Potentially some shared representations - e.g., the token “hug” could
be used both in “hug” and “hugging”

Weird properties of tokenizers

• Token != word
• Spaces are part of token
• “run” is a different token than “ run”
• Not invariant to case changes
• “Run” is a different token than “run”

• Tokenization fits the statistics of your data
• e.g., while these words are multiple tokens…
• These words are all 1 token in GPT-3’s tokenizer!
• Why?
• Reddit usernames and certain code attributes appeared often enough in the corpus to surface as their own tokens!
Example from https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
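
The space and case properties listed above already show up in the toy vocabulary V from the BPE slides. The snippet below is a small illustration under that assumption; `lookup` is a hypothetical helper, and real tokenizers behave analogously but with far larger vocabularies.

```python
# Toy vocabulary from the BPE slides (token string -> token id).
VOCAB = {
    " ": 1, "a": 2, "e": 3, "f": 4, "g": 5, "h": 6, "i": 7,
    "k": 8, "m": 9, "n": 10, "p": 11, "s": 12, "u": 13,
    "ug": 14, " p": 15, "hug": 16, " pug": 17, " pugs": 18,
    "un": 19, " hug": 20,
}


def lookup(piece):
    """Return the token id for an exact string match, or None if absent."""
    return VOCAB.get(piece)


# The leading space is part of the token, so these are two different ids.
print(lookup("hug"), lookup(" hug"))

# The vocabulary is case-sensitive: neither "Hug" nor "H" is a token here.
print(lookup("Hug"), lookup("H"))
```

In this toy setting a capitalized word cannot even be encoded, because “H” never appeared in the training data. Byte-level vocabularies avoid that failure mode, but the capitalized form still maps to different tokens than the lowercase one.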
Other tokenization variants

Variants: no spaces in tokens
• The way we presented BPE, we included whitespace with the following word (e.g., “ pug”)
• This is most common in modern LMs
• However, in another BPE variant, you instead strip whitespace (e.g., “pug”) and add spaces between words at decoding time
• This was the original BPE paper’s implementation!
• Example:
• [“I”, “hug”, “pugs”] -> “I hug pugs” (w/out whitespace)
• [“I”, “ hug”, “ pugs”] -> “I hug pugs” (w/ whitespace)
Original (w/ whitespace) vs. updated (w/out whitespace). Lines marked - appear only in the original; lines marked + appear only in the updated version.
Required:
• Documents D
• Desired vocabulary size N (greater than chars in D)
Algorithm:
- Pre-tokenize D by splitting into words (split before whitespace/punctuation)
+ Pre-tokenize D by splitting into words (removing whitespace)
• Initialize V as the set of characters in D
Variants: no spaces in tokens
• For sub-word tokens, need to add a “continue word” special character
• E.g., for the word “Tokenization”, if the subword tokens are “Token” and “ization”:
• W/out special character: [“Token”, “ization”] -> “Token ization”
• W/ special character #: [“Token”, “#ization”] -> “Tokenization”
• When decoding, if a token does not have the special character, add a space before it
• Example:
• [“I”, “li”, “#ke”, “to”, “hug”, “pug”, “#s”] -> “I like to hug pugs”
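
The decoding rule for this variant can be sketched as follows. This is a minimal illustration, assuming “#” as the continue-word marker as on the slide; `decode_words` is a hypothetical name, and a real tokenizer would need a marker that cannot collide with ordinary text.

```python
def decode_words(tokens, marker="#"):
    """Join whitespace-free tokens back into text.

    A token starting with `marker` continues the previous word;
    any other token starts a new word and is preceded by a space.
    """
    text = ""
    for token in tokens:
        if token.startswith(marker):
            text += token[len(marker):]  # glue onto the current word
        elif text:
            text += " " + token  # new word: insert the stripped space
        else:
            text += token  # first token: no leading space
    return text


print(decode_words(["I", "li", "#ke", "to", "hug", "pug", "#s"]))
print(decode_words(["Token", "#ization"]))
```

Note that the decoder always inserts exactly one space between words, whatever whitespace the original text contained.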

Variants: no spaces in tokens
• Loses some whitespace information (lossy compression!)
• E.g., Tokenize(“I eat cake.”) == Tokenize(“ I eat cake .”)
• Especially problematic for code (e.g., Python) - why?

(Example using GPT’s tokenizer, which does not include spaces in the token)
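
The loss of whitespace information can be reproduced with a simple pre-tokenizer. The sketch below is an assumption-laden illustration: the regular expression (words or single punctuation marks, whitespace discarded) stands in for whatever splitting rule a real whitespace-stripping tokenizer uses, and `pretokenize_strip_whitespace` is a hypothetical name.

```python
import re


def pretokenize_strip_whitespace(text):
    """Split into words and punctuation marks, discarding all whitespace."""
    return re.findall(r"\w+|[^\w\s]", text)


# Differently spaced sentences collapse to the same pieces.
a = pretokenize_strip_whitespace("I eat cake.")
b = pretokenize_strip_whitespace(" I  eat   cake .")
print(a == b, a)

# In Python source, indentation carries meaning, but it is discarded too.
indented = "def f():\n    return 1"
flat = "def f():\nreturn 1"
print(pretokenize_strip_whitespace(indented) == pretokenize_strip_whitespace(flat))
```

Since both Python snippets yield identical pieces, a model working on these tokens could not tell valid indented code from the invalid flat version, which is one way to see why this variant is problematic for code.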

Variants: no pre-tokenization
• In the variant we proposed, we start by splitting into words
• This guarantees that each token will be no longer than one word
• However, this does not work so well for character-based languages.
Why?

Variants: no pre-tokenization
• Instead, we could skip pre-tokenization and treat the entire document or sentence as a single list of tokens
• Allows for tokens to span multiple words/characters
• Sometimes called SentencePiece tokenization* (Kudo, 2018)
* (not to be confused with the SentencePiece library, which is an implementation of many kinds of tokenization)
Paper: https://arxiv.org/abs/1808.06226
Library: https://github.com/google/sentencepiece
Original (w/ pre-tokenization) vs. updated (w/out pre-tokenization). Lines marked - appear only in the original; lines marked + appear only in the updated version.
Required:
• Documents D
• Desired vocabulary size N (greater than chars in D)
Algorithm:
- Pre-tokenize D by splitting into words (split before whitespace/punctuation)
+ Do not pre-tokenize D
• Initialize V as the set of characters in D
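
To see how skipping pre-tokenization lets tokens cross word boundaries, here is a small training sketch. It is a simplification, not the SentencePiece implementation: it stops after a fixed number of merges instead of a target vocabulary size N, breaks ties by first occurrence, and runs on a made-up string. `train_bpe_no_pretokenization` is a hypothetical name.

```python
def train_bpe_no_pretokenization(text, num_merges):
    """Learn BPE merges over the raw character sequence, spaces included."""
    tokens = list(text)  # no word splitting: the whole text is one sequence
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across the entire sequence.
        pair_counts = {}
        for pair in zip(tokens, tokens[1:]):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the sequence with the chosen pair merged.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = train_bpe_no_pretokenization("ab ab ab ab", num_merges=3)
print(merges)
print(tokens)
```

On this toy input the third merge produces a token with a space in its interior, so a single token covers two words. With word-level pre-tokenization that merge could never be learned.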
Variants: no pre-tokenization
• Allows sequences of words/characters to become tokens

SentencePiece paper example in Japanese: https://arxiv.org/pdf/1808.06226.pdf

Jurassic-1 model example in English: https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf
Variants: byte-based
• Originally, we presented BPE as dealing with characters as the smallest unit
• However, there are many characters - especially if you want to support:
• character-based languages (e.g., Chinese has >100k characters!)
• non-alphanumeric characters like emojis (Unicode 15 has ~150k characters!)
• Instead, can initialize tokens as the set of bytes! (e.g., with UTF-8*)
* Only 256 bytes! Each Unicode char is 1-4 bytes
Original (w/ characters) vs. modified (w/ bytes). Lines marked - appear only in the original; lines marked + appear only in the modified version.
Required:
• Documents D
• Desired vocabulary size N (greater than chars in D)
Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
- Initialize V as the set of characters in D
+ Initialize V as the set of bytes in D

- Convert D into a list of tokens (characters) + Convert D into a list of tokens (bytes)
<latexit sha1_base64="Mp+fyy1nsKBbBTnrEZouWnT6YhE=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4KjPia1nUhcsK9gHToWTSTBuaSYbkjlCGfoYbF4q49Wvc+Tdm2llo64HA4Zx7ybknTAQ34LrfTmlldW19o7xZ2dre2d2r7h+0jUo1ZS2qhNLdkBgmuGQt4CBYN9GMxKFgnXB8m/udJ6YNV/IRJgkLYjKUPOKUgJX8XkxgRInI7qb9as2tuzPgZeIVpIYKNPvVr95A0TRmEqggxviem0CQEQ2cCjat9FLDEkLHZMh8SyWJmQmyWeQpPrHKAEdK2ycBz9TfGxmJjZnEoZ3MI5pFLxf/8/wUousg4zJJgUk6/yhKBQaF8/vxgGtGQUwsIVRzmxXTEdGEgm2pYkvwFk9eJu2zundZv3g4rzVuijrK6Agdo1PkoSvUQPeoiVqIIoWe0St6c8B5cd6dj/loySl2DtEfOJ8/eB2RZA==</latexit> <latexit sha1_base64="Mp+fyy1nsKBbBTnrEZouWnT6YhE=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4KjPia1nUhcsK9gHToWTSTBuaSYbkjlCGfoYbF4q49Wvc+Tdm2llo64HA4Zx7ybknTAQ34LrfTmlldW19o7xZ2dre2d2r7h+0jUo1ZS2qhNLdkBgmuGQt4CBYN9GMxKFgnXB8m/udJ6YNV/IRJgkLYjKUPOKUgJX8XkxgRInI7qb9as2tuzPgZeIVpIYKNPvVr95A0TRmEqggxviem0CQEQ2cCjat9FLDEkLHZMh8SyWJmQmyWeQpPrHKAEdK2ycBz9TfGxmJjZnEoZ3MI5pFLxf/8/wUousg4zJJgUk6/yhKBQaF8/vxgGtGQUwsIVRzmxXTEdGEgm2pYkvwFk9eJu2zundZv3g4rzVuijrK6Agdo1PkoSvUQPeoiVqIIoWe0St6c8B5cd6dj/loySl2DtEfOJ8/eB2RZA==</latexit>

• While |V| < N : 63 • While |V| < N : Lecture 3: Tokenization


<latexit sha1_base64="f0Eq4chOXdaCL4akh6X1BvQxXWM=">AAAB+HicbVDLSsNAFL2pr1ofjbp0M1gEVyURqy6LblxWsA9oQ5lMJ+3QySTMTISa9kvcuFDErZ/izr9x0mahrQcGDufcyz1z/JgzpR3n2yqsrW9sbhW3Szu7e/tl++CwpaJEEtokEY9kx8eKciZoUzPNaSeWFIc+p21/fJv57UcqFYvEg57E1AvxULCAEayN1LfLU9QLsR4RzNPWDE37dsWpOnOgVeLmpAI5Gn37qzeISBJSoQnHSnVdJ9ZeiqVmhNNZqZcoGmMyxkPaNVTgkCovnQefoVOjDFAQSfOERnP190aKQ6UmoW8ms5Bq2cvE/7xuooNrL2UiTjQVZHEoSDjSEcpaQAMmKdF8YggmkpmsiIywxESbrkqmBHf5y6ukdV51L6u1+4tK/SavowjHcAJn4MIV1OEOGtAEAgk8wyu8WU/Wi/VufSxGC1a+cwR/YH3+AIm1kwc=</latexit> <latexit sha1_base64="f0Eq4chOXdaCL4akh6X1BvQxXWM=">AAAB+HicbVDLSsNAFL2pr1ofjbp0M1gEVyURqy6LblxWsA9oQ5lMJ+3QySTMTISa9kvcuFDErZ/izr9x0mahrQcGDufcyz1z/JgzpR3n2yqsrW9sbhW3Szu7e/tl++CwpaJEEtokEY9kx8eKciZoUzPNaSeWFIc+p21/fJv57UcqFYvEg57E1AvxULCAEayN1LfLU9QLsR4RzNPWDE37dsWpOnOgVeLmpAI5Gn37qzeISBJSoQnHSnVdJ9ZeiqVmhNNZqZcoGmMyxkPaNVTgkCovnQefoVOjDFAQSfOERnP190aKQ6UmoW8ms5Bq2cvE/7xuooNrL2UiTjQVZHEoSDjSEcpaQAMmKdF8YggmkpmsiIywxESbrkqmBHf5y6ukdV51L6u1+4tK/SavowjHcAJn4MIV1OEOGtAEAgk8wyu8WU/Wi/VufSxGC1a+cwR/YH3+AIm1kwc=</latexit>

<latexit sha1_base64="Myr3Kcmiy/fK9XQuulchFknRygM=">AAAB6HicbVDLSgNBEOyNrxhfUY9eBoPgKeyKr2PQiydJwDwgWcLspJOMmZ1dZmaFsOQLvHhQxKuf5M2/cZLsQRMLGoqqbrq7glhwbVz328mtrK6tb+Q3C1vbO7t7xf2Dho4SxbDOIhGpVkA1Ci6xbrgR2IoV0jAQ2AxGt1O/+YRK80g+mHGMfkgHkvc5o8ZKtftuseSW3RnIMvEyUoIM1W7xq9OLWBKiNExQrdueGxs/pcpwJnBS6CQaY8pGdIBtSyUNUfvp7NAJObFKj/QjZUsaMlN/T6Q01HocBrYzpGaoF72p+J/XTkz/2k+5jBODks0X9RNBTESmX5MeV8iMGFtCmeL2VsKGVFFmbDYFG4K3+PIyaZyVvcvyRe28VLnJ4sjDERzDKXhwBRW4gyrUgQHCM7zCm/PovDjvzse8NedkM4fwB87nD6nRjNw=</latexit> <latexit sha1_base64="Myr3Kcmiy/fK9XQuulchFknRygM=">AAAB6HicbVDLSgNBEOyNrxhfUY9eBoPgKeyKr2PQiydJwDwgWcLspJOMmZ1dZmaFsOQLvHhQxKuf5M2/cZLsQRMLGoqqbrq7glhwbVz328mtrK6tb+Q3C1vbO7t7xf2Dho4SxbDOIhGpVkA1Ci6xbrgR2IoV0jAQ2AxGt1O/+YRK80g+mHGMfkgHkvc5o8ZKtftuseSW3RnIMvEyUoIM1W7xq9OLWBKiNExQrdueGxs/pcpwJnBS6CQaY8pGdIBtSyUNUfvp7NAJObFKj/QjZUsaMlN/T6Q01HocBrYzpGaoF72p+J/XTkz/2k+5jBODks0X9RNBTESmX5MeV8iMGFtCmeL2VsKGVFFmbDYFG4K3+PIyaZyVvcvyRe28VLnJ4sjDERzDKXhwBRW4gyrUgQHCM7zCm/PovDjvzse8NedkM4fwB87nD6nRjNw=</latexit>


fi
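A minimal sketch of the byte-level variant in Python. It assumes the loop body is the usual BPE step (merge the most frequent adjacent pair), since the slide only shows the loop header. The function name `train_byte_bpe` and the pre-tokenization regex are illustrative choices, not part of the lecture.

```python
import re
from collections import Counter


def train_byte_bpe(text, vocab_size):
    """Toy byte-level BPE trainer. Returns (vocab, merges)."""
    # Pre-tokenize: split before whitespace/punctuation.
    # This regex is a simple stand-in and drops runs of extra whitespace.
    words = re.findall(r"\s?\w+|\s?[^\w\s]+", text)

    # Convert D into tokens: each word becomes a tuple of single-byte tokens.
    corpus = Counter(
        tuple(bytes([b]) for b in w.encode("utf-8")) for w in words
    )

    # Initialize V as the set of bytes that occur in D.
    vocab = {tok for word in corpus for tok in word}
    merges = []

    while len(vocab) < vocab_size:
        # Count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for word, count in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single token

        # Assumed merge rule: most frequent pair, ties broken arbitrarily.
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)

        # Rewrite the corpus with the merged token.
        new_corpus = Counter()
        for word, count in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus

    return vocab, merges
```

Calling `train_byte_bpe("low low lower lowest", 10)` starts from 8 distinct bytes and performs two merges, ending with the token `b"low"` in the vocabulary.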
Variants: byte-based
While the character-based GPT tokenizer fails on emojis and Japanese, the byte-based GPT-2 tokenizer succeeds!
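To see why the byte-based variant cannot hit an unknown symbol, here is a small sketch. The character vocabulary is a made-up English-only alphabet. The byte vocabulary is assumed to cover all 256 byte values, as in GPT-2, rather than only the bytes seen in D.

```python
def char_tokens(text):
    """Split text into Unicode characters."""
    return list(text)


def byte_tokens(text):
    """Split text into single-byte tokens of its UTF-8 encoding."""
    return [bytes([b]) for b in text.encode("utf-8")]


# Hypothetical character vocabulary learned from English-only data.
char_vocab = set("abcdefghijklmnopqrstuvwxyz ")

# Assumed byte vocabulary: all 256 possible byte values.
byte_vocab = {bytes([b]) for b in range(256)}

# "hello", a waving-hand emoji, and the Japanese word for "Japanese".
text = "hello \U0001F44B \u65e5\u672c\u8a9e"

# Characters outside the character vocabulary have no token at all.
unknown_chars = [c for c in char_tokens(text) if c not in char_vocab]

# Every UTF-8 byte is in the byte vocabulary, so nothing is unknown.
unknown_bytes = [b for b in byte_tokens(text) if b not in byte_vocab]

print(len(unknown_chars), len(unknown_bytes), len(byte_tokens(text)))
```

The trade-off is sequence length: the 11 characters of `text` become 20 byte tokens before any merges are applied.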
Variants: WordPiece objective
• To merge, we selected the bigram with the highest frequency $p(v_i, v_j)$
• This is the same as the bigram with the highest probability!
• Instead, we could choose the bigram which would maximize the likelihood of the data after the merge is made (also called WordPiece!)

Original (BPE):
- For the most frequent bigram $v_i, v_j$ (breaking ties arbitrarily)
  (Same as the bigram which maximizes $p(v_i, v_j)$)

Modified (WordPiece):
+ For the bigram $v_i, v_j$ that would maximize the likelihood of the training data once the change is made (breaking ties arbitrarily)
  (Same as the bigram which maximizes $\frac{p(v_i, v_j)}{p(v_i)\,p(v_j)}$)
Variants: WordPiece objective

• BPE: the bigram with the highest frequency/highest probability $p(v_i, v_j)$
• WordPiece: the bigram which maximizes the likelihood of the data after the merge is made, $\frac{p(v_i, v_j)}{p(v_i)\,p(v_j)}$
• Maximizes the probability of the bigram, normalized by the probability of the unigrams
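The two objectives can pick different merges. The sketch below scores every adjacent pair in a toy corpus both ways. The function name `pair_scores` and the corpus are made up for illustration, and probabilities are estimated as plain relative frequencies.

```python
from collections import Counter


def pair_scores(corpus):
    """corpus: list of token lists.

    Returns {pair: (bpe_score, wordpiece_score)} where
    bpe_score       = p(v_i, v_j)
    wordpiece_score = p(v_i, v_j) / (p(v_i) * p(v_j))
    """
    unigrams, bigrams = Counter(), Counter()
    for word in corpus:
        unigrams.update(word)
        bigrams.update(zip(word, word[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = (p_ab, p_ab / (p_a * p_b))
    return scores


# Toy corpus: "e" and "s" are common on their own and sometimes adjacent;
# "q" and "u" are rare but only ever occur together.
corpus = (
    [["e", "s"]] * 4
    + [["e"]] * 6
    + [["s"]] * 6
    + [["q", "u"]] * 2
)

scores = pair_scores(corpus)
best_bpe = max(scores, key=lambda pair: scores[pair][0])
best_wordpiece = max(scores, key=lambda pair: scores[pair][1])
print(best_bpe, best_wordpiece)
```

Here BPE merges `("e", "s")` because it is the most frequent pair, while the normalized score prefers `("q", "u")`, whose parts almost never appear apart.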
Variants: WordPiece encoding
At inference time, instead of applying the merge rules in order, tokens are selected left-to-right greedily:

Encoding algorithm
Given string s and (unordered) vocab V,
• Initialize list of tokens T := []
• While len(s) > 0:
  • Find the longest token $t_i$ that matches the beginning of s
  • Let T := T + [$t_i$]
  • Pop the corresponding vocab entry $v_i$ off the front of s
• Return T
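The encoding algorithm translates almost line for line into Python. This is a sketch only: the name `greedy_encode` is illustrative, raising an error when nothing matches is an assumption (the slide does not say what happens in that case), and the continuation-prefix convention used by some WordPiece vocabularies is left out.

```python
def greedy_encode(s, vocab):
    """Greedy left-to-right longest-match encoding of s with vocab."""
    tokens = []  # T := []
    while len(s) > 0:
        # Find the longest vocab entry that is a prefix of s.
        match = None
        for end in range(len(s), 0, -1):
            if s[:end] in vocab:
                match = s[:end]
                break
        if match is None:
            # Assumption: fail loudly instead of emitting an unknown token.
            raise ValueError(f"no vocab entry matches the start of {s!r}")
        tokens.append(match)   # T := T + [t_i]
        s = s[len(match):]     # pop the match off the front of s
    return tokens


vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
print(greedy_encode("unbelievable", vocab))
```

Because the vocabulary is treated as an unordered set, no merge order is needed at inference time; only prefix lookups are.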
Variants: unigram objective
• BPE starts with a small vocabulary (characters) and builds up until the desired vocabulary size N
• The Unigram tokenization algorithm starts with a large vocabulary (all sub-word substrings) and throws away tokens until we reach size N
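The sketch below illustrates only the top-down direction of Unigram. It is not the real pruning criterion: the slide does not specify one, so raw substring frequency is used here as a loudly labeled stand-in. The names `seed_vocab` and `prune` and the `max_len` cap are assumptions for illustration.

```python
from collections import Counter


def seed_vocab(words, max_len=4):
    """Count every substring of each word, up to max_len characters."""
    counts = Counter()
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(len(w), i + max_len) + 1):
                counts[w[i:j]] += 1
    return counts


def prune(counts, target_size):
    """Throw away tokens until the vocabulary has target_size entries.

    Stand-in criterion (an assumption, not the Unigram objective):
    drop the rarest multi-character substrings first, ties broken
    alphabetically. Single characters are never dropped, so every
    word stays encodable.
    """
    vocab = set(counts)
    removable = sorted(
        (t for t in vocab if len(t) > 1),
        key=lambda t: (counts[t], t),
    )
    for t in removable:
        if len(vocab) <= target_size:
            break
        vocab.discard(t)
    return vocab


words = ["low", "lower", "lowest"]
seed = seed_vocab(words, max_len=3)
small = prune(seed, 10)
print(len(seed), sorted(small))
```

The seed vocabulary here has 18 entries against only 7 distinct characters, which is the contrast with BPE's bottom-up start.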
Examples of LLMs and their tokenizers
| Model/Tokenizer | Objective | Spaces part of token? | Pre-tokenization | Smallest unit |
|---|---|---|---|---|
| GPT | BPE | No | Yes | Character-level |
| GPT-2/3/4, ChatGPT, Llama(2), Falcon, … | BPE | Yes | Yes | Byte-level |
| Jurassic | BPE | Yes | No. “SentencePiece” - treat whitespace like char | Byte-level |
| Bert, DistilBert, Electra | WordPiece | No | Yes | Character-level |
| T5, ALBERT, XLNet, Marian | Unigram | Yes | No. “SentencePiece” - treat whitespace like char* | Character-level |

*For non-English languages
Next lecture: Recurrent neural networks
