6. Neural Language Models & Tokenization
Spring 2024
Many materials from CSE447@UW (Taylor Sorensen) and COS484@Princeton with special thanks!
Announcements
3 Lecture 3: Tokenization
Neural language models: overview
Neural language models: inputs/outputs
• Input: sequences of words (or tokens)
• Output: probability distribution over the next word (token)
[Figure: a neural network reads each prefix of "START I went to the park." and outputs a probability distribution over the next token. Top candidates at each step:]
p(x | START): The 3%, When 2.5%, They 2%, …, I 1%, …, Banana 0.1%, …
p(x | START I): think 11%, was 5%, went 2%, am 1%, will 1%, like 0.5%, …
p(x | · · · went): to 35%, back 8%, into 5%, through 4%, out 3%, on 2%, …
p(x | · · · to): the 29%, a 9%, see 5%, my 3%, bed 2%, school 1%, …
p(x | · · · the): bathroom 3%, doctor 2%, hospital 2%, store 1.5%, …, park 0.5%, …
p(x | · · · park): and 14%, with 9%, “,” 8%, to 7%, …, “.” 6%, …
p(x | START I went to the park.): I 21%, It 6%, The 3%, There 3%, …, STOP 1%, …
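As a rough illustration of how these per-step distributions combine, the sketch below multiplies the probabilities that the figure assigns to each token of "I went to the park." followed by STOP. Reading the figure as a chain-rule factorization, and the variable names, are assumptions made for illustration; the numbers are the ones shown in the figure.

```python
import math

# Probability the figure assigns to each observed next token, in order:
# (token, p(token | prefix)) for "I went to the park." followed by STOP.
steps = [
    ("I", 0.01),      # p(I | START)
    ("went", 0.02),   # p(went | START I)
    ("to", 0.35),     # p(to | ... went)
    ("the", 0.29),    # p(the | ... to)
    ("park", 0.005),  # p(park | ... the)
    (".", 0.06),      # p(. | ... park)
    ("STOP", 0.01),   # p(STOP | START I went to the park.)
]

# Sum log-probabilities instead of multiplying raw probabilities,
# which avoids numerical underflow on long sequences.
log_prob = sum(math.log(p) for _, p in steps)
sequence_prob = math.exp(log_prob)

print(f"log p(sequence) = {log_prob:.3f}")
print(f"p(sequence)     = {sequence_prob:.3e}")
```

Even a perfectly ordinary sentence gets a tiny probability, which is why log-probabilities are usually reported instead.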
Task:
Data:
Mapping each token id to its corresponding embedding
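A minimal sketch of this lookup, assuming a toy vocabulary and a randomly initialized table. The names `vocab`, `embedding_table`, and `embed` are hypothetical, and in a real model the table values would be learned parameters.

```python
import random

# Hypothetical toy vocabulary: token string -> integer id.
vocab = {"<unk>": 0, "i": 1, "went": 2, "to": 3, "the": 4, "park": 5}
embedding_dim = 4  # illustrative size only

# Embedding table: one row of `embedding_dim` floats per token id.
# Random values stand in for parameters that would be learned in training.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(tokens):
    """Map tokens to ids, then ids to rows of the embedding table."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed(["i", "went", "to", "the", "park"])
print(ids)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```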
Tokenization:
language
• Some problems with this:
• |V| can be quite large - ~470,000 words in Webster’s English Dictionary (3rd edition)
• Language is changing all the time - 690 words were added to Merriam-Webster in September 2023 (“rizz”, “goated”, “mid”)
• Long tail of infrequent words. Many words just occur a few times
• Some words may not appear in a training set of documents
• No modeled relationship between words - e.g., “run”, “ran”, “runs”, “runner”
are all separate entries despite being linked in meaning
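The problems listed above can be seen in a few lines. This is a sketch under stated assumptions: the toy corpus, the `<unk>` placeholder, and the names `word_to_id` and `encode` are hypothetical choices for illustration, not part of the lecture.

```python
from collections import Counter

# Hypothetical training documents (a small toy corpus).
train_docs = ["i hug pugs", "hugging pugs is fun", "i make puns"]

# Word-level vocabulary: every distinct whitespace-separated word gets an id.
counts = Counter(word for doc in train_docs for word in doc.split())
word_to_id = {"<unk>": 0}
for word in sorted(counts):
    word_to_id[word] = len(word_to_id)

def encode(text):
    """Map each word to its id; words unseen in training collapse to <unk>."""
    return [word_to_id.get(w, word_to_id["<unk>"]) for w in text.split()]

# "hugs" and "rizz" never appeared in training, so both become <unk>,
# even though "hugs" is clearly related to "hug" and "hugging".
print(encode("i hug pugs"))
print(encode("pugs hugs rizz"))
```

Note that "hug" and "hugging" also receive unrelated ids, so nothing in the vocabulary records that they share a stem.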
Character-level?
What about representing text with characters?
• V = {a, b, c, ..., z}
• (Maybe add capital letters, punctuation, spaces, …)
• Pros:
• Small vocabulary size (|V| = 26 for English)
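A small sketch of character-level encoding, assuming lowercase English text. Adding a space to the 26 letters and the names `char_to_id` and `encode_chars` are assumptions for illustration.

```python
import string

# Character vocabulary: the 26 lowercase English letters.
# Adding a space character is an assumption so multi-word text can be encoded.
alphabet = list(string.ascii_lowercase)
char_to_id = {ch: i for i, ch in enumerate(alphabet + [" "])}

def encode_chars(text):
    """Encode text one character at a time (KeyError on unsupported characters)."""
    return [char_to_id[ch] for ch in text.lower()]

print(len(alphabet))                # size of the letter-only vocabulary
print(encode_chars("rizz"))         # a new word needs no new vocabulary entry
print(len(encode_chars("i went to the park")))  # one token per character
```

Every string over the alphabet can be encoded, at the cost of one token per character.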
Byte-pair encoding: ChatGPT example
GPT-3.5/GPT-4/ChatGPT 100k
GPT-2/GPT-3 50k
Llama2 32k
Falcon 65k
Byte-pair encoding (BPE): algorithm
Required:
• Documents D
• Desired vocabulary size N (greater than chars in D)
Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
• Convert D into a list of tokens (characters)
• While |V| < N:
  • Get counts of all bigrams in D
  • For the most frequent bigram v_i, v_j (breaking ties arbitrarily):
    • Let n := |V| + 1
    • Let v_n := concat(v_i, v_j)
    • Change all instances in D of v_i, v_j to v_n and add v_n to V
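The algorithm can be sketched in Python as follows. This is a minimal illustration, not a production tokenizer: the function names `pretokenize` and `train_bpe` are hypothetical, the regular expression is one assumed way to "split before whitespace/punctuation", and ties are broken by whatever order `Counter.most_common` returns.

```python
import re
from collections import Counter

def pretokenize(doc):
    """Split before whitespace/punctuation so each word keeps its leading space."""
    return re.findall(r"\s?\w+|\s?[^\w\s]", doc)

def train_bpe(docs, vocab_size):
    # Pre-tokenize, then convert each word into a list of character tokens.
    words = [list(w) for doc in docs for w in pretokenize(doc)]
    vocab = sorted({ch for w in words for ch in w})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs (bigrams) within each word.
        pair_counts = Counter()
        for w in words:
            pair_counts.update(zip(w, w[1:]))
        if not pair_counts:
            break  # every word is a single token; nothing left to merge
        # Most frequent bigram (ties resolved by Counter's ordering).
        (left, right), _ = pair_counts.most_common(1)[0]
        new_token = left + right
        # Replace every occurrence of (left, right) with the merged token.
        for idx, w in enumerate(words):
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == left and w[i + 1] == right:
                    merged.append(new_token)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            words[idx] = merged
        vocab.append(new_token)
        merges.append((left, right))
    return vocab, merges, words

docs = ["i hug pugs", "hugging pugs is fun", "i make puns"]
vocab, merges, words = train_bpe(docs, vocab_size=16)
print(merges)
print(words)
```

With this toy corpus there are 13 distinct characters, so `vocab_size=16` performs three merges. The list of merges, in order, is what a trained tokenizer would later replay on new text.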
Byte-pair encoding: example
Required:
• Documents D = {“i hug pugs”, “hugging pugs is fun”, “i make puns”}
Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation):
D = {“i”, “ hug”, “ pugs”, “hugging”, “ pugs”, “ is”, “ fun”, “i”, “ make”, “ puns”}
• Convert D into a list of tokens (characters):
D = { [‘i’] , [‘ ’, ‘h’, ‘u’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
[‘h’, ‘u’, ‘g’, ‘g’, ‘i’, ‘n’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
[‘ ’, ‘i’, ‘s’] , [‘ ’, ‘f’, ‘u’, ‘n’] , [‘i’] ,
[‘ ’, ‘m’, ‘a’, ‘k’, ‘e’] , [‘ ’, ‘p’, ‘u’, ‘n’, ‘s’] }
Implementation aside: We normally store D with the token indices instead:
D = { [7] , [1, 6, 13, 5] , [1, 11, 13, 5, 12] ,
[6, 13, 5, 5, 7, 10, 5] , [1, 11, 13, 5, 12] , [1, 7, 12] ,
[1, 4, 13, 10] , [7] , [1, 9, 2, 8, 3] , [1, 11, 13, 10, 12] }
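The index form of D can be reproduced with a short sketch. The assignment rule here (sorted character order, starting at 1) is an assumption; it happens to match the indices shown on the slide, but the slide does not state how ids were chosen.

```python
# Pre-tokenized toy corpus; each word keeps its leading space.
words = ["i", " hug", " pugs", "hugging", " pugs",
         " is", " fun", "i", " make", " puns"]

# Assumption: ids are assigned in sorted character order, starting at 1.
chars = sorted({ch for w in words for ch in w})
char_to_id = {ch: i for i, ch in enumerate(chars, start=1)}

# Store each word as a list of token ids instead of characters.
encoded = [[char_to_id[ch] for ch in w] for w in words]
for w, ids in zip(words, encoded):
    print(repr(w), ids)
```

Working with integer ids keeps later merge steps cheap, since a new token v_n is simply the next unused index n = |V| + 1.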
Byte-pair encoding: example
Required:
D = { [‘i’] , [‘ ’, ‘h’, ‘u’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
<latexit sha1_base64="JYwVy/JOJ32LM/WIMvX0Fdcktek=">AAADm3icpVJda9swFFXsfXTZR9PuZTAGYmHrHkqwx9b2ZaOsg5Wxhw6WthCZVFbkWESWjXRdGox/1P7K3vZvJqVhxHE3Brugw+Hec++RxI0LKQwEwc+O59+6fefuxr3u/QcPH232trZPTV5qxocsl7k+j6nhUig+BAGSnxea0yyW/CyeHbn62SXXRuTqG8wLHmV0qkQiGAWbGm91vpOMQsqorD7W+B0m1UsieQIjAvwKqguxUxMtpilEu7hRwDu7+CJ1UDqY/lVXrOgsmBUxJqTbtEyb4gUIB+q/jJzPDR2ipWtJkt9D1Z90Ys2oPSRzQB3MHPB/fIZqXg+Tujvu9YNBsAjcJuGS9NEyTsa9H2SSszLjCpikxozCoICoohoEk7zuktLwgrIZnfKRpYpm3ETVYrdq/MJmJjjJtT0K8CK72lHRzJh5FlulWySzXnPJm2qjEpKDqBKqKIErdm2UlBJDjt2i4onQnIGcW0KZFvaumKVUUwZ2nd0nhOtPbpPT14Nwb/D265v+4Yfld2ygp+g5eoVCtI8O0TE6QUPEvCfee++Td+w/84/8z/6Xa6nXWfY8Ro3wh78AsdoXeA==</latexit>
• Documents D
<latexit sha1_base64="Mp+fyy1nsKBbBTnrEZouWnT6YhE=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4KjPia1nUhcsK9gHToWTSTBuaSYbkjlCGfoYbF4q49Wvc+Tdm2llo64HA4Zx7ybknTAQ34LrfTmlldW19o7xZ2dre2d2r7h+0jUo1ZS2qhNLdkBgmuGQt4CBYN9GMxKFgnXB8m/udJ6YNV/IRJgkLYjKUPOKUgJX8XkxgRInI7qb9as2tuzPgZeIVpIYKNPvVr95A0TRmEqggxviem0CQEQ2cCjat9FLDEkLHZMh8SyWJmQmyWeQpPrHKAEdK2ycBz9TfGxmJjZnEoZ3MI5pFLxf/8/wUousg4zJJgUk6/yhKBQaF8/vxgGtGQUwsIVRzmxXTEdGEgm2pYkvwFk9eJu2zundZv3g4rzVuijrK6Agdo1PkoSvUQPeoiVqIIoWe0St6c8B5cd6dj/loySl2DtEfOJ8/eB2RZA==</latexit>
• Desired vocabulary size N (greater than chars in D)
Algorithm:
• Pre-tokenize D by splitting into words (split before …)
• Convert D into a list of tokens (characters)
• While |V| < N:
    • Let n := |V| + 1
    • Get counts of all bigrams in D
    • For the most frequent bigram v_i, v_j (breaking ties arbitrarily):
        • Let v_n := concat(v_i, v_j)
        • Change all instances in D of v_i, v_j to v_n and add v_n to V
D (continued):
[‘h’, ‘u’, ‘g’, ‘g’, ‘i’, ‘n’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
[‘ ’, ‘i’, ‘s’] , [‘ ’, ‘f’, ‘u’, ‘n’] , [‘i’] ,
Bigram      Count
‘u’, ‘g’    4
‘p’, ‘u’    3
‘ ’, ‘p’    3
‘h’, ‘u’    2
…           …
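The training loop above can be sketched in Python. This is a minimal sketch under stated assumptions, not a reference implementation. The function names (`pretokenize`, `count_bigrams`, `merge_pair`, `train_bpe`) are made up for illustration. The pre-tokenization rule (each word keeps its leading space) is inferred from tokens such as [‘ ’, ‘h’, ‘u’, ‘g’]. Ties are broken by first occurrence, which is one arbitrary choice among many. The third document in the toy corpus is a guess chosen to be consistent with the vocabulary shown on the slides.

```python
import re
from collections import Counter

def pretokenize(doc):
    # Assumed rule: each word keeps its leading space, e.g. "i hug" -> ["i", " hug"].
    return re.findall(r"\s?\S+", doc)

def count_bigrams(words):
    # Count adjacent token pairs inside each word (never across word boundaries).
    counts = Counter()
    for tokens in words:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def merge_pair(tokens, pair):
    # Replace every non-overlapping occurrence of `pair` with the concatenated token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(docs, vocab_size):
    # D as a flat list of words, each word a list of single-character tokens.
    words = [list(w) for doc in docs for w in pretokenize(doc)]
    vocab = sorted({ch for tokens in words for ch in tokens})
    merges = []
    while len(vocab) < vocab_size:
        counts = count_bigrams(words)
        if not counts:  # nothing left to merge
            break
        # Most frequent bigram; max() keeps the first one seen on ties.
        pair = max(counts, key=counts.get)
        words = [merge_pair(tokens, pair) for tokens in words]
        new_token = pair[0] + pair[1]
        if new_token not in vocab:
            vocab.append(new_token)
        merges.append(pair)
    return vocab, merges

# Toy corpus: the third document is an assumption (see the note above).
docs = ["i hug pugs", "hugging pugs is fun", "i make puns"]
vocab, merges = train_bpe(docs, vocab_size=20)
print(merges[:2])  # [('u', 'g'), (' ', 'p')]
print(len(vocab))  # 20
```

Only the first two merges are determined by the counts alone. From the third merge on, several bigrams tie, so a different tie-breaking rule would give a different but equally valid vocabulary.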
Byte-pair encoding: example
D = { [‘i’] , [‘ ’, ‘h’, ‘u’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
[‘h’, ‘u’, ‘g’, ‘g’, ‘i’, ‘n’, ‘g’] , [‘ ’, ‘p’, ‘u’, ‘g’, ‘s’] ,
V = {‘ ’, ‘a’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘k’, ‘m’,
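As a minimal sketch of this starting state, the following Python snippet builds the character-level D and the initial vocabulary V. The corpus strings are assumptions: the first two documents are read off the token rows above, and the third is a guess consistent with the characters listed in V. The helper name `pretokenize` and its rule of splitting before whitespace are likewise assumptions.

```python
import re

def pretokenize(doc):
    # Assumed rule: split before whitespace, so each word keeps its leading space.
    return re.findall(r"\s?\S+", doc)

# Toy corpus: the third document is an assumption (see the note above).
docs = ["i hug pugs", "hugging pugs is fun", "i make puns"]

# D: one list per document, each word converted to a list of characters.
D = [[list(word) for word in pretokenize(doc)] for doc in docs]

# V: the set of distinct characters in D, sorted for display.
V = sorted({ch for doc in D for word in doc for ch in word})

print(D[0])    # [['i'], [' ', 'h', 'u', 'g'], [' ', 'p', 'u', 'g', 's']]
print(V)
print(len(V))  # 13
```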
After merging ‘u’, ‘g’ into ‘ug’:
[‘h’, ‘ug’, ‘g’, ‘i’, ‘n’, ‘g’] , [‘ ’, ‘p’, ‘ug’, ‘s’] ,
Bigram       Count
‘ ’, ‘p’     3
‘p’, ‘ug’    2
‘ug’, ‘s’    2
‘u’, ‘n’     2
…            …
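A single merge step and the recount that follows it can be sketched as below. The helper names `count_bigrams` and `merge_pair` are hypothetical. The word list is an assumption: it is reconstructed from the token rows on these slides, with the last two words guessed so that the counts agree with the table.

```python
from collections import Counter

def count_bigrams(words):
    # Count adjacent token pairs inside each word.
    counts = Counter()
    for tokens in words:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def merge_pair(tokens, pair):
    # Replace every non-overlapping occurrence of `pair` with one merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Assumed pre-tokenized corpus (the last two words are a guess).
raw_words = ["i", " hug", " pugs", "hugging", " pugs",
             " is", " fun", "i", " make", " puns"]
words = [list(w) for w in raw_words]

# Apply the first merge, then recount bigrams over the updated D.
words = [merge_pair(tokens, ("u", "g")) for tokens in words]
counts = count_bigrams(words)

print(words[3])  # ['h', 'ug', 'g', 'i', 'n', 'g']
for pair in [(" ", "p"), ("p", "ug"), ("ug", "s"), ("u", "n")]:
    print(pair, counts[pair])
```

Recounting from scratch after every merge is the simplest way to stay faithful to the loop on the slides. Faster implementations update the counts incrementally, which this sketch does not attempt.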
Byte-pair encoding: example
D = { [‘i’] , [‘ ’, ‘h’, ‘ug’] , [‘ ’, ‘p’, ‘ug’, ‘s’] ,
Byte-pair encoding: example
Repeat until |V| = N …
<latexit sha1_base64="y68mdarm10+XMeklLrKbgNCl/2U=">AAAB+nicbVDLSsNAFL2pr1pfqS7dDBbBVUnE10YounElFewD2lAm00k7dDIJMxOlpP0UNy4UceuXuPNvnLRZaPXAwOGce7lnjh9zprTjfFmFpeWV1bXiemljc2t7xy7vNlWUSEIbJOKRbPtYUc4EbWimOW3HkuLQ57Tlj64zv/VApWKRuNfjmHohHggWMIK1kXp2edINsR4SzNPmdIIu0W3PrjhVZwb0l7g5qUCOes/+7PYjkoRUaMKxUh3XibWXYqkZ4XRa6iaKxpiM8IB2DBU4pMpLZ9Gn6NAofRRE0jyh0Uz9uZHiUKlx6JvJLKda9DLxP6+T6ODCS5mIE00FmR8KEo50hLIeUJ9JSjQfG4KJZCYrIkMsMdGmrZIpwV388l/SPK66Z9XTu5NK7Sqvowj7cABH4MI51OAG6tAAAo/wBC/wak2sZ+vNep+PFqx8Zw9+wfr4BrZek6Y=</latexit>
Required:
• Documents D
<latexit sha1_base64="Mp+fyy1nsKBbBTnrEZouWnT6YhE=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4KjPia1nUhcsK9gHToWTSTBuaSYbkjlCGfoYbF4q49Wvc+Tdm2llo64HA4Zx7ybknTAQ34LrfTmlldW19o7xZ2dre2d2r7h+0jUo1ZS2qhNLdkBgmuGQt4CBYN9GMxKFgnXB8m/udJ6YNV/IRJgkLYjKUPOKUgJX8XkxgRInI7qb9as2tuzPgZeIVpIYKNPvVr95A0TRmEqggxviem0CQEQ2cCjat9FLDEkLHZMh8SyWJmQmyWeQpPrHKAEdK2ycBz9TfGxmJjZnEoZ3MI5pFLxf/8/wUousg4zJJgUk6/yhKBQaF8/vxgGtGQUwsIVRzmxXTEdGEgm2pYkvwFk9eJu2zundZv3g4rzVuijrK6Agdo1PkoSvUQPeoiVqIIoWe0St6c8B5cd6dj/loySl2DtEfOJ8/eB2RZA==</latexit>
<latexit sha1_base64="Myr3Kcmiy/fK9XQuulchFknRygM=">AAAB6HicbVDLSgNBEOyNrxhfUY9eBoPgKeyKr2PQiydJwDwgWcLspJOMmZ1dZmaFsOQLvHhQxKuf5M2/cZLsQRMLGoqqbrq7glhwbVz328mtrK6tb+Q3C1vbO7t7xf2Dho4SxbDOIhGpVkA1Ci6xbrgR2IoV0jAQ2AxGt1O/+YRK80g+mHGMfkgHkvc5o8ZKtftuseSW3RnIMvEyUoIM1W7xq9OLWBKiNExQrdueGxs/pcpwJnBS6CQaY8pGdIBtSyUNUfvp7NAJObFKj/QjZUsaMlN/T6Q01HocBrYzpGaoF72p+J/XTkz/2k+5jBODks0X9RNBTESmX5MeV8iMGFtCmeL2VsKGVFFmbDYFG4K3+PIyaZyVvcvyRe28VLnJ4sjDERzDKXhwBRW4gyrUgQHCM7zCm/PovDjvzse8NedkM4fwB87nD6nRjNw=</latexit>
Algorithm:
• Pre-tokenize D by splitting into words (split before
<latexit sha1_base64="Mp+fyy1nsKBbBTnrEZouWnT6YhE=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4KjPia1nUhcsK9gHToWTSTBuaSYbkjlCGfoYbF4q49Wvc+Tdm2llo64HA4Zx7ybknTAQ34LrfTmlldW19o7xZ2dre2d2r7h+0jUo1ZS2qhNLdkBgmuGQt4CBYN9GMxKFgnXB8m/udJ6YNV/IRJgkLYjKUPOKUgJX8XkxgRInI7qb9as2tuzPgZeIVpIYKNPvVr95A0TRmEqggxviem0CQEQ2cCjat9FLDEkLHZMh8SyWJmQmyWeQpPrHKAEdK2ycBz9TfGxmJjZnEoZ3MI5pFLxf/8/wUousg4zJJgUk6/yhKBQaF8/vxgGtGQUwsIVRzmxXTEdGEgm2pYkvwFk9eJu2zundZv3g4rzVuijrK6Agdo1PkoSvUQPeoiVqIIoWe0St6c8B5cd6dj/loySl2DtEfOJ8/eB2RZA==</latexit>
<latexit sha1_base64="Mp+fyy1nsKBbBTnrEZouWnT6YhE=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4KjPia1nUhcsK9gHToWTSTBuaSYbkjlCGfoYbF4q49Wvc+Tdm2llo64HA4Zx7ybknTAQ34LrfTmlldW19o7xZ2dre2d2r7h+0jUo1ZS2qhNLdkBgmuGQt4CBYN9GMxKFgnXB8m/udJ6YNV/IRJgkLYjKUPOKUgJX8XkxgRInI7qb9as2tuzPgZeIVpIYKNPvVr95A0TRmEqggxviem0CQEQ2cCjat9FLDEkLHZMh8SyWJmQmyWeQpPrHKAEdK2ycBz9TfGxmJjZnEoZ3MI5pFLxf/8/wUousg4zJJgUk6/yhKBQaF8/vxgGtGQUwsIVRzmxXTEdGEgm2pYkvwFk9eJu2zundZv3g4rzVuijrK6Agdo1PkoSvUQPeoiVqIIoWe0St6c8B5cd6dj/loySl2DtEfOJ8/eB2RZA==</latexit>
<latexit sha1_base64="Myr3Kcmiy/fK9XQuulchFknRygM=">AAAB6HicbVDLSgNBEOyNrxhfUY9eBoPgKeyKr2PQiydJwDwgWcLspJOMmZ1dZmaFsOQLvHhQxKuf5M2/cZLsQRMLGoqqbrq7glhwbVz328mtrK6tb+Q3C1vbO7t7xf2Dho4SxbDOIhGpVkA1Ci6xbrgR2IoV0jAQ2AxGt1O/+YRK80g+mHGMfkgHkvc5o8ZKtftuseSW3RnIMvEyUoIM1W7xq9OLWBKiNExQrdueGxs/pcpwJnBS6CQaY8pGdIBtSyUNUfvp7NAJObFKj/QjZUsaMlN/T6Q01HocBrYzpGaoF72p+J/XTkz/2k+5jBODks0X9RNBTESmX5MeV8iMGFtCmeL2VsKGVFFmbDYFG4K3+PIyaZyVvcvyRe28VLnJ4sjDERzDKXhwBRW4gyrUgQHCM7zCm/PovDjvzse8NedkM4fwB87nD6nRjNw=</latexit>
n := |V| + 1
<latexit sha1_base64="qHDdAiIl6F+PIAHnOC1uenCsXQI=">AAACAHicbVDLSsNAFL3xWesr6sKFm8EiCEJJxBeCUHTjsoJ9QBvKZDpph04mYWYilJiNv+LGhSJu/Qx3/o2TtgttPTBw5px7ufceP+ZMacf5tubmFxaXlgsrxdW19Y1Ne2u7rqJEElojEY9k08eKciZoTTPNaTOWFIc+pw1/cJP7jQcqFYvEvR7G1AtxT7CAEayN1LF3Bbq8Qo+oHWLdJ5in9cz8jtyOXXLKzgholrgTUoIJqh37q92NSBJSoQnHSrVcJ9ZeiqVmhNOs2E4UjTEZ4B5tGSpwSJWXjg7I0IFRuiiIpHlCo5H6uyPFoVLD0DeV+Z5q2svF/7xWooMLL2UiTjQVZDwoSDjSEcrTQF0mKdF8aAgmkpldEeljiYk2mRVNCO70ybOkflx2z8qndyelyvUkjgLswT4cggvnUIFbqEINCGTwDK/wZj1ZL9a79TEunbMmPTvwB9bnD1QnlPg=</latexit>
• Let V = {‘ ’, ‘a’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘k’, ‘m’,‘n’, ‘p’, ‘s’, ‘u’,
<latexit sha1_base64="KOOrUjEbeZqzWAw9vn5Se+XTe/M=">AAACmHicbVFdb9MwFHXCgFG+OhAv24tFBeWhqpIJBhJCqmDT4G0g2k1qqtZxb1qrjhPZN4gqy2/iv/DGv5mdBm1sXMnHJ+feY99cx7kUBoPgj+ff2rp95+72vdb9Bw8fPW7vPBmZrNAchjyTmT6LmQEpFAxRoISzXANLYwmn8eqTy5/+AG1Epr7jOodJyhZKJIIztNK0/StKGS45k+Wooh9oVL6MEH5iOaPdHp0xB+AgcbBwsHQgHKwcpN3eTDmSOzAOim6volHUao6yQm2kdcXy78flvjHVh1CXrqKqV/vPrzR3brvbD1rTdifoB3XQmyRsSIc0cTJt/47mGS9SUMglM2YcBjlOSqZRcAlVKyoM5Iyv2ALGliqWgpmU9WAr+sIqc5pk2i6FtFavOkqWGrNOY1vpOjXXc078X25cYPJuUgqVFwiKby5KCkkxo+6V6Fxo4CjXljCuhe2V8iXTjKN9SzeE8Pov3ySj/X540H/z9XVn8LEZxzbZI8/JKxKSt2RAPpMTMiTce+a99w69I3/XH/jH/pdNqe81nqfkn/C/XQDlHL2y</latexit>
|V| = 20
• Let v_n := concat(v_i, v_j)
• Change all instances in D of v_i, v_j to v_n and add v_n to V
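The merge step above can be sketched in Python. This is a minimal sketch, not a reference implementation: it assumes D is a list of pre-tokenized words, that the bigram chosen at each step is the most frequent adjacent pair (ties broken by first occurrence), and the names `most_frequent_pair`, `apply_merge`, and `train_bpe` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(docs):
    """Count adjacent token pairs inside each word; return the most frequent (or None)."""
    counts = Counter()
    for word in docs:
        counts.update(zip(word, word[1:]))
    return max(counts, key=counts.get) if counts else None

def apply_merge(docs, pair):
    """Change all instances of (v_i, v_j) in D to v_n := concat(v_i, v_j)."""
    v_i, v_j = pair
    merged = []
    for word in docs:
        out, k = [], 0
        while k < len(word):
            if k + 1 < len(word) and word[k] == v_i and word[k + 1] == v_j:
                out.append(v_i + v_j)
                k += 2
            else:
                out.append(word[k])
                k += 1
        merged.append(out)
    return merged

def train_bpe(words, vocab_size):
    """Start from characters and merge until |V| reaches vocab_size (or no pairs remain)."""
    docs = [list(w) for w in words]
    vocab = sorted({ch for w in docs for ch in w})
    while len(vocab) < vocab_size:
        pair = most_frequent_pair(docs)
        if pair is None:
            break
        docs = apply_merge(docs, pair)
        v_n = pair[0] + pair[1]
        if v_n not in vocab:
            vocab.append(v_n)
    return vocab, docs

# Toy corpus (an assumption for illustration, not the corpus from the slides).
vocab, docs = train_bpe(["hug", " pug", " pugs", "hug"], vocab_size=8)
print(vocab)
print(docs)
```

Each pass is greedy: it only looks at the current pair counts, which is why BPE is cheap to run compared with a global optimization over vocabularies.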
Byte-pair encoding: example
CHANGES FROM START
D = { [‘i’] , [‘ hug’] , [‘ pugs’] ,
kept adding vocabulary until you couldn’t anymore?
8 : ‘k’, 9 : ‘m’, 10 : ‘n’, 11 : ‘p’, 12 : ‘s’, 13 : ‘u’,
14 : ‘ug’, 15 : ‘ p’, 16 : ‘hug’, 17 : ‘ pug’, 18 : ‘ pugs’,
19 : ‘un’, 20 : ‘ hug’}
Byte-pair encoding: tokenization/encoding
V = {1 : ‘ ’, 2 : ‘a’, 3 : ‘e’, 4 : ‘f’, 5 : ‘g’, 6 : ‘h’, 7 : ‘i’,
8 : ‘k’, 9 : ‘m’, 10 : ‘n’, 11 : ‘p’, 12 : ‘s’, 13 : ‘u’,
14 : ‘ug’, 15 : ‘ p’, 16 : ‘hug’, 17 : ‘ pug’, 18 : ‘ pugs’,
19 : ‘un’, 20 : ‘ hug’}
With this vocabulary, can you represent (or, tokenize/encode):
• “apple”?
• No, there is no ‘l’ in the vocabulary
• “huge”?
• Yes - [16, 3]
• “ huge”?
• Yes - [20, 3]
• “ hugest”?
• No, there is no ‘t’ in the vocabulary
• “unassumingness”?
• Yes - [19, 2, 12, 12, 13, 9, 7, 10, 5, 10, 3, 12, 12]
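Encoding with this vocabulary can be sketched as replaying the merges in the order they were learned. This is a sketch under assumptions: the merge pairs below are inferred from the vocabulary (e.g., ‘hug’ = ‘h’ + ‘ug’, ‘ hug’ = ‘ ’ + ‘hug’) and are not stated on the slide, and the names `encode` and `decode` are hypothetical.

```python
VOCAB = [" ", "a", "e", "f", "g", "h", "i", "k", "m", "n", "p", "s", "u",
         "ug", " p", "hug", " pug", " pugs", "un", " hug"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB, start=1)}  # ids start at 1, as in V

# Assumed merge rules, in the order the merged tokens appear in V (ids 14..20).
MERGES = [("u", "g"), (" ", "p"), ("h", "ug"), (" p", "ug"),
          (" pug", "s"), ("u", "n"), (" ", "hug")]

def encode(text):
    """Return the list of token ids for text, or None if a character is not in V."""
    tokens = list(text)
    if any(ch not in TOKEN_TO_ID for ch in tokens):
        return None
    for left, right in MERGES:
        out, k = [], 0
        while k < len(tokens):
            if k + 1 < len(tokens) and tokens[k] == left and tokens[k + 1] == right:
                out.append(left + right)
                k += 2
            else:
                out.append(tokens[k])
                k += 1
        tokens = out
    return [TOKEN_TO_ID[t] for t in tokens]

def decode(ids):
    """Concatenate the token strings; no information is lost along the way."""
    return "".join(VOCAB[i - 1] for i in ids)

print(encode("apple"))           # None: 'l' is not in the vocabulary
print(encode("unassumingness"))  # [19, 2, 12, 12, 13, 9, 7, 10, 5, 10, 3, 12, 12]
```

Because decoding is plain concatenation, `decode(encode(s)) == s` for any string whose characters are all in V.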
Byte-pair encoding: tokenization/encoding
V = {1 : ‘ ’, 2 : ‘a’, 3 : ‘e’, 4 : ‘f’, 5 : ‘g’, 6 : ‘h’, 7 : ‘i’,
= “misshapenness”
Byte-pair encoding: properties
• Efficient to run (greedy vs. global optimization)
• Lossless compression
• Potentially some shared representations - e.g., the token “hug” could
be used both in “hug” and “hugging”
Weird properties of tokenizers
• Token != word
• Spaces are part of token
• “run” is a different token than “ run”
• Not invariant to case changes
• “Run” is a different token than “run”
• Tokenization fits statistics of your data
• e.g., while these words are multiple tokens…
• These words are all 1 token in GPT-3’s tokenizer!
• Why?
• Reddit usernames and certain code attributes appeared often enough in the corpus to surface as their own tokens!
Example from https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Other tokenization variants
Variants: no spaces in tokens
• The way we presented BPE, we included whitespace with the following word (e.g., “ pug”)
• This is most common in modern LMs
• However, in another BPE variant, you instead strip whitespace (e.g., “pug”) and add spaces between words at decoding time
• This was the original BPE paper’s implementation!
• Example:
• [“I”, “hug”, “pugs”] -> “I hug pugs” (w/out whitespace)
• [“I”, “ hug”, “ pugs”] -> “I hug pugs” (w/ whitespace)
Original (w/ whitespace) vs. Updated (w/out whitespace)
Required:
• Documents D
• Desired vocabulary size N (greater than chars in D)
Algorithm:
- Pre-tokenize D by splitting into words (split before whitespace/punctuation)
+ Pre-tokenize D by splitting into words
Variants: no spaces in tokens
• For sub-word tokens, need to add a “continue word” special character
• E.g., for the word “Tokenization”, if the subword tokens are “Token” and “ization”:
• W/out special character: [“Token”, “ization”] -> “Token ization”
• W/ special character #: [“Token”, “#ization”] -> “Tokenization”
• When decoding, if a token does not have the special character, add a space before it
• Example:
• [“I”, “li”, “#ke”, “to”, “hug”, “pug”, “#s”] -> “I like to hug pugs”
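The decoding rule can be sketched in a few lines of Python. This is a minimal sketch assuming the single-character marker “#” used in the example above; the function name `decode_no_space_tokens` is hypothetical.

```python
def decode_no_space_tokens(tokens, marker="#"):
    """Join tokens: marked tokens continue the current word, others start a new word."""
    text = ""
    for k, tok in enumerate(tokens):
        if tok.startswith(marker):
            text += tok[len(marker):]   # continuation: glue onto the previous token
        elif k == 0:
            text += tok                 # first token: no leading space
        else:
            text += " " + tok           # new word: add a space before it
    return text

print(decode_no_space_tokens(["I", "li", "#ke", "to", "hug", "pug", "#s"]))  # I like to hug pugs
print(decode_no_space_tokens(["Token", "#ization"]))                          # Tokenization
```

The decoder always inserts exactly one space between words, so whatever whitespace the original text had is not recoverable from the tokens.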
Variants: no spaces in tokens
• Loses some whitespace information (lossy compression!)
• E.g., Tokenize(“I eat cake.”) == Tokenize(“ I eat cake .”)
• Especially problematic for code (e.g., Python) - why?
(Example using GPT’s tokenizer, which does not include spaces in the token)
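The loss of whitespace information can be illustrated with a toy tokenizer. This is a sketch under assumptions: `tokenize_no_spaces` is a hypothetical word-level stand-in (a regular expression that drops all whitespace), not GPT’s actual tokenizer.

```python
import re

def tokenize_no_spaces(text):
    """Split into words and punctuation marks, discarding all whitespace."""
    return re.findall(r"\w+|[^\w\s]", text)

# Different spacing, identical tokens.
print(tokenize_no_spaces("I eat cake.") == tokenize_no_spaces(" I  eat   cake ."))  # True

# In Python, indentation is syntax: these two programs mean different things,
# yet they tokenize identically once whitespace is stripped.
inside_if = "if x:\n    y = 1\n    z = 2"
after_if = "if x:\n    y = 1\nz = 2"
print(tokenize_no_spaces(inside_if) == tokenize_no_spaces(after_if))  # True
```

In the second pair, whether `z = 2` runs depends on `x` in one program and not in the other, which is why stripping whitespace is especially harmful for code.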
Variants: no pre-tokenization
• In the variant we proposed, we start by splitting into words
• This guarantees that each token will be no longer than one word
• However, this does not work so well for character-based languages.
Why?
Variants: no pre-tokenization
• Instead, we could not pre-tokenize, and treat the entire document or
sentence as a single list of tokens
• Allows for tokens to span multiple words/characters
• Sometimes called SentencePiece tokenization* (Kudo, 2018)
* (not to be confused with the SentencePiece library, which is an implementation of many kinds of tokenization)
Paper: https://arxiv.org/abs/1808.06226
Library: https://github.com/google/sentencepiece
• Documents D
Algorithm:
- Pre-tokenize D by splitting into words
+ Do not pre-tokenize D
Variants: no pre-tokenization
• Allows sequences of words/characters to become tokens
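The effect of skipping pre-tokenization shows up directly in the pair counts that drive merging. This is a small sketch, assuming the whitespace-attached pre-tokenization presented earlier; the helper name `pair_counts` and the sample sentence are illustrative only.

```python
import re
from collections import Counter

def pair_counts(sequences):
    """Count adjacent token pairs within each sequence (never across sequences)."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

text = "in the end in the end"

# With pre-tokenization: each word (with its leading space) is a separate sequence.
words = re.findall(r" ?\S+", text)
with_pretok = pair_counts([list(w) for w in words])

# Without pre-tokenization: the whole sentence is one sequence.
no_pretok = pair_counts([list(text)])

print(with_pretok[("n", " ")])  # 0: this pair crosses a word boundary
print(no_pretok[("n", " ")])    # 2: now countable, so it could be merged
```

Once boundary-crossing pairs can be merged, repeated merges can produce tokens that span several words.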
Variants: byte-based
• Originally, we presented BPE as dealing with characters as the smallest unit
• However, there are many characters - especially if you want to support:
• character-based languages (e.g., Chinese has >100k characters!)
• non-alphanumeric characters like emojis (Unicode 15 has ~150k characters!)
• Instead, can initialize tokens as set of bytes! (e.g., with UTF-8*)
* Only 256 bytes! Each Unicode char is 1-4 bytes
Original (w/ characters) vs. Modified (w/ bytes)
Required:
• Documents D
• Desired vocabulary size N (greater than chars in D)
Algorithm:
• Pre-tokenize D by splitting into words (split before whitespace/punctuation)
- Initialize V as the set of characters in D
+ Initialize V as the set of bytes in D
- Convert D into a list of tokens (characters)
+ Convert D into a list of tokens (bytes)
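The byte-level starting point can be illustrated with Python’s built-in UTF-8 encoding. This is a minimal sketch; `to_byte_tokens` is a hypothetical helper name, and the sample characters are chosen only to show the 1 to 4 byte range.

```python
def to_byte_tokens(text):
    """Convert text into a list of byte tokens (integers in 0..255) via UTF-8."""
    return list(text.encode("utf-8"))

# One character can become 1, 2, 3, or 4 byte tokens.
for ch in ["a", "\u00e9", "\u4e2d", "\U0001F642"]:
    print(repr(ch), to_byte_tokens(ch))

# The base vocabulary never needs more than 256 entries, and decoding is lossless.
tokens = to_byte_tokens("hug \U0001F642")
print(tokens)
print(bytes(tokens).decode("utf-8") == "hug \U0001F642")  # True
```

Merges then proceed exactly as in character-level BPE, only over byte values instead of characters.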
Variants: WordPiece objective
• To merge, we selected the bigram with highest frequency
• This is the same as the bigram with highest probability p(v_i, v_j)!
• Instead, we could choose the bigram which would maximize the likelihood of the data after the merge is made (also called WordPiece!)
Original (BPE) vs. Modified (WordPiece)
…
- For the most frequent bigram v_i, v_j (maximizes p(v_i, v_j))
+ For the bigram v_i, v_j that would maximize likelihood of the training data once the change is made (maximizes p(v_i, v_j) / (p(v_i) p(v_j)))
Variants: WordPiece objective
• Maximizing the likelihood of the data after the merge is made means picking the bigram with the highest p(v_i, v_j) / (p(v_i) p(v_j))
• Maximizes the probability of the bigram, normalized by the probability of the unigrams
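The difference between the two objectives can be seen on a toy corpus. This is a sketch under assumptions: probabilities are estimated as relative frequencies of unigrams and adjacent bigrams, the corpus is made up for illustration, and `pair_scores` is a hypothetical helper.

```python
from collections import Counter

def pair_scores(docs):
    """Map each bigram to (p(v_i, v_j), p(v_i, v_j) / (p(v_i) p(v_j)))."""
    unigrams, bigrams = Counter(), Counter()
    for word in docs:
        unigrams.update(word)
        bigrams.update(zip(word, word[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        scores[(a, b)] = (p_ab, p_ab / (p_a * p_b))
    return scores

# 'h' and 'u' are frequent; 'q' and 'z' are rare but only ever occur together.
docs = [list("hug")] * 3 + [list("hum")] * 3 + [list("qz")] * 2
scores = pair_scores(docs)

bpe_choice = max(scores, key=lambda pair: scores[pair][0])
wordpiece_choice = max(scores, key=lambda pair: scores[pair][1])
print(bpe_choice)        # ('h', 'u'): the most frequent bigram
print(wordpiece_choice)  # ('q', 'z'): the highest normalized score
```

Normalizing by the unigram probabilities favors pairs whose parts rarely appear apart, even when the pair itself is not the most frequent.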
Variants: WordPiece encoding
At inference time, instead of applying the merge rules in order, tokens are
selected left-to-right greedily:
Encoding algorithm
Given string s and (unordered) vocab V ,
<latexit sha1_base64="HbazfJBF5dM5E53nV3qGREA3sbg=">AAAB8nicbVDLSsNAFL3xWeur6tJNsAiuSiK+lkU3LivYB6ShTKbTduhkJszcCCX0M9y4UMStX+POv3HSZqGtBwYO59zLnHuiRHCDnvftrKyurW9slrbK2zu7e/uVg8OWUammrEmVULoTEcMEl6yJHAXrJJqROBKsHY3vcr/9xLThSj7iJGFhTIaSDzglaKWgGxMcUSKy1rRXqXo1bwZ3mfgFqUKBRq/y1e0rmsZMIhXEmMD3EgwzopFTwablbmpYQuiYDFlgqSQxM2E2izx1T63SdwdK2yfRnam/NzISGzOJIzuZRzSLXi7+5wUpDm7CjMskRSbp/KNBKlxUbn6/2+eaURQTSwjV3GZ16YhoQtG2VLYl+IsnL5PWec2/ql0+XFTrt0UdJTiGEzgDH66hDvfQgCZQUPAMr/DmoPPivDsf89EVp9g5gj9wPn8Ak3eRdg==</latexit>
<latexit sha1_base64="NNhp4HiCDIt6HWe1guuCXXVMPqg=">AAAB6HicbVDLSgNBEOz1GeMr6tHLYBA8hV3xdQx68ZiAeUCyhNlJbzJmdnaZmRXCki/w4kERr36SN//GSbIHTSxoKKq66e4KEsG1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZwqhg0Wi1i1A6pRcIkNw43AdqKQRoHAVjC6m/qtJ1Sax/LBjBP0IzqQPOSMGivVda9UdivuDGSZeDkpQ45ar/TV7ccsjVAaJqjWHc9NjJ9RZTgTOCl2U40JZSM6wI6lkkao/Wx26IScWqVPwljZkobM1N8TGY20HkeB7YyoGepFbyr+53VSE974GZdJalCy+aIwFcTEZPo16XOFzIixJZQpbm8lbEgVZcZmU7QheIsvL5PmecW7qlzWL8rV2zyOAhzDCZyBB9dQhXuoQQMYIDzDK7w5j86L8+58zFtXnHzmCP7A+fwB4eWNAQ==</latexit>
<latexit sha1_base64="XHQTIRHnlBQPtzHv4o3Sgmq43IM=">AAAB/HicbVDLSsNAFJ3UV62vaJduBovgqiTiC0EounFZoS9IQplMJ+3QyYOZGyGE+ituXCji1g9x5984bbPQ1gMXDufcy733+IngCizr2yitrK6tb5Q3K1vbO7t75v5BR8WppKxNYxHLnk8UEzxibeAgWC+RjIS+YF1/fDf1u49MKh5HLcgS5oVkGPGAUwJa6pvVFr6+wa5gATjYlXw4Aq9v1qy6NQNeJnZBaqhAs29+uYOYpiGLgAqilGNbCXg5kcCpYJOKmyqWEDomQ+ZoGpGQKS+fHT/Bx1oZ4CCWuiLAM/X3RE5CpbLQ150hgZFa9Kbif56TQnDl5TxKUmARnS8KUoEhxtMk8IBLRkFkmhAqub4V0xGRhILOq6JDsBdfXiad07p9UT9/OKs1bos4yugQHaETZKNL1ED3qInaiKIMPaNX9GY8GS/Gu/Exby0ZxUwV/YHx+QMorJPX</latexit>
<latexit sha1_base64="NNhp4HiCDIt6HWe1guuCXXVMPqg=">AAAB6HicbVDLSgNBEOz1GeMr6tHLYBA8hV3xdQx68ZiAeUCyhNlJbzJmdnaZmRXCki/w4kERr36SN//GSbIHTSxoKKq66e4KEsG1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZwqhg0Wi1i1A6pRcIkNw43AdqKQRoHAVjC6m/qtJ1Sax/LBjBP0IzqQPOSMGivVda9UdivuDGSZeDkpQ45ar/TV7ccsjVAaJqjWHc9NjJ9RZTgTOCl2U40JZSM6wI6lkkao/Wx26IScWqVPwljZkobM1N8TGY20HkeB7YyoGepFbyr+53VSE974GZdJalCy+aIwFcTEZPo16XOFzIixJZQpbm8lbEgVZcZmU7QheIsvL5PmecW7qlzWL8rV2zyOAhzDCZyBB9dQhXuoQQMYIDzDK7w5j86L8+58zFtXnHzmCP7A+fwB4eWNAQ==</latexit>
• Let T := T + [ti ]
• Return T
Variants: unigram objective
• BPE starts with a small vocabulary (characters) and builds it up until it reaches the desired vocabulary size N
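The bottom-up growth described in the bullet above can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: the function name `train_bpe`, whitespace pre-tokenization, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def train_bpe(corpus, vocab_size):
    """Learn BPE merges until the vocabulary reaches vocab_size (N).

    Hypothetical helper for illustration; splits the corpus on whitespace.
    """
    # Each word is a tuple of symbols, starting from single characters.
    words = Counter(tuple(word) for word in corpus.split())
    vocab = {ch for word in words for ch in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        # Most frequent pair; ties broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite every word, replacing occurrences of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab, merges
```

Calling `train_bpe("low low low lower lowest", 9)` starts from the 7 distinct characters and stops after two merges, once the vocabulary holds 9 symbols.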
Examples of LLMs and their tokenizers
Model/Tokenizer | Objective | Spaces part of token? | Pre-tokenization | Smallest unit
GPT-2/3/4, ChatGPT, Llama(2), Falcon, … | BPE | Yes | Yes | Byte-level
Jurassic | BPE | Yes | No. “SentencePiece” - treat whitespace like char | Byte-level
Bert, DistilBert, Electra | WordPiece | No | Yes | Character-level
T5, ALBERT, XLNet, Marian | Unigram | Yes | No. “SentencePiece” - treat whitespace like char* | Character-level
*For non-English languages
Labels: SentencePiece (treat whitespace like char); BPE (w/ spaces)
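To make the "Smallest unit" column concrete, here is a small sketch contrasting character-level and byte-level base units. The helper name `smallest_units` is hypothetical, and UTF-8 is assumed as the byte encoding.

```python
def smallest_units(text):
    """Return (character-level units, byte-level units) for a string.

    Hypothetical helper for illustration; assumes UTF-8 encoding.
    """
    chars = list(text)                       # one unit per Unicode code point
    byte_units = list(text.encode("utf-8"))  # one unit per byte, values 0..255
    return chars, byte_units


# "\u00ef" is a single code point (i with diaeresis) but two UTF-8 bytes.
chars, byte_units = smallest_units("na\u00efve")
```

A byte-level tokenizer needs only 256 base symbols to cover any input, while a character-level one needs a base symbol for every character it should represent.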
Next lecture: Recurrent neural networks