Introduction To Regular Expressions: Katharine Jarmul
Introduction to regular expressions
Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python
In [1]: import re
pattern   matches    example
\d        digit      9
\s        space      ' '
.*        wildcard   'username74'
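The three patterns in the table can be tried out with `re.findall` and `re.match`. This is a minimal sketch rather than code from the slides; the sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only to exercise the three patterns.
sample = "username74 has 9 lives"

# \d matches a single digit, so each digit is returned separately.
digits = re.findall(r"\d", sample)

# \s matches a single whitespace character.
spaces = re.findall(r"\s", sample)

# .* greedily matches any run of characters except a newline,
# so on a one-line string it consumes everything.
wildcard = re.match(r".*", sample).group()

print(digits)
print(spaces)
print(wildcard)
```

Raw strings (`r"..."`) keep Python from interpreting the backslashes before the regex engine sees them.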
Python's re Module
re module
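As a rough illustration of the module, the sketch below calls four commonly used `re` functions: `re.split`, `re.findall`, `re.search` and `re.match`. Each takes the pattern first and the string second. The sample strings are assumptions, not taken from the slides.

```python
import re

# Pattern comes first, the string to process comes second.

# split: break a string wherever the pattern matches.
parts = re.split(r"\s+", "Split on spaces.")

# findall: return every non-overlapping match as a list.
words = re.findall(r"\w+", "Let's write RegEx!")

# search: scan the whole string and return the first match (or None).
found = re.search(r"cd", "abcde")

# match: only succeeds if the pattern matches at the very beginning.
at_start = re.match(r"cd", "abcde")

print(parts)
print(words)
print(found.group(), found.start())
print(at_start)
```

The difference between `search` and `match` is the usual stumbling block: here `search` finds `cd` at index 2, while `match` returns `None` because the string does not start with `cd`.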
Let's practice!
Introduction to tokenization
What is tokenization?
Turning a string or document into tokens (smaller chunks)
One step in preparing a text for NLP
Many different theories and rules
You can create your own rules using regular expressions
Some examples:
Breaking out words or sentences
Separating punctuation
Separating all hashtags in a tweet
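The three examples above can each be written as a small regular-expression rule. The sketch below is one possible set of rules, not the course's own; the sample strings and the patterns are assumptions chosen for illustration.

```python
import re

# Hypothetical inputs for illustration.
greeting = "Hello, world!"
paragraph = "It works. Does it? Yes!"
tweet = "Loving the #nlp course, thanks @datacamp! #python"

# Breaking out words and separating punctuation:
# a token is either a run of word characters or one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", greeting)

# Breaking out sentences: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", paragraph)

# Separating all hashtags in a tweet: '#' followed by word characters.
hashtags = re.findall(r"#\w+", tweet)

print(tokens)
print(sentences)
print(hashtags)
```

Rules this simple break on abbreviations such as "Dr." or on hashtags with hyphens, which is one reason many different tokenization rules exist.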
nltk library
nltk: natural language toolkit
In [1]: from nltk.tokenize import word_tokenize
Why tokenize?
Easier to map part of speech
Matching common words
Removing unwanted tokens
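Two of these reasons, removing unwanted tokens and matching common words, can be sketched on a hand-written token list. The list is the tokenized form of "I don't like Sam's shoes."; the filtering rule and the tiny `stop_words` set are hypothetical choices, not something the slides prescribe.

```python
from collections import Counter

# Tokens for the sentence "I don't like Sam's shoes."
tokens = ["I", "do", "n't", "like", "Sam", "'s", "shoes", "."]

# Removing unwanted tokens: keep only purely alphabetic tokens.
# This drops the punctuation and the contraction pieces "n't" and "'s".
kept = [t for t in tokens if t.isalpha()]

# Matching common words: filter against a (hypothetical) stop-word set.
stop_words = {"i", "do"}
content = [t for t in kept if t.lower() not in stop_words]

# Counting case-folded tokens makes frequent words easy to spot.
counts = Counter(t.lower() for t in kept)

print(kept)
print(content)
print(counts.most_common(2))
```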
"I don't like Sam's shoes."
"I", "do", "n't", "like", "Sam", "'s", "shoes", "."
Let's practice!
Advanced tokenization with regex
In [2]: my_str = 'match lowercase spaces nums like 12, but no commas'
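The string above describes what it is meant to test: lowercase letters, spaces and digits should match, commas should not. A character class is one way to express that. The pattern below is an assumption inferred from the wording of `my_str`, not a line recovered from the slides.

```python
import re

my_str = 'match lowercase spaces nums like 12, but no commas'

# [a-z0-9 ]+ means: one or more characters that are a lowercase letter,
# a digit, or a space. The comma is not in the class, so matching stops there.
m = re.match(r'[a-z0-9 ]+', my_str)

print(m.group())
print(my_str[m.end()])  # the first character that was not matched
```

Because `re.match` anchors at the start of the string, everything after the comma is ignored rather than returned as a second match; `re.findall` with the same pattern would return both pieces.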
Let's practice!
In [3]: plt.show()
Generated Histogram
In [5]: plt.hist(word_lengths)
Out[5]: (array([ 2., 0., 1., 0., 0., 0., 3., 0., 0., 1.]),
         array([ 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ]),
         <a list of 10 Patch objects>)
In [6]: plt.show()
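The definition of `word_lengths` does not survive in this transcript. The sketch below shows one way such a list could be built and tallied using only the standard library. The sentence is an assumption, picked because its token lengths are consistent with the counts in `Out[5]` (two tokens of length 1, one of length 2, three of length 4, one of length 6), and the regex tokenizer stands in for whichever tokenizer was actually used.

```python
import re
from collections import Counter

# Assumed input sentence; not recovered from the slides.
text = "This is a pretty cool tool!"

# Simple regex tokenizer: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# One length per token; this is the kind of list passed to plt.hist.
word_lengths = [len(w) for w in words]

# Tally how many tokens have each length.
length_counts = Counter(word_lengths)

# Text rendering of the histogram, one row per observed length.
for length in sorted(length_counts):
    print(length, "#" * length_counts[length])
```

Passing this `word_lengths` list to `plt.hist` would draw the same distribution as a bar chart.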
Let's practice!