This document discusses different types of text data and methods for processing text for machine learning models. It covers the bag-of-words representation, removing stopwords, stemming and lemmatization, and topic modeling, and closes with a summary.
Machine Learning Road Map
◼ Overview
◼ Types of Data Represented as Strings
◼ Bag of Words
◼ Stopwords
◼ Stemming & Lemmatization
◼ Topic Modeling
◼ Summary
Overview
◼ If we want to classify an email message as either a legitimate email or spam, the content of the email will certainly contain important information for this classification task.
◼ Or maybe we want to learn about the opinion of a politician on the topic of immigration; an individual's speeches or tweets might provide useful information.
◼ In customer service, we often want to find out whether a message is a complaint or an inquiry.
◼ We can use the subject line and content of a message to automatically determine the customer's intent, which allows us to send the message to the appropriate department, or even send a fully automatic reply.
◼ Text data is usually represented as strings, made up of characters. In any of the examples just given, the length of the text data will vary.
◼ This makes text clearly very different from the numeric features we've discussed so far; we will need to process the data before we can apply our machine learning algorithms to it.
Types of Data Represented as Strings
There are four kinds of string data you might see:
◼ Categorical data
◼ Free strings that can be semantically mapped to categories
◼ Structured string data
◼ Text data
Categorical Data
◼ Categorical data is data that comes from a fixed list. Say you collect data via a survey where you ask people their favorite color, with a drop-down menu that allows them to select from "red," "green," "blue," "yellow," "black," "white," "purple," and "pink."
◼ This will result in a dataset with exactly eight different possible values, which clearly encode a categorical variable.

Free Strings
◼ Now imagine that instead of providing a drop-down menu, you provide a text field for users to enter their own favorite colors.
◼ Many people might respond with a color name like "black" or "blue." Others might make typographical errors, use different spellings like "gray" and "grey," or use more evocative and specific names like "midnight blue."
◼ You will also have some very strange entries, which are hard to map to colors automatically (or at all). The responses you can obtain from a text field belong to the second category in the list: free strings that can be semantically mapped to categories.
Structured String Data
◼ Often, manually entered values do not correspond to fixed categories but still have some underlying structure, like addresses, names of places or people, dates, telephone numbers, or other identifiers.
◼ These kinds of strings are often very hard to parse, and their treatment is highly dependent on context and domain.
◼ A systematic treatment of these cases is beyond the scope of this course.
Text Data
◼ The final category of string data is freeform text data that consists of phrases or sentences.
◼ Examples include tweets, chat logs, and hotel reviews, as well as the collected works of Shakespeare, the content of Wikipedia, or the Project Gutenberg collection of 50,000 e-books.
◼ All of these collections contain information mostly as sentences composed of words.
◼ For simplicity's sake, let's assume all our documents are in one language, English.
◼ In the context of text analysis, the dataset is often called the corpus, and each data point, represented as a single text, is called a document.
◼ These terms come from the information retrieval (IR) and natural language processing (NLP) communities, both of which deal mostly with text data.
Bag of Words
◼ One of the simplest but most effective and commonly used ways to represent text for machine learning is the bag-of-words representation.
◼ When using this representation, we discard most of the structure of the input text, like chapters, paragraphs, sentences, and formatting, and only count how often each word appears in each text in the corpus.
◼ Discarding the structure and counting only word occurrences leads to the mental image of representing text as a "bag."

Computing the bag-of-words representation for a corpus of documents consists of the following three steps (a code sketch follows this list):
◼ Tokenization. Split each document into the words that appear in it (called tokens), for example by splitting on whitespace and punctuation.
◼ Vocabulary building. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).
◼ Encoding. For each document, count how often each of the words in the vocabulary appears in this document.
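These three steps are implemented together by scikit-learn's CountVectorizer. A minimal sketch on a made-up two-document corpus (the corpus and variable names are illustrative, not from the original slides):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus: each string is one "document"
bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

# fit() performs tokenization and vocabulary building
vect = CountVectorizer()
vect.fit(bards_words)
print("Vocabulary:", vect.vocabulary_)

# transform() performs the encoding step: a sparse matrix of
# per-document word counts
bag_of_words = vect.transform(bards_words)
print("Dense representation:\n", bag_of_words.toarray())
```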
Stopwords
◼ Another way to get rid of uninformative words is to discard words that are too frequent to be informative.
◼ There are two main approaches: using a language-specific list of stopwords, or discarding words that appear too frequently.
◼ scikit-learn has a built-in list of English stopwords in the feature_extraction.text module:
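The code listing that followed on the original slide is not preserved; a minimal sketch of what inspecting and applying the built-in list typically looks like:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

print("Number of stop words:", len(ENGLISH_STOP_WORDS))
print("Every 20th stopword:", sorted(ENGLISH_STOP_WORDS)[::20])

# The built-in list is applied during vectorization by name
vect = CountVectorizer(stop_words="english")
```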
Where to Find a Stop Word List?
There are established stop word lists that you can easily plug in and use. Some of these lists come out of NLP research work, and some are manually curated by different people. Here are a few for you to try in different languages (a sketch of plugging one in follows the list):
◼ English stop words (https://github.com/igorbrigadir/stopwords/blob/master/en/terrier.txt)
◼ Russian stop words (https://github.com/stopwords-iso/stopwords-ru/blob/master/stopwords-ru.txt)
◼ French stop words (https://github.com/stopwords-iso/stopwords-fr/blob/master/stopwords-fr.txt)
◼ Spanish stop words (https://github.com/stopwords-iso/stopwords-es/blob/master/stopwords-es.txt)
◼ German stop words (https://github.com/stopwords-iso/stopwords-de/blob/master/stopwords-de.txt)
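A brief sketch of plugging in one of these lists, assuming the English (terrier) list linked above has been saved locally; the filename is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the terrier.txt list was downloaded and saved locally
with open("terrier.txt", encoding="utf-8") as f:
    custom_stopwords = [line.strip() for line in f if line.strip()]

# Pass the custom list in place of the built-in "english" list
vect = CountVectorizer(stop_words=custom_stopwords)
```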
Stemming and Lemmatization
◼ As we know, the vocabulary includes words like "replace", "replaced", "replacement", "replaces", and "replacing", which are different verb forms plus a noun relating to the verb "to replace."
◼ Just as with singular and plural forms of a noun, treating different verb forms and related words as distinct tokens is disadvantageous for building a model that generalizes well.
◼ For the purposes of a bag-of-words model, the semantics of "replace" and "replaces" are so close that distinguishing them will only increase overfitting, and not allow the model to fully exploit the training data.
Cont..
◼ This problem can be overcome by representing each word using its word stem, which involves identifying all the words that share that stem. If this is done with a rule-based heuristic, like dropping common suffixes, the process is usually referred to as stemming.
◼ If instead a dictionary of known word forms is used (an explicit and human-verified system), and the role of the word in the sentence is taken into account, the process is referred to as lemmatization, and the standardized form of the word is called the lemma.
◼ Both processing methods, lemmatization and stemming, are forms of normalization that try to extract some normal form of a word. Another interesting case of normalization is spelling correction, which can be helpful in practice but is outside the scope of this course.

Cont..
◼ To get a better understanding of normalization, let's compare a method for stemming, the Porter stemmer (a widely used collection of heuristics, here imported from the NLTK package), to lemmatization as implemented in the spacy package:
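The slide's code listing is not preserved; below is a sketch of the comparison setup, assuming spacy's small English model has been installed (python -m spacy download en_core_web_sm):

```python
import nltk
import spacy

# Load spacy's English pipeline (assumes "en_core_web_sm" is installed)
en_nlp = spacy.load("en_core_web_sm")
# Instantiate nltk's Porter stemmer
stemmer = nltk.stem.PorterStemmer()

def compare_normalization(doc):
    # Tokenize the document with spacy
    doc_spacy = en_nlp(doc)
    # Print the lemmas found by spacy
    print("Lemmatization:")
    print([token.lemma_ for token in doc_spacy])
    # Print the tokens produced by the Porter stemmer
    print("Stemming:")
    print([stemmer.stem(token.norm_.lower()) for token in doc_spacy])
```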
Cont..
◼ We will compare lemmatization and the Porter stemmer on a sentence designed to show some of the differences:
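A reconstruction of the slide's missing input and output, using the compare_normalization sketch above (the exact output depends on the spacy and NLTK versions):

```python
compare_normalization("Our meeting today was worse than yesterday, "
                      "I'm scared of meeting the clients tomorrow.")
# Approximate output:
# Lemmatization:
# ['our', 'meeting', 'today', 'be', 'bad', 'than', 'yesterday', ',',
#  'i', 'be', 'scared', 'of', 'meet', 'the', 'client', 'tomorrow', '.']
# Stemming:
# ['our', 'meet', 'today', 'wa', 'wors', 'than', 'yesterday', ',',
#  'i', "'m", 'scare', 'of', 'meet', 'the', 'client', 'tomorrow', '.']
```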
◼ Stemming is always restricted to trimming the word to a stem, so "was" becomes "wa", while lemmatization can retrieve the correct base verb form, "be". Similarly, lemmatization can normalize "worse" to "bad", while stemming produces "wors".
Cont..
◼ Another major difference is that stemming reduces both occurrences of "meeting" to "meet".
◼ Using lemmatization, the first occurrence of "meeting" is recognized as a noun and left as is, while the second occurrence is recognized as a verb and reduced to "meet".
◼ In general, lemmatization is a much more involved process than stemming, but it usually produces better results when used for normalizing tokens for machine learning.
Topic Modeling and Document Clustering
◼ One particular technique that is often applied to text data is topic modeling, an umbrella term for the task of assigning each document to one or multiple topics, usually without supervision.
◼ A good example is news data, which might be categorized into topics like "politics," "sports," "finance," and so on.
◼ If each document is assigned a single topic, this is the task of clustering, which we discussed in Chapter 3.

Cont..
◼ If instead each document can have more than one topic, the task relates to decomposition methods: each of the components we learn then corresponds to one topic, and the coefficients of the components in the representation of a document tell us how strongly related that document is to a particular topic.
◼ Often, when people talk about topic modeling, they refer to one particular decomposition method called Latent Dirichlet Allocation (often LDA for short).

Latent Dirichlet Allocation (LDA)
◼ Intuitively, the LDA model tries to find groups of words (the topics) that appear together frequently. LDA also requires that each document can be understood as a "mixture" of a subset of the topics.
◼ Going back to the example of news articles, we might have a collection of articles about sports, politics, and finance, written by two specific authors.
◼ In a politics article, we might expect to see words like "governor," "vote," and "party," while in a sports article we might expect words like "team," "score," and "season." Words in each of these groups will likely appear together, while it is less likely that, for example, "team" and "governor" will appear together.
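A minimal sketch of fitting LDA with scikit-learn's LatentDirichletAllocation; the toy corpus, the number of topics, and the preprocessing choices here are illustrative assumptions, not part of the original slides:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative toy corpus; in practice this would be, e.g., news articles
docs = ["the governor announced a new vote in the party",
        "the team celebrated the score at the end of the season",
        "the party will vote on the budget the governor proposed",
        "a late score gave the team the win this season"]

# Bag-of-words counts, dropping English stopwords
vect = CountVectorizer(stop_words="english")
X = vect.fit_transform(docs)

# Fit LDA with an assumed number of topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic mixtures

# Show the most heavily weighted words for each topic
feature_names = vect.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:4]]
    print(f"Topic {topic_idx}:", top_words)
```

Each row of doc_topics gives the topic mixture for the corresponding document, connecting the "mixture of topics" idea above to the learned components.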
Summary
◼ In this chapter we talked about the basics of processing text, also known as natural language processing (NLP), with an example application classifying movie reviews.
◼ The tools discussed here should serve as a great starting point when trying to process text data.
◼ In particular, for text classification tasks such as spam and fraud detection or sentiment analysis, bag-of-words representations provide a simple and powerful solution.
Cont..
◼ As is often the case in machine learning, the representation of the data is key in NLP applications, and inspecting the tokens and n-grams that are extracted can give powerful insights into the modeling process.
◼ In text-processing applications, it is often possible to introspect models in a meaningful way, as we saw in this chapter, for both supervised and unsupervised tasks.
◼ You should take full advantage of this ability when using NLP-based methods in practice.

Thank You!