Midterm 1
● more test.txt: Display test.txt one screen at a time
● grep pattern filename.txt: Find the lines that contain the given pattern
● man sort | grep file | wc: Get the manual page of sort, display the lines
containing the word "file", and show the line/word/character count
● sed 's/ /\'$'\n/g' < scarletLetter.txt | sort | sed '/^$/d' | uniq -c |
sort -nr | head -10: Replace all spaces with newlines in the
scarletLetter.txt file, sort the words alphabetically, delete all empty
lines, count the frequency of each word, sort the frequencies numerically
in descending order, and view the top 10
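The last pipeline can be sketched in Python; the function name and the inline sample text are illustrative stand-ins for scarletLetter.txt:

```python
from collections import Counter

def top_words(text, n=10):
    """Mimic the shell pipeline: split on whitespace, drop empties,
    count each word, and sort by descending frequency."""
    words = text.split()  # split() already discards empty strings
    return Counter(words).most_common(n)

# Usage with an inline sample instead of the file:
sample = "the scarlet letter the letter the"
print(top_words(sample, 3))  # [('the', 3), ('letter', 2), ('scarlet', 1)]
```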
Laws of Text
There are certain statistical characteristics of text that are consistent → we can apply laws to
them
→ These laws apply to different languages/text domains
Text Processing
● Token:
○ Instance of a word occurring in a doc. Example: Bush, bush, friends
○ Tokenizing: figuring out how to divide up the documents into fragments
■ Each fragment is an indexing unit
■ Can be challenging in English and other languages, e.g., whether to:
● lowercase everything
● keep famous names distinct (e.g., Bush vs. bush)
● split on hyphens
○ Term: entry in dictionary (unique tokens)
● Text preprocessing:
○ Remove punctuations
○ Stopping
■ Exclude the most common words from the dictionary
■ But is becoming less necessary as query optimization and compression
techniques advance
○ Stemming
■ Algorithmic (e.g., Porter): produces many false positives (detects a
relationship where there is none, e.g., conflating "I" and "is")
■ Dictionary-based: check words against a large dictionary of relations →
static (a drawback, since language evolves)
■ Combined: look up → if not found, strip the suffix → look up again
● Lower false-positive rate
● Higher false-negative rate
○ Lowercase
○ Normalization: find a many-to-one mapping from tokens to terms
■ More matched documents → less precision
■ Smaller dictionary → less specificity (cannot differentiate between
plurals and non-plurals)
■ Exact-phrase quotes may no longer match after normalization
○ Challenges:
■ Language issues:
● Chinese doesn’t split words by spaces
● German compound nouns are not segmented
● Japanese has multiple alphabets
● Arabic is written from right to left, numbers left to right
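The preprocessing steps above (punctuation removal, lowercasing, stopping) can be sketched as a small Python pipeline; the stopword set and function name are illustrative:

```python
import string

STOPWORDS = {"the", "a", "of", "and"}  # tiny illustrative sample

# Translation table that deletes every punctuation character
punc_table = str.maketrans('', '', string.punctuation)

def preprocess(doc):
    """Strip punctuation, lowercase, tokenize on whitespace, drop stopwords."""
    tokens = doc.translate(punc_table).lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The Scarlet Letter, by Nathaniel Hawthorne."))
# ['scarlet', 'letter', 'by', 'nathaniel', 'hawthorne']
```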
Dict methods:
● keys(): returns keys as a dict_keys view object → cast to a list to index it
● values(): returns values as a dict_values view object
● items(): returns (key, value) pairs as a dict_items view object
● a.update(b): update content of dict a with b
● copy(): create a copy of the dict
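A minimal demonstration of these dict methods (the sample dicts are illustrative):

```python
a = {"x": 1, "y": 2}
b = {"y": 20, "z": 30}

print(list(a.keys()))    # cast the dict_keys view to a list: ['x', 'y']
print(list(a.values()))  # [1, 2]
print(list(a.items()))   # [('x', 1), ('y', 2)]

c = a.copy()   # shallow copy; c stays independent of a
a.update(b)    # overwrite/extend a with b's entries
print(a)       # {'x': 1, 'y': 20, 'z': 30}
print(c)       # {'x': 1, 'y': 2} -- the copy is unchanged
```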
String methods:
● strip(): remove leading and trailing whitespace (including the newline at
the end of a line)
● rstrip(): remove only trailing whitespace/newline
● puncTranslator = str.maketrans('', '', string.punctuation): Create translation table for
removing punctuation
● lowercaseTranslator = str.maketrans(string.ascii_uppercase,
string.ascii_lowercase): Create a translation table to replace upper- with lower-case
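These translators can be chained with strip() to clean a line (the sample line is illustrative):

```python
import string

# Translation tables from the notes above
puncTranslator = str.maketrans('', '', string.punctuation)
lowercaseTranslator = str.maketrans(string.ascii_uppercase,
                                    string.ascii_lowercase)

line = "  Hello, World!\n"
print(line.strip())         # 'Hello, World!' -- leading/trailing whitespace gone
print(repr(line.rstrip()))  # '  Hello, World!' -- only trailing whitespace gone

cleaned = line.translate(puncTranslator).translate(lowercaseTranslator).strip()
print(cleaned)              # 'hello world'
```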
Tf-idf weighting
● Increases with
○ tf (term frequency within the document)
○ The rarity of the term in the collection (idf)
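One common formulation multiplies tf by a log-scaled inverse document frequency; this is one variant, and the function name and log base are assumptions:

```python
import math

def tf_idf(tf, df, N):
    """tf: term count in the doc; df: number of docs containing the term;
    N: total docs. The weight grows with tf and with rarity (low df -> high idf)."""
    return tf * math.log10(N / df)

# A term appearing 3 times in a doc, found in 10 of 1000 docs:
print(tf_idf(3, 10, 1000))  # 3 * log10(100) = 6.0
```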
Documents as vectors
● Think of terms as axes
● Documents are vectors or points in that multidimensional space
○ Query as well
⇒ Use the cosine of the angle between the document and query vectors to rank
similarity
● Then return the top K documents to the user
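Cosine ranking can be sketched as follows (the vector values are illustrative term weights):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

doc = [1.0, 2.0, 0.0]    # weights on three term axes
query = [1.0, 0.0, 0.0]
print(round(cosine(doc, query), 3))  # 0.447 -- dot=1, norms sqrt(5) and 1
```

To serve a query, compute this score against every document vector, sort in descending order, and return the top K.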