Midterm 1

The document is a Unix tutorial that covers basic commands for file management, text processing, and data manipulation. It also discusses statistical laws of text, text preprocessing techniques, and the use of Python dictionaries for file handling. Additionally, it explains the concept of tf-idf weighting and how to represent documents as vectors for similarity ranking.

Unix Tutorial

ls -F Display extra markers that signify the type of each item (e.g., / for directories)

ls *.txt List all .txt files

touch file.txt Create an empty file


Remember when naming a file, avoid:
● Spaces
● / : reserved for pathnames
● Names beginning with - : reserved for flags

pwd (not cwd) Return current working dir (absolute path)

unzip data.zip Unzip a compressed file

cp source target Copy source to target
● Target can be the name of a new file, or a directory

mv source target Rename/move source to target (so you end up with one file only)

rm -i file Delete the file, but prompt for confirmation before deletion

clear Clear the current terminal window

wc test.txt Return the number of lines, words, and bytes contained in the input file(s)

more test.txt
less test.txt View a file one screen at a time
● Search by typing /keyword
● Type n to jump forward to the next occurrence of the keyword
● Type N to jump backward to the previous occurrence of the keyword

| (pipe) Send the output of one command to the input of another

Filter commands: Transform input


●​ more
●​ less
●​ grep
●​ sort
●​ uniq
sort file.txt Sort the lines of the .txt file

uniq file.txt Filter out adjacent repeated lines in a file
● uniq -c file.txt ● Returns the number of occurrences of each line in the text (repeated lines must be adjacent, so sort first)

grep pattern filename.txt Find the lines that contain the given pattern
● man sort | grep file | wc ● Get the manual page of sort, keep only the lines containing the word "file", and display their line/word/byte counts

head -3 filename.txt View the first 3 lines of filename

tail -3 filename.txt View the last 3 lines of filename (default, without -3, is the last 10)

sed 's/pattern/replacement/flags' ● s: the sed substitute command
○ d: the delete command (for deleting matching lines)
● pattern is the string to replace
● replacement is the string to replace it with
● flags are optional flags to use (e.g., g to replace all occurrences on a line)

● sed 's/scarlet/*PURPLE*/g' < scarletLetter.txt | grep PURPLE | head
Replace all instances of scarlet with *PURPLE* in the scarletLetter.txt file, then search for lines containing PURPLE and view the first 10 lines

● sed 's/ /\'$'\n/g' < scarletLetter.txt | sort | sed '/^$/d' | uniq -c | sort -nr | head -10
Replace all whitespace with newlines in the scarletLetter.txt file, sort the words alphabetically, delete all empty lines, display the frequency of each word, sort the frequencies numerically in descending order, and view the top 10.
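A rough Python equivalent of this word-frequency pipeline (a sketch only; it assumes scarletLetter.txt is in the current directory and splits on whitespace):

import collections

with open('scarletLetter.txt', 'r') as f:
    words = f.read().split()                 # split on whitespace, like turning spaces into newlines

counts = collections.Counter(words)          # count occurrences of each word (like uniq -c)
for word, count in counts.most_common(10):   # top 10 by frequency, like sort -nr | head -10
    print(count, word)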

Laws of Text

There are certain statistical characteristics of text that are consistent → we can apply laws to them
→ These laws hold across different languages and text domains

● Frequency of each word:
○ When we graph frequency vs. rank, we obtain a steep, heavy-tailed (power-law) decay
○ Zipf's Law: frequency of a word ≈ [a constant] × [total word count] / [rank of the word]
■ Not accurate at the tails
● How fast does the vocabulary grow with the size of the corpus/text?
○ Heaps' Law: at each position in the text, record the number of unique words seen so far vs. the total number of words; the vocabulary grows roughly as K × N^β for constants K and β
■ Accurate for large N
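A minimal sketch for checking both laws empirically (tokens is assumed to be a Python list of words, e.g. produced by the pipeline above):

import collections

def zipf_table(tokens, top=10):
    # Under Zipf's law, frequency * rank should stay roughly constant across ranks.
    counts = collections.Counter(tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(rank, word, freq, freq * rank)

def heaps_curve(tokens):
    # Under Heaps' law, vocabulary size grows roughly as K * N^beta with the number of tokens N.
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))   # (tokens seen so far, unique words so far)
    return curve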

Text Processing

● IR systems process the information need and the documents as follows:
○ Convert them into some representation (like metadata)
■ No context: bag of words
■ word2vec: ML algorithm that models the probability of seeing one word next to another
○ Parse both the documents and the information need in the same manner
■ Document → Tokenization → Text preprocessing → Indexed objects
■ Information need → Query
○ Comparison
○ Retrieved objects

● Token:
○ Instance of a word occurring in a doc. Example: Bush, bush, friends
○ Tokenizing: figuring out how to divide the documents into fragments
■ Each fragment is an indexing unit
■ Can be challenging in English and other languages (e.g., whether to
● lowercase everything
● handle famous names separately
● split on hyphens)
○ Term: entry in the dictionary (unique tokens)
● Text preprocessing (a small Python sketch follows this list):
○​ Remove punctuations
○​ Stopping
■​ Exclude commonest words from dictionary
■​ But is becoming less necessary as query optimization and compression
techniques advance
○ Stemming
■ Algorithmic (e.g., Porter): produces many false positives (detects a relationship where there is none, e.g., "I" and "is")
■ Dictionary-based: check words against a large dictionary of relations → static (a problem especially since language evolves)
■ Combined: look up → if not found, strip a suffix → look up again
● Lower false positives
● Higher false negatives
○​ Lowercase
○​ Normalization: find a many to one mapping from tokens to terms
■​ More matched documents → Less precision
■​ Smaller dictionary → Less specificity (cannot differentiate between
plurals and non-plurals)
■​ Exact quotes
○​ Challenges:
■​ Language issues:
●​ Chinese doesn’t split words by spaces
● German compound nouns are not segmented
●​ Japanese has multiple alphabets
●​ Arabic is written from right to left, numbers left to right
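A small Python sketch of the preprocessing steps above (lowercasing, punctuation removal, stopping, stemming). It assumes the NLTK package is available for Porter stemming, and the stop list is a tiny illustrative one, not a standard set:

import string
from nltk.stem import PorterStemmer

STOP_WORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'is'}   # tiny example stop list

def preprocess(text):
    stemmer = PorterStemmer()
    text = text.lower()                                                 # lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))    # remove punctuation
    tokens = text.split()                                               # tokenize on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]                 # stopping
    return [stemmer.stem(t) for t in tokens]                            # stemming -> terms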

● Inverted index: map a word to the documents it appears in (see the sketch below)
○ Term: entry in the dictionary (normalized). Example: bush, friend
○ Type: class of all tokens containing the same character sequence (Example: "to be or not to be" has 6 tokens, 4 types)
Python Dictionary
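A minimal inverted-index sketch using a plain dict (doc_terms is an assumed name for a mapping from document ID to its list of preprocessed terms, e.g. the output of preprocess() above):

def build_inverted_index(doc_terms):
    # Map each term to the set of document IDs it appears in.
    index = {}
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

# Example: build_inverted_index({1: ['bush', 'friend'], 2: ['friend']})['friend'] -> {1, 2}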
Python Dictionary

●​ keys(): returns keys as dict_keys view object → have to cast to list to use
●​ values(): returns values as dict_values view object
● items(): returns (key, value) pairs as a dict_items view object
●​ a.update(b): update content of dict a with b
●​ copy(): create a copy of the dict
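A quick illustration of these methods (the dictionary contents are made up):

d = {'to': 2, 'be': 2}
print(list(d.keys()))            # cast the dict_keys view to a list: ['to', 'be']
print(list(d.items()))           # [('to', 2), ('be', 2)]
d.update({'or': 1, 'to': 3})     # adds 'or' and overwrites the existing key 'to'
backup = d.copy()                # a copy that is unaffected by later changes to d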

3 ways to read the file: Assume f = open('my_file.txt', 'r')


●​ f.read(): return the entire content as a single string
●​ f.readline(): return the next line
○ Use line != '' as a condition for the loop
●​ f.readlines(): return a list of lines
○​ Not recommended for large files since main memory may not have enough space
for it
○​ Each line has \n appended to it
●​ Just do for line in f → this is calling readline() under the hood
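A short example of the recommended line-by-line pattern (my_file.txt is just a placeholder name; a with block is used so the file is closed automatically):

with open('my_file.txt', 'r') as f:
    for line in f:                  # reads one line at a time, so large files fit in memory
        line = line.rstrip('\n')    # drop the trailing newline before processing
        print(line)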

To write to a file: Assume f = open('my_file.txt', 'w')


● f.write('some string\nNew line')

String methods:
● strip(): remove leading and trailing whitespace, including the newline at the end of a line
● rstrip(): strip only trailing whitespace and the newline
● puncTranslator = str.maketrans('', '', string.punctuation): Create a translation table for removing punctuation
● lowercaseTranslator = str.maketrans(string.ascii_uppercase, string.ascii_lowercase): Create a translation table to replace upper- with lower-case
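Using the translation tables together (the sentence is just an example string):

import string

puncTranslator = str.maketrans('', '', string.punctuation)
sentence = 'Hello, World!'
cleaned = sentence.translate(puncTranslator).lower()   # 'hello world'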

Tf-idf weighting

● Tf: the frequency of the term in a document
● Idf: the inverse document frequency — log of (total number of documents / number of documents containing the term), so it is high for rare terms and low for common ones

The weight is the product of the two: tf-idf = tf × idf

● Increases with
○ tf
○ The rarity of the term in the collection

● Highest when the term occurs frequently in a small number of documents
● Lower when the term occurs less frequently, or occurs in many documents
● Lowest when the term occurs in virtually all of the documents
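A minimal tf-idf sketch (doc_terms maps document IDs to term lists, as in the inverted-index sketch; using raw counts and a natural log is one common variant, not the only one):

import math

def tf_idf(term, doc_id, doc_terms):
    tf = doc_terms[doc_id].count(term)                             # term frequency in this document
    df = sum(1 for terms in doc_terms.values() if term in terms)   # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(doc_terms) / df)                            # inverse document frequency
    return tf * idf                                                # weight = tf * idf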

Documents as vectors
●​ Think of terms as axes
●​ Documents are vectors or points in that multidimensional space
○ The query is represented as a vector as well

⇒ Use the cosine of the angle between the document and query vectors to rank similarity
● Then return the top K documents to the user
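A minimal cosine-similarity sketch over sparse term-weight vectors (each vector is a dict mapping a term to its tf-idf weight; the names are placeholders):

import math

def cosine(vec_a, vec_b):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Rank documents by cosine(query_vec, doc_vec) and return the top K to the user.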
