Natural Language Processing 1

Module 1: NLP

Aarti Dharmani
Main approaches in NLP
1. Rule-based Approach:
• This approach relies on predefined linguistic rules and patterns to
process text.
• Linguistic experts and programmers manually create rules that
encode knowledge about language.
• These rules are used to perform tasks such as tokenization, part-of-
speech tagging, parsing, and information extraction.
• Rule-based systems are based on explicit, handcrafted rules and are
effective for domains with well-defined rules.
2. Statistical Approach:
• Statistical NLP, also known as data-driven or machine learning-based
NLP, utilizes statistical models and algorithms to learn patterns and
structures from large amounts of annotated text data.
• These models use probabilistic techniques to make predictions about
linguistic phenomena based on observed patterns in the training
data.
• Statistical NLP is applied to tasks such as machine translation, named
entity recognition, sentiment analysis, and text classification.
• Statistical models require large amounts of annotated training data
and can automatically extract relevant features from the data.
3. Neural Network Approach:
• Neural networks have revolutionized NLP in recent years.
• Deep learning models, such as recurrent neural networks (RNNs) and
transformers, have shown remarkable performance in various NLP
tasks.
• These models can learn hierarchical representations of text data and
capture complex linguistic patterns.
• They excel in tasks such as language modeling, machine translation,
sentiment analysis, and question answering.
• Neural network approaches require substantial computational
resources and large amounts of annotated data for training.
4. Pre-trained Language Models:
• Pre-trained language models, such as BERT (Bidirectional Encoder
Representations from Transformers) and GPT (Generative Pre-trained
Transformer), have gained popularity in recent years.
• These models are trained on large-scale text data and capture rich
linguistic representations.
• They can be fine-tuned for specific NLP tasks, requiring less task-specific
data for training and achieving state-of-the-art performance
on various tasks.
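As a minimal sketch (not part of the original slides) of how such a pre-trained model is used in practice, assuming the Hugging Face transformers library is installed, a model already fine-tuned for sentiment analysis can be loaded and applied in a few lines:

from transformers import pipeline

# Load a model that has already been pre-trained and fine-tuned for
# sentiment analysis (the library downloads a default model; illustrative only).
classifier = pipeline("sentiment-analysis")

print(classifier("He's interested in taking Economics class"))
# Expected: a list with a label (e.g. POSITIVE/NEGATIVE) and a confidence score.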
Function v/s Content words
“He’s interested in taking Economics class”

• interested, taking, and economics all receive stress on their strong
syllables, while he’s and in do not.
• That’s because content words (e.g., words that carry the most
meaning when we speak, such as nouns, verbs, adjectives, and
adverbs) typically receive stress in phrases, while function words (e.g.,
words that have very little meaning, such as prepositions, articles,
pronouns, and auxiliary verbs) do not.
It takes almost the same time to speak all these statements, as the content
words remain the same!!
Tokenization
• Before understanding tokenization, we need to understand
sentence segmentation.
• How do we find sentence boundaries in a document/paragraph?
• Challenges involved: periods also appear in abbreviations, decimals, etc.
• Binary approach: treat every candidate boundary character ('.', '!', '?')
as a yes/no decision (end of sentence or not), as in the sketch below.
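A minimal rule-based sketch of this binary decision (illustrative only, not from the slides; the abbreviation list is an assumption and far from exhaustive):

ABBREVIATIONS = {"mr.", "dr.", "prof.", "etc.", "e.g.", "i.e."}

def segment(text):
    # Classify each token ending in '.', '!' or '?' as a sentence boundary or not.
    sentences, start, pos = [], 0, 0
    for tok in text.split():
        pos = text.find(tok, pos) + len(tok)
        if tok[-1] in ".!?":
            is_abbrev = tok.lower() in ABBREVIATIONS
            is_number = tok[:-1].replace(".", "").isdigit()   # e.g. "53."
            if not (is_abbrev or is_number):                  # the binary decision
                sentences.append(text[start:pos].strip())
                start = pos
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(segment("Dr. Smith teaches NLP. The fee is 53.4 dollars. Ask Mr. Jones."))
# ['Dr. Smith teaches NLP.', 'The fee is 53.4 dollars.', 'Ask Mr. Jones.']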
Tokenization
• Tokenization is breaking the raw text
into small chunks.
• It’s the process of breaking a stream
of textual data into words, terms,
sentences, symbols, or some other
meaningful elements called tokens.
• Tokenization helps in
interpreting the meaning of the text
by analyzing the sequence of the
words.
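A minimal sketch of word-level tokenization with a regular expression (illustrative only, not the slides' code): word-internal apostrophes and hyphens are kept inside the token, while other punctuation marks become tokens of their own.

import re

def tokenize(text):
    # \w+(?:[-']\w+)*  -> words, keeping internal hyphens/apostrophes together
    # [^\w\s]          -> any other punctuation mark as a separate token
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

print(tokenize("He's interested in taking Economics class."))
# ["He's", 'interested', 'in', 'taking', 'Economics', 'class', '.']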
TTR (Type Token Ratio)
• Token: Occurrence of a word
• Type: Unique words

Eg: ‘I have bottles of cold drinks as well as a bottle of cold red wine’
Tokens: 15, Types: 12, Type Token Ratio = 12/15 = 0.8
80% unique words in the given sentence.
TTR indicates how often, on average, a new ‘word form’
appears in the text.
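A short sketch that reproduces the calculation above (illustrative; tokens are obtained here by simple lowercased whitespace splitting):

text = "I have bottles of cold drinks as well as a bottle of cold red wine"
tokens = text.lower().split()   # 15 tokens
types = set(tokens)             # 12 unique word forms (types)

print(len(tokens), len(types), len(types) / len(tokens))   # 15 12 0.8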
Rank from highest to lowest TTR
• Conversation?
• News?
• Academic content?
• Fiction?

“The value varies with the size of the text!!”


Issues with tokenization
• Dots
Eg: Mr., Dr., etc., U.S.A., 53.4, 19.0760° N, 72.8777° E
• Hyphens
Eg: end of line (words broken across lines in long sentences), lexical (co-curricular),
sententially determined (hand-delivered, case-based)
• Spacing
Eg: San Francisco (two tokens, one name), 你好吗 (how are you; written with no spaces between words)
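A small demonstration (not from the slides) of why these cases are hard: a naive regular-expression split breaks abbreviations, decimals and hyphenated words apart, and keeps a multiword name such as San Francisco as separate tokens.

import re

text = "Dr. Smith paid 53.4 dollars for a co-curricular trip to San Francisco."
naive = re.findall(r"\w+|[^\w\s]", text)
print(naive)
# ['Dr', '.', 'Smith', 'paid', '53', '.', '4', 'dollars', 'for', 'a',
#  'co', '-', 'curricular', 'trip', 'to', 'San', 'Francisco', '.']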
Zipf’s 1st Law
• Count the frequency of each word type in a large corpus
• List the word types in decreasing order of their frequency
• A relationship between the frequency of a word (f) and its position in
the list (its rank r):
f ∝ 1/r
or there is a constant, say k, such that: f · r = k
i.e. the 50th most common word should occur with 3 times the frequency of the 150th
most common word.
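A rough sketch for checking this on a corpus (illustrative; corpus.txt is a placeholder file name, not from the slides):

from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())

ranked = counts.most_common()
for rank in (50, 150, 450):
    word, freq = ranked[rank - 1]
    # Under Zipf's first law, freq * rank should stay roughly constant (= k).
    print(rank, word, freq, freq * rank)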
Zipf’s 2nd Law
• The number of meanings m that a word has obeys the simple law
that m is proportional to the square root of its frequency f:
m ∝ √f , i.e., m ∝ 1/√r
• This means that as words become more and more frequent,
the number of different senses or meanings
with which they can be used also increases.
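One rough way to eyeball this (illustrative only; assumes the NLTK Brown corpus and WordNet data have been downloaded with nltk.download, and the chosen example words are arbitrary):

from collections import Counter
from nltk.corpus import brown, wordnet as wn

freq = Counter(w.lower() for w in brown.words())
for word in ["run", "table", "serendipity"]:
    # Frequent words tend to have more WordNet senses than rare words.
    print(word, "corpus frequency:", freq[word], "senses:", len(wn.synsets(word)))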
≈50% of words are stop words
≈50% consists of words that occur rarely
(meaningful statistical analysis is difficult)

There is another law by Zipf that correlates the length of a word (the number of
characters it has) with its frequency: the frequency of a word is inversely
proportional to its length (in English).
Heaps’ Law
• As the number of words in a document
increases, the rate at which new distinct words appear in the
document slows down.
• It relates the vocabulary size to the number of tokens in the document,
which can be evaluated with the following formula:
V = K · N^β
where V: vocabulary size, K: positive constant (typically between 10 and 100), N: number
of tokens, β: typically between 0.4 and 0.6
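A minimal sketch of observing Heaps' law on a plain-text file (illustrative; corpus.txt is a placeholder file name, and K and β are not fitted here):

vocab = set()
n_tokens = 0

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        for token in line.lower().split():
            n_tokens += 1
            vocab.add(token)
            if n_tokens % 10000 == 0:
                # Vocabulary growth V slows down as N grows, as V = K * N**beta predicts.
                print("N =", n_tokens, "V =", len(vocab))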
