notes
Types of NLP:
NLP can be broadly classified into Classical NLP and Deep Learning-based NLP.
1. Classical NLP
Classical NLP follows a rule-based approach, relying on linguistic rules and statistical
methods. It involves several steps:
Language Detection – Identifying the language of the input text.
Preprocessing – Tokenization, stemming, and stop-word removal.
Feature Extraction – Identifying key words and patterns.
Modeling – Using statistical models to process text.
Output Generation – Performing tasks like sentiment analysis, classification,
translation, and topic modeling.
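For example, Google Translate initially relied on statistical machine translation, a classical
approach that learns word and phrase correspondences from large collections of already
translated text, before it moved to neural models in 2016.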
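The preprocessing and feature extraction steps listed above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the suffix-stripping stemmer, and the function names are assumptions made for the example, and a real system would use an established stemmer and a fuller stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "and", "of", "to", "in"}

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def crude_stem(token):
    """Strip a few common English suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def extract_features(text):
    """Preprocess the text and return bag-of-words counts over the stems."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)

features = extract_features("The cats are chasing the mice and the cat jumped.")
print(features)
```

Here "cats" and "cat" collapse to the same stem, so the resulting counts could be passed to a statistical model as features.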
Working of NLP:
NLP primarily works through three main components:
1. Natural Language Processing (NLP) – It involves computers reading and analyzing
human language. Tasks include parsing, tokenization, and syntactic analysis.
2. Natural Language Generation (NLG) – It enables machines to generate human-like
text. This is used in applications like automated content creation, chatbots, and report
generation.
3. Natural Language Understanding (NLU) – It allows machines to comprehend the
meaning of the text, considering context, semantics, and intent. Examples include
virtual assistants like Siri and Alexa, which interpret voice commands.
.....................................................................................................…………….
Language is often ambiguous, meaning a single word or sentence can have multiple meanings
based on context. Ambiguity is a major challenge in NLU. There are different types of
ambiguity:
a) Lexical Ambiguity
Lexical ambiguity arises when a word has multiple meanings depending on the
context.
Examples:
b) Syntactic Ambiguity
Syntactic ambiguity occurs when the structure of a sentence allows more than one parse,
and therefore more than one interpretation.
Example:
She saw the man with a telescope. (Did she use a telescope to see the man,
or did the man have a telescope?)
Old men and women were taken to the hospital. (Are both old men and old
women taken, or just old men?)
Visiting relatives can be boring. (Is the act of visiting relatives boring, or
are the relatives who visit boring?)
c) Semantic Ambiguity
Semantic ambiguity arises when a sentence has multiple interpretations because of how
its words are understood.
Example:
I saw her duck. (Did you see her lower her head, or did you see her pet
duck?)
He gave her cat food. (Did he give food to her cat, or did he give her some
cat food?)
She cannot bear children. (Does she find children intolerable, or is she
unable to have children?)
d) Pragmatic Ambiguity
Pragmatic ambiguity arises when the meaning of a sentence depends on the context or on
the speaker's intention.
Example:
Can you pass the salt? (Is it a literal question about ability, or a request to
pass the salt?)
The students will present the project next week. (Are the students showing
their project, or are they offering it as a gift?)
We need to discuss your salary. (Is this a positive or negative statement? It
could mean an increase or a potential issue.)
2. Phonology
The study of sounds in a language and how they are used. Every language has specific sounds
that form words. Phonology focuses on pronunciation, intonation, and speech patterns.
Example: The word "read" can be pronounced differently in "I read a book" (past) and "I will
read a book" (future).
4. Morphology
The study of word formation and structure. It looks at how prefixes, suffixes, and root words
change meaning.
Example: "Happy" → "Unhappy" (Adding "un-" changes the meaning to the opposite.)
5. Syntax
The study of sentence structure and word order. Every language has rules for arranging words.
Example: "He eats an apple" is correct, but "Eats he apple an" is incorrect.
6. Semantics
The study of word meanings and how they combine in sentences. It focuses on the literal
meaning of words and sentences.
Example:
"She kicked the bucket" literally means someone kicked a bucket, but it can also mean "she
passed away" (idiomatically).
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Lecture 2
Corpus and text normalization
Corpus:
A corpus is a large collection of texts, often designed to represent a specific population or
language. For example, the Brown Corpus is a well-known collection of written American
English texts. Corpora are used in Natural Language Processing (NLP) to study language
patterns, frequencies, and structures.
It allows researchers and developers to analyse language usage patterns.
A well-structured corpus ensures that NLP models are trained on balanced data.
It helps in quickly accessing word counts, phrase structures, and linguistic patterns.
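As a rough illustration of how a corpus gives quick access to word counts and phrase patterns, the sketch below counts words and adjacent word pairs (bigrams) over a toy corpus. The three sentences are made up for the example, and the helper name `words` is an assumption.

```python
import re
from collections import Counter

# A toy three-document corpus (made up for illustration).
corpus = [
    "The dog barked at the cat.",
    "The cat ignored the dog.",
    "A dog is a loyal friend.",
]

def words(text):
    """Return the lowercase alphabetic words of a text."""
    return re.findall(r"[a-z]+", text.lower())

# Word frequencies across the whole corpus.
word_counts = Counter(w for doc in corpus for w in words(doc))

# Frequencies of adjacent word pairs, counted within each document.
bigram_counts = Counter(
    pair for doc in corpus for pair in zip(words(doc), words(doc)[1:])
)

print(word_counts["the"], word_counts["dog"])
print(bigram_counts[("the", "dog")])
```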
......................................................................................
Rules:
o A token is typically a string of contiguous alphanumeric characters with space
on either side.
o Tokens may include hyphens (-) and apostrophes ('), but no other punctuation
marks.
o Example: The sentence "I love NLP!" would be tokenized into ["I", "love",
"NLP", "!"].
Challenges in Tokenization:
o Typographical Hyphens: These are used to improve the right margins of a
document. They typically occur at syllable boundaries and should be removed
during normalization.
o Lexical Hyphens: These are part of the word itself (e.g., "co-operate," "so-
called," "pro-university"). They should be retained during tokenization.
o Word Grouping: Some hyphenated phrases act as single units (e.g., "take-it-
or-leave-it," "once-in-a-lifetime"). These should be treated as single tokens.
o Example: "data-base" could be tokenized as ["data", "base"] or ["data-base"],
depending on the context.
o Contractions: Words like "I'm" or "isn't" need to be handled carefully. They
can be split into ["I", "am"] or ["is", "not"].
o Abbreviations: Words like "etc." or "Calif." can cause issues, especially when
they appear at the end of a sentence. Treat abbreviations as single tokens (e.g.,
"etc." → ["etc."]). Be cautious with abbreviations that end with a period, as the
period might also serve as a sentence boundary (e.g., "He lives in Calif.").
o Homographs: Words that are spelled the same but have different meanings
(e.g., "saw" as a tool vs. "saw" as the past tense of "see") can cause ambiguity.
Use context to determine the correct meaning of homographs.
o No Space Between Words: In some languages like Chinese, words are not
separated by spaces, making tokenization more challenging. Use language-
specific tokenization methods for languages without spaces between words.
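The token rules and a few of the challenges above can be sketched with a regular expression. This is a minimal sketch under stated assumptions: the pattern, the small contraction table, and the function names are illustrative only. It keeps hyphens and apostrophes inside a token, emits other punctuation as separate tokens, rejoins words split by a hyphen at a line break, and optionally splits contractions.

```python
import re

# A token is a run of letters/digits that may contain internal hyphens or
# apostrophes; any other non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

# Small illustrative table (an assumption); real systems need a larger one.
CONTRACTIONS = {"i'm": ["I", "am"], "isn't": ["is", "not"]}

def join_line_break_hyphens(text):
    """Rejoin a word that was split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def tokenize(text, split_contractions=False):
    """Tokenize text, optionally expanding the contractions in the table."""
    tokens = TOKEN_PATTERN.findall(join_line_break_hyphens(text))
    if not split_contractions:
        return tokens
    expanded = []
    for tok in tokens:
        expanded.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return expanded

print(tokenize("I love NLP!"))
# ['I', 'love', 'NLP', '!']
print(tokenize("It isn't a so-called docu-\nment.", split_contractions=True))
# ['It', 'is', 'not', 'a', 'so-called', 'document', '.']
```

The sketch has known limits. It cannot tell a typographical hyphen from a lexical hyphen that happens to fall at a line break, it splits abbreviations such as "etc." into a word and a period, and it does not handle languages written without spaces.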
Rules:
o Lowercasing: Convert all letters to lowercase to ensure consistency. However,
this can sometimes lead to ambiguity (e.g., "Richard Brown" vs. "brown paint").
o Sentence Start: One heuristic is to lowercase the letters at the beginning of a
sentence, except for proper nouns.
o Example: "The Quick Brown Fox" would be normalized to "the quick brown
fox."
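The two case rules above can be compared with a short sketch. The proper-noun set and the function names are hypothetical and exist only for illustration; a real system would need a gazetteer or a named-entity recognizer to decide what counts as a proper noun.

```python
# Hypothetical set of known proper nouns (an assumption for this example).
PROPER_NOUNS = {"Richard", "Brown", "Ali"}

def lowercase_all(text):
    """Plain lowercasing: consistent, but loses the name/colour distinction."""
    return text.lower()

def lowercase_sentence_start(sentence):
    """Lowercase only the first word, unless it is a known proper noun."""
    first, sep, rest = sentence.partition(" ")
    if first in PROPER_NOUNS:
        return sentence
    return first.lower() + sep + rest

print(lowercase_all("The Quick Brown Fox"))
# the quick brown fox
print(lowercase_sentence_start("The paint Richard Brown bought is brown."))
# the paint Richard Brown bought is brown.
```

The second function keeps "Brown" (the name) distinct from "brown" (the colour), which plain lowercasing would merge.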
3. Sentence Segmentation
Sentence segmentation is the process of dividing text into individual sentences. Sentences are
the basic units of meaning in text, and segmenting them correctly is crucial for further
processing.
Rules:
o Sentences typically end with a period (.), question mark (?), or exclamation
mark (!).
o However, there are exceptions, such as abbreviations (e.g., "Prof." or "Dr.") that
might appear in the middle of a sentence.
o Haplology: Haplology is a linguistic phenomenon where a word or phrase is
shortened by omitting one or more syllables or sounds that are repeated or
similar. In the context of text normalization and sentence segmentation,
haplology specifically refers to cases where a single punctuation mark (like a
period) serves multiple purposes, such as marking both an abbreviation and the
end of a sentence.
Heuristic Algorithm for Sentence Boundary Detection:
1. Place putative sentence boundaries after all occurrences of ., ?, or !.
2. Move the boundary after following quotation marks, if any.
3. Disqualify a period boundary if:
It is preceded by a known abbreviation (e.g., Prof. or vs.) that is
commonly followed by a capitalized proper name (e.g., Prof. Kifor).
It is preceded by a known abbreviation (e.g., etc.) and followed by a
lowercase word (e.g., does).
4. Disqualify a boundary with a ? or ! if it is followed by a lowercase letter or a
known name.
5. Regard other putative sentence boundaries as sentence boundaries.
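The five steps above can be sketched in Python as follows. This is one possible reading of the heuristic, not a reference implementation: the abbreviation sets, the empty `KNOWN_NAMES` set, and the function name `split_sentences` are assumptions, and the lists would need to be extended for real text.

```python
import re

# Illustrative abbreviation lists (assumptions; extend for real text).
TITLE_ABBREVS = {"prof", "dr", "mr", "mrs", "vs"}  # usually followed by a name
OTHER_ABBREVS = {"etc", "calif", "jr"}             # may also end a sentence
KNOWN_NAMES = set()                                # names that may follow ? or !

def split_sentences(text):
    sentences, start = [], 0
    # Steps 1-2: a candidate boundary sits after ., ?, or ! and any
    # quotation marks that immediately follow it.
    for m in re.finditer(r"[.?!][\"'”’]*", text):
        mark, end = m.group()[0], m.end()
        prev = re.search(r"(\w+)$", text[:m.start()])
        nxt = re.match(r"\s*(\w+)", text[end:])
        prev_word = prev.group(1).lower() if prev else ""
        next_word = nxt.group(1) if nxt else ""
        if mark == ".":
            # Step 3, first case: title-like abbreviation, e.g. "Prof. Kifor".
            if prev_word in TITLE_ABBREVS:
                continue
            # Step 3, second case: abbreviation followed by a lowercase word.
            if prev_word in OTHER_ABBREVS and next_word[:1].islower():
                continue
        else:
            # Step 4: ? or ! followed by a lowercase letter or a known name.
            if next_word[:1].islower() or next_word in KNOWN_NAMES:
                continue
        # Step 5: every remaining candidate is a real sentence boundary.
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = "She studies with Prof. Kifor. He eats apple, mango etc. every day. Really? yes!"
for sentence in split_sentences(sample):
    print(sentence)
```

On the sample text this yields three sentences: the periods after "Prof." and "etc." and the question mark after "Really" are all disqualified as boundaries.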
Example:
Ali lives in Calif. He is student of Prof. Kifor and he is interested in India vs. Pakistan cricket
match. He eats apple, orange, mango etc. Does he eat kiwi too? is a question he is often asked
about. Aaah! what a fruit is kiwi. I simply love it!
Solution:
Place putative sentence boundaries after every ., ?, or !. The text is initially split into
the following segments:
1. Ali lives in Calif.
2. He is student of Prof.
3. Kifor and he is interested in India vs.
4. Pakistan cricket match.
5. He eats apple, orange, mango etc.
6. Does he eat kiwi too?
7. is a question he is often asked about.
8. Aaah!
9. what a fruit is kiwi.
10. I simply love it!