
notes

Natural Language Processing (NLP) is a field that focuses on the interaction between computers and human languages, encompassing both classical and deep learning approaches. Key components of NLP include Natural Language Understanding (NLU), Natural Language Generation (NLG), and various challenges such as ambiguity in language. Text normalization is essential for processing raw text data, involving steps like tokenization, sentence segmentation, and normalization of word formats.

Uploaded by Tooba Liaquat
© All Rights Reserved

Lecture 1

Natural language processing


NLP:
Natural language processing (NLP) is a subfield of linguistics, computer science, information
engineering, and artificial intelligence concerned with the interactions between computers and
human (natural) languages, in particular how to program computers to process and analyze
large amounts of natural language data.

• Natural language processing (NLP) is a subfield of artificial intelligence and
computational linguistics. It studies the problems of automated generation and
understanding of natural human languages.
• It is the process of computer analysis of input provided in a human language (natural
language), and conversion of this input into a useful form of representation.
.....................................................................................................…………….

Types of NLP:
NLP can be broadly classified into Classical NLP and Deep Learning-based NLP.

1. Classical NLP
Classical NLP follows a rule-based approach, relying on linguistic rules and statistical
methods. It involves several steps:
 Language Detection – Identifying the language of the input text.
 Preprocessing – Tokenization, stemming, and stop-word removal.
 Feature Extraction – Identifying key words and patterns.
 Modeling – Using statistical models to process text.
 Output Generation – Performing tasks like sentiment analysis, classification,
translation, and topic modeling.
For example, Google Translate initially used classical NLP methods for translation by
analyzing dictionaries and grammatical rules.
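The classical pipeline above (preprocessing, feature extraction, modeling, output) can be sketched as a toy Python example. The stop-word list, the sentiment word lists, and the rule-based classifier below are illustrative assumptions for the sketch, not a production method:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "it"}

def preprocess(text):
    """Preprocessing step: tokenize, lowercase, and remove stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def extract_features(tokens):
    """Feature extraction step: a simple bag-of-words count."""
    return Counter(tokens)

def classify(features,
             positive={"good", "great", "love"},
             negative={"bad", "boring", "hate"}):
    """Modeling step: a toy rule-based sentiment score comparing
    positive vs. negative word hits."""
    score = sum(features[w] for w in positive) - sum(features[w] for w in negative)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

features = extract_features(preprocess("The movie is great and I love it"))
print(classify(features))  # positive
```

Each stage is hand-written here; the point is that classical NLP encodes the rules itself rather than learning them from data.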

2. Deep Learning-based NLP


Deep Learning has revolutionized NLP by enabling machines to learn language patterns
through large datasets. Instead of relying on pre-defined rules, deep learning models:
 Use Dense Embeddings – Transform text into numerical representations.
 Employ Neural Networks – Process text using hidden layers and complex
connections.
 Generate Outputs – Perform sentiment analysis, classification, translation, and other
tasks.
For example, GPT models (like ChatGPT) use deep learning to generate human-like
responses by predicting the next word based on previous text.
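A minimal sketch of what dense embeddings provide: each word becomes a vector, and geometric closeness approximates semantic closeness. The three-dimensional vectors below are invented for illustration; real models learn hundreds of dimensions from large corpora:

```python
import math

# Toy dense embeddings (invented values; real ones are learned from data)
embeddings = {
    "king":  [0.80, 0.65, 0.10],
    "queen": [0.78, 0.70, 0.12],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction, 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much lower
```

Neural networks operate on such vectors, which is what lets them generalize beyond exact word matches.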
.....................................................................................................…………….

Working of NLP:
NLP primarily works through three main components:
1. Natural Language Processing (NLP) – It involves computers reading and analyzing
human language. Tasks include parsing, tokenization, and syntactic analysis.
2. Natural Language Generation (NLG) – It enables machines to generate human-like
text. This is used in applications like automated content creation, chatbots, and report
generation.
3. Natural Language Understanding (NLU) – It allows machines to comprehend the
meaning of the text, considering context, semantics, and intent. Examples include
virtual assistants like Siri and Alexa, which interpret voice commands.
.....................................................................................................…………….

Natural Language Understanding (NLU)


Natural Language Understanding (NLU) is a subfield of Natural Language Processing (NLP)
that focuses on enabling machines to comprehend, interpret, and analyze human language in a
meaningful way. It allows computers to process and extract useful information from spoken or
written language.
1. Ambiguity in Language

Language is often ambiguous, meaning a single word or sentence can have multiple meanings
based on context. Ambiguity is a major challenge in NLU. There are different types of
ambiguity:
a) Lexical Ambiguity
Lexical ambiguity arises when a word has multiple meanings depending on the
context.
Examples:

 She went to the bank to withdraw money. (Bank as a financial institution)
vs. He sat on the bank of the river. (Bank as the edge of a river)
 She saw a bat flying at night. (Bat as an animal) vs. He brought his bat to
the cricket match. (Bat as sports equipment)
 He will lead the team to victory. (Lead as a verb meaning to guide) vs. The
pipe is made of lead. (Lead as a metal)

b) Syntactic Ambiguity
Syntactic ambiguity occurs when the structure of a sentence allows multiple
possible parses, making its meaning unclear.
Example:
 She saw the man with a telescope. (Did she use a telescope to see the man,
or did the man have a telescope?)
 Old men and women were taken to the hospital. (Are both old men and old
women taken, or just old men?)
 Visiting relatives can be boring. (Is the act of visiting relatives boring, or
are the relatives who visit boring?)

c) Semantic Ambiguity
Semantic ambiguity arises when a sentence has multiple interpretations
because its words or phrases can be understood in more than one way.
Example:
 I saw her duck. (Did you see her lower her head, or did you see her pet
duck?)
 He gave her cat food. (Did he give food to her cat, or did he give her some
cat food?)
 She cannot bear children. (Does she find children intolerable, or is she
unable to have children?)

d) Pragmatic Ambiguity
Pragmatic ambiguity arises when the meaning of a sentence depends on the
context or the speaker's intention.
Example:
 Can you pass the salt? (Is it a literal question about ability, or a request to
pass the salt?)
 The students will present the project next week. (Are the students showing
their project, or are they offering it as a gift?)
 We need to discuss your salary. (Is this a positive or negative statement? It
could mean an increase or a potential issue.)

2. Phonology
The study of sounds in a language and how they are used. Every language has specific sounds
that form words. Phonology focuses on pronunciation, intonation, and speech patterns.
Example: The word "read" can be pronounced differently in "I read a book" (past) and "I will
read a book" (future).

3. Pragmatics (Contextual Analysis)


The study of how context influences language meaning. Words or sentences can have different
meanings based on who is speaking, where, and why.
Example: If someone asks, "What do you want to eat?" and another responds, "Ice cream is
good this time of year," they are implying they want ice cream without directly saying it.

4. Morphology
The study of word formation and structure. It looks at how prefixes, suffixes, and root words
change meaning.
Example: "Happy" → "Unhappy" (Adding "un-" changes the meaning to the opposite.)

5. Syntax
The study of sentence structure and word order. Every language has rules for arranging words.
Example: "He eats an apple" is correct, but "Eats he apple an" is incorrect.

6. Semantics
The study of word meanings and how they combine in sentences. It focuses on the literal
meaning of words and sentences.
Example:
"She kicked the bucket" literally means someone kicked a bucket, but it can also mean "she
passed away" (idiomatically).

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Lecture 2
Corpus and text normalization
Corpus:
A corpus is a large collection of texts, often designed to represent a specific population or
language. For example, the Brown Corpus is a well-known collection of written American
English texts. Corpora are used in Natural Language Processing (NLP) to study language
patterns, frequencies, and structures.
It allows researchers and developers to analyze language usage patterns.
A well-structured corpus ensures that NLP models are trained on balanced, representative data.
It enables quick access to word counts, phrase structures, and linguistic patterns.
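Corpus-based frequency lookups can be sketched with a toy corpus. The three sentences below are invented for illustration; real corpora such as the Brown Corpus contain on the order of a million words:

```python
from collections import Counter

# A tiny invented corpus; real corpora are vastly larger and balanced by genre
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

# Flatten the corpus into tokens and count word frequencies
tokens = [word for sentence in corpus for word in sentence.split()]
freq = Counter(tokens)

print(freq["the"])           # 4
print(freq.most_common(1))   # [('the', 4)]
```

This kind of frequency table is the raw material for the statistical models mentioned in Lecture 1.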
......................................................................................

Text Normalization in NLP:


Text normalization is a critical step in Natural Language Processing (NLP) that involves
transforming text into a consistent and standardized format. This process is essential because
raw text data is often messy, inconsistent, and contains various anomalies that can hinder
effective processing and analysis. Below, we will discuss the key components of text
normalization, including tokenization, sentence segmentation, handling abbreviations,
contractions, hyphens, and other challenges.
1. Tokenization
Tokenization is the process of breaking down a text into smaller units called tokens. Tokens
can be words, numbers, punctuation marks, or other meaningful elements. Tokenization is the
first step in text processing, as it converts raw text into a format that can be easily analyzed by
NLP systems.

 Rules:
o A token is typically a string of contiguous alphanumeric characters with space
on either side.

o Tokens may include internal hyphens (-) and apostrophes ('); other
punctuation marks are usually split off as separate tokens.
o Example: The sentence "I love NLP!" would be tokenized into ["I", "love",
"NLP", "!"].
 Challenges in Tokenization:
o Typographical Hyphens: These are used to improve the right margins of a
document. They typically occur at syllable boundaries and should be removed
during normalization.
o Lexical Hyphens: These are part of the word itself (e.g., "co-operate," "so-
called," "pro-university"). They should be retained during tokenization.
o Word Grouping: Some hyphenated phrases act as single units (e.g., "take-it-
or-leave-it," "once-in-a-lifetime"). These should be treated as single tokens.
o Example: "data-base" could be tokenized as ["data", "base"] or ["data-base"],
depending on the context.
o Contractions: Words like "I'm" or "isn't" need to be handled carefully. They
can be split into ["I", "am"] or ["is", "not"].
o Abbreviations: Words like "etc." or "Calif." can cause issues, especially when
they appear at the end of a sentence. Treat abbreviations as single tokens (e.g.,
"etc." → ["etc."]). Be cautious with abbreviations that end with a period, as the
period might also serve as a sentence boundary (e.g., "He lives in Calif.").
o Homographs: Words that are spelled the same but have different meanings
(e.g., "saw" as a tool vs. "saw" as the past tense of "see") can cause ambiguity.
Use context to determine the correct meaning of homographs.
o No Space Between Words: In some languages like Chinese, words are not
separated by spaces, making tokenization more challenging. Use language-
specific tokenization methods for languages without spaces between words.
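The basic rules above (alphanumeric runs, internal hyphens and apostrophes kept, other punctuation split off) can be sketched as a single regular expression. This is a simplification: it does not resolve typographical vs. lexical hyphens, abbreviations, or languages without spaces:

```python
import re

def tokenize(text):
    """Split text into tokens: alphanumeric runs that may contain internal
    hyphens or apostrophes, plus standalone punctuation marks."""
    pattern = r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\w\s]"
    return re.findall(pattern, text)

print(tokenize("I love NLP!"))
# ['I', 'love', 'NLP', '!']
print(tokenize("The so-called co-operate plan isn't done."))
# ['The', 'so-called', 'co-operate', 'plan', "isn't", 'done', '.']
```

Note that "so-called" and "isn't" survive as single tokens because the hyphen and apostrophe are internal, while the sentence-final period is split off.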

2. Normalizing Word Formats


Normalizing word formats involves converting text into a standard format, such as converting
all letters to lowercase or handling uppercase letters at the beginning of sentences.
Normalization helps in reducing variability in text, making it easier to process and analyze.

 Rules:
o Lowercasing: Convert all letters to lowercase to ensure consistency. However,
this can sometimes lead to ambiguity (e.g., "Richard Brown" vs. "brown paint").
o Sentence Start: One heuristic is to lowercase the letters at the beginning of a
sentence, except for proper nouns.

o Example: "The Quick Brown Fox" would be normalized to "the quick brown
fox."
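A minimal sketch of the lowercasing rule, including the ambiguity it can introduce:

```python
def normalize(text):
    """Lowercase everything. Simple and consistent, but it conflates
    'Brown' (a surname) with 'brown' (a color), the ambiguity noted above."""
    return text.lower()

print(normalize("The Quick Brown Fox"))   # the quick brown fox
print(normalize("Richard Brown"))          # richard brown (name information lost)
```

Real systems often combine lowercasing with a proper-noun check (e.g., named-entity recognition) to avoid losing that information.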
3. Sentence Segmentation

Sentence segmentation is the process of dividing text into individual sentences. Sentences are
the basic units of meaning in text, and segmenting them correctly is crucial for further
processing.
 Rules:
o Sentences typically end with a period (.), question mark (?), or exclamation
mark (!).
o However, there are exceptions, such as abbreviations (e.g., "Prof." or "Dr.") that
might appear in the middle of a sentence.
o Haplology: Haplology is a linguistic phenomenon where a word or phrase is
shortened by omitting one or more syllables or sounds that are repeated or
similar. In the context of text normalization and sentence segmentation,
haplology specifically refers to cases where a single punctuation mark (like a
period) serves multiple purposes, such as marking both an abbreviation and the
end of a sentence.
 Heuristic Algorithm for Sentence Boundary Detection:
1. Place putative sentence boundaries after all occurrences of ., ?, or !.
2. Move the boundary after following quotation marks, if any.
3. Disqualify a period boundary if:
 It is preceded by a known abbreviation (e.g., Prof. or vs.) that is
commonly followed by a capitalized proper name (e.g., Prof. Kifor).
 It is preceded by a known abbreviation (e.g., etc.) and followed by a
lowercase word (like does).
4. Disqualify a boundary with a ? or ! if it is followed by a lowercase letter or a
known name.
5. Regard other putative sentence boundaries as sentence boundaries.
 Example:
Ali lives in Calif. He is student of Prof. Kifor and he is interested in India vs. Pakistan cricket
match. He eats apple, orange, mango etc. Does he eat kiwi too? is a question he is often asked
about. Aaah! what a fruit is kiwi. I simply love it!

Solution:

 Place putative sentence boundaries after every ., ?, or !. The text is initially split into
the following segments:
1. Ali lives in Calif.
2. He is student of Prof.
3. Kifor and he is interested in India vs.
4. Pakistan cricket match.
5. He eats apple, orange, mango etc.
6. Does he eat kiwi too?
7. is a question he is often asked about.
8. Aaah!
9. what a fruit is kiwi.
10. I simply love it!

 Disqualify a period boundary if it is preceded by a title-like abbreviation (e.g.,
"Prof.", "vs.") followed by a capitalized word, or by a general abbreviation
(e.g., "etc.") followed by a lowercase word.
"Calif." is followed by an uppercase word ("He"), but "Calif." is not a title-like
abbreviation, so the period is not disqualified.
"Prof." is followed by an uppercase word ("Kifor"), so the period is disqualified.
"vs." is followed by an uppercase word ("Pakistan"), so the period is disqualified.
"etc." is followed by an uppercase word ("Does"), so the period is not disqualified.
1. Ali lives in Calif.
2. He is student of Prof. Kifor and he is interested in India vs. Pakistan cricket
match.
3. He eats apple, orange, mango etc.
4. Does he eat kiwi too?
5. is a question he is often asked about.
6. Aaah!
7. what a fruit is kiwi.
8. I simply love it!
 Disqualify a boundary with ? or ! if it is followed by a lowercase letter or a known
name.
"Does he eat kiwi too?" is followed by a lowercase word ("is"), so the boundary is
disqualified. The sentence continues after "too?".
"Aaah!" is followed by a lowercase word ("what"), so the boundary is disqualified.
The sentence continues after "Aaah!".
1. Ali lives in Calif.
2. He is student of Prof. Kifor and he is interested in India vs. Pakistan cricket
match.
3. He eats apple, orange, mango etc.
4. Does he eat kiwi too? is a question he is often asked about.
5. Aaah! what a fruit is kiwi.
6. I simply love it!
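The heuristic above can be sketched in Python. The abbreviation lists here are illustrative stand-ins for a real lexicon, and the quotation-mark rule (step 2) is omitted for brevity:

```python
import re

ABBREV_BEFORE_NAME = {"Prof.", "Dr.", "vs."}  # usually followed by a capitalized name
ABBREV_GENERAL = {"etc.", "Calif."}           # may or may not end a sentence

def segment(text):
    """Place putative boundaries after ., ?, or !, then re-join a segment
    with the next one whenever the boundary is disqualified."""
    parts = [p.strip() for p in re.findall(r"[^.?!]+[.?!]", text)]
    sentences, buffer = [], ""
    for i, part in enumerate(parts):
        buffer = (buffer + " " + part).strip()
        nxt = parts[i + 1].split()[0] if i + 1 < len(parts) else None
        last = buffer.split()[-1]
        join = False
        if nxt is not None:
            if buffer.endswith("."):
                # Disqualify: title abbreviation + capitalized name,
                # or general abbreviation + lowercase word
                join = (last in ABBREV_BEFORE_NAME and nxt[0].isupper()) or \
                       (last in ABBREV_GENERAL and nxt[0].islower())
            elif buffer[-1] in "?!":
                # Disqualify ? or ! boundaries followed by a lowercase word
                join = nxt[0].islower()
        if not join:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

text = ("Ali lives in Calif. He is student of Prof. Kifor and he is interested "
        "in India vs. Pakistan cricket match. He eats apple, orange, mango etc. "
        "Does he eat kiwi too? is a question he is often asked about. "
        "Aaah! what a fruit is kiwi. I simply love it!")
for s in segment(text):
    print(s)
```

Running this on the lecture's example reproduces the six sentences of the worked solution.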
