NLP - Viva - Que & Ans

BASICS

 What is NLP?
NLP (Natural Language Processing) enables computers to understand, interpret, and
generate human language.

 What is Processed Under NLP?


NLP processes text, speech, syntax, semantics, entities, and sentiment.

 Difference Between NLU and NLG?


- NLU : Understanding language (meaning and intent).
- NLG : Generating natural language from data.

 Real-time Examples:
- NLU : Siri understanding a question.
- NLG : A chatbot generating a response.

 Major Problem in Understanding Language?


Ambiguity , where words or phrases have multiple meanings depending on context.

 What is Context?
Context refers to the surrounding text or situation that helps clarify meaning.

 Steps in NLP:
1. Tokenization
2. POS Tagging
3. Lemmatization/Stemming
4. Named Entity Recognition
5. Parsing
6. Sentiment Analysis
7. Machine Translation
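
A minimal sketch of several of these steps in Python using spaCy (an illustrative assumption: the en_core_web_sm model has been downloaded; the sentence is made up):

    # Tokenization, POS tagging, lemmatization, and NER with spaCy.
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

    for token in doc:
        # token.text = surface form, token.pos_ = part of speech, token.lemma_ = dictionary form
        print(token.text, token.pos_, token.lemma_)

    for ent in doc.ents:
        # Named entities, e.g. Apple -> ORG, $1 billion -> MONEY
        print(ent.text, ent.label_)
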
 What is Semantics?
Semantics is the study of meaning in language.

 What is Syntax?
Syntax is the structure of sentences and the grammatical rules governing word order.

 What is Discourse?
Discourse is how sentences relate to form meaningful paragraphs or conversations.

 What is Pragmatics?
Pragmatics deals with language use in context, interpreting beyond literal meanings.

 Example of Language Ambiguity:


"The chicken is ready to eat" could mean either the bird is hungry or the food is
prepared.

 What is Grammar?
Grammar is the set of rules that governs sentence structure in a language.

Steps in NLP - Lexical analysis


 Lexical Analysis:
Lexical analysis, also known as scanning or tokenization, is the process of breaking up a
stream of text into individual words, phrases, or tokens. It is the first stage of an NLP
pipeline (and, analogously, of compiler or interpreter design), where the input is analyzed
to identify the basic building blocks of the language.

 Lexeme (Token):
A lexeme, or token, is a basic unit of meaning in a language, such as:
- Keywords (e.g., if, while)
- Identifiers (e.g., variable names)
- Literals (e.g., numbers, strings)
- Operators (e.g., +, -, *)
- Symbols (e.g., parentheses, brackets)

 Goals of Lexical Analysis:


The primary goals of lexical analysis are:
1. Identify valid tokens
2. Ignore irrelevant characters (e.g., whitespace, comments)
3. Detect syntax errors
4. Prepare input for syntax analysis (parsing)

 Levenshtein Distance:
Levenshtein distance, also known as edit distance, measures the minimum number of
operations (insertions, deletions, substitutions) required to transform one string into
another.
Example:
- "kitten" → "sitting" (Levenshtein distance = 3)
- Substitute "k" with "s"
- Substitute "e" with "i"
- Append "g"

 Applications of Levenshtein Distance:


Levenshtein distance has various applications:
1. Spell checking: Suggest corrections for misspelled words.
2. Text similarity measurement: Compare similarity between texts.
3. Data compression: Measure compression efficiency.
4. Plagiarism detection: Identify similarities between documents.
5. Speech recognition: Measure similarity between spoken words.
6. Bioinformatics: Compare DNA or protein sequences.
7. Natural Language Processing (NLP): Measure string similarity between words or sentences.
Other applications include:
- Auto-complete features
- Error detection and correction
- Information retrieval
- Machine learning
The Levenshtein distance algorithm is widely used in many areas where text or
sequence comparison is necessary.

Syntax analysis
 What is checked in syntax analysis?
Syntax analysis checks if the sequence of tokens (words) generated from the lexical
analysis forms a valid structure as per the grammar of the language. It ensures that the
source code follows the language's rules, like matching brackets or correct order of
operators.

 What do we want to ensure by doing syntactic analysis?


- By doing syntactic analysis (parsing), we want to ensure that the program is
syntactically correct, meaning it follows the correct structure, like where keywords,
operators, and variables should appear.

 What is the role of grammar in syntax analysis?


- Grammar defines the rules of how statements and expressions should be structured
in the programming language. Syntax analysis uses this grammar to determine if the
input code is valid.

 What are terminals and non-terminals in a grammar?


- Terminals : These are the actual characters or symbols from the language (e.g.,
keywords, operators).
- Non-terminals : These represent combinations of terminals, used to define the
structure (e.g., expressions, statements).

 What are parse trees, and how many types are there?
- A parse tree is a tree structure that shows how a string (source code) is derived from
a grammar by breaking it down into terminals and non-terminals.
- A parse tree can be built using two main derivation orders: leftmost derivation and
rightmost derivation. The difference is the order in which non-terminals are expanded.

 Which parse tree is good and why?


- The "good" parse tree is usually the one that reflects the most efficient or correct
structure as per the language's semantic rules. For example, in math expressions,
respecting operator precedence is important, so a tree that does this is preferred.

 How do you decide if a language is possible by a given grammar?


- If a grammar can generate all valid strings (statements) of a language, then it defines
that language. By deriving valid statements from the grammar, you can check if a
language is possible.

 What is context-free grammar (CFG)?


- A context-free grammar is a type of grammar where each production rule has a
single non-terminal on the left-hand side. It can generate languages that are more
complex than regular languages (those described by regular expressions).

 What are the rules of Chomsky Normal Form (CNF)?


- In Chomsky Normal Form, every production rule must be of one of these forms:
- A → BC (where A, B, and C are non-terminals)
- A → a (where A is a non-terminal and a is a terminal)
- A → ε (only for the start symbol and only if the language includes the empty string)

 What is the need for the CKY algorithm?


- The CKY (Cocke-Kasami-Younger) algorithm is used to efficiently parse strings that
belong to a context-free grammar, especially when the grammar is in Chomsky Normal
Form. It helps in deciding if a string can be generated by the grammar.
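
A minimal CKY recognizer sketch in Python for a toy grammar in CNF (the grammar and sentence are illustrative assumptions):

    # Toy CNF grammar: each right-hand side (a tuple) maps to the set of left-hand sides.
    grammar = {
        ("NP", "VP"): {"S"},
        ("Det", "N"): {"NP"},
        ("V", "NP"): {"VP"},
        ("the",): {"Det"},
        ("dog",): {"N"}, ("cat",): {"N"},
        ("chased",): {"V"},
    }

    def cky_recognize(words):
        n = len(words)
        # table[i][j] = set of non-terminals that derive words[i:j]
        table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            table[i][i + 1] |= grammar.get((w,), set())
        for span in range(2, n + 1):
            for i in range(0, n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for B in table[i][k]:
                        for C in table[k][j]:
                            table[i][j] |= grammar.get((B, C), set())
        return "S" in table[0][n]

    print(cky_recognize("the dog chased the cat".split()))  # True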

 What is PCFG and when is it useful?


- Probabilistic Context-Free Grammar (PCFG) assigns probabilities to each production
rule. It's useful when a language can have multiple valid parse trees, and we want to
choose the most likely one based on real-world data.
 If a language has multiple parse trees, how do you decide which parse tree is good?
- If a language has multiple parse trees, we can use PCFG to assign probabilities and
choose the tree that represents the most likely or appropriate interpretation. Operator
precedence and associativity rules also help in deciding.

 How can you figure out that a language is possible to derive using multiple parse
trees?
- If a grammar allows multiple distinct derivations (and hence multiple parse trees) for
the same string, the grammar is ambiguous. You can check this by generating all possible
parse trees for a string and seeing whether more than one exists.

 Can a language have multiple parse trees?


- Yes, some languages are ambiguous, meaning a single sentence (or string) can be
parsed in more than one way, leading to multiple parse trees. For example,
mathematical expressions without clear operator precedence can be ambiguous.

Language modelling
 What is language modeling?
Language modeling is the task of predicting the next word or sequence of words in a
sentence based on the previous words. It helps in various NLP tasks like speech
recognition, translation, and text generation.

 What do we want to achieve in this task?


In language modeling, the goal is both language understanding (grasping patterns in
the text) and language generation (producing coherent text).

 What is n-gram modeling?


N-gram modeling is a simple language model that predicts the next word based on
the previous n-1 words. For example, a bigram model uses the previous one word,
and a trigram model uses the previous two words.

 What is conditional probability?


Conditional probability is the probability of an event occurring, given that another
event has already occurred. In language modeling, it's the probability of a word given
the previous word(s).
 What is the probability chain rule?
The probability chain rule breaks down the probability of a sequence of words into
the product of conditional probabilities. For example, for words w1, w2, w3:
P(w1, w2, w3) = P(w1) · P(w2 | w1) · P(w3 | w1, w2)

 What is the Markov assumption?


The Markov assumption simplifies the language model by assuming that the
probability of a word depends only on a fixed number of previous words (not all
previous words). In an n-gram model, the nth word depends only on the previous
n-1 words.

 How do you calculate the probability of a word using a bigram model?


In a bigram model, the probability of a word wn given the previous word wn-1 is
calculated as:
P(wn | wn-1) = Count(wn-1, wn) / Count(wn-1)
This is the ratio of the frequency of the word pair to the frequency of the first word.
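
A minimal sketch of this count-based estimate in Python on a toy corpus (the corpus and the </s> end-of-sentence marker are illustrative assumptions):

    from collections import Counter

    corpus = "i like nlp </s> i like parsing </s> i love nlp </s>".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_bigram(prev, word):
        # MLE estimate: P(word | prev) = Count(prev, word) / Count(prev)
        return bigrams[(prev, word)] / unigrams[prev]

    print(p_bigram("i", "like"))    # 2/3
    print(p_bigram("like", "nlp"))  # 1/2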

 What do you mean by a corpus?


A corpus is a large collection of text used for training language models. It contains
various sentences and is essential for learning patterns in language.

 What is smoothing, and why is it needed?


Smoothing is a technique used to handle unseen word combinations (n-grams) in
the training data by assigning a small probability to these combinations. It helps to
avoid zero probabilities in the model when encountering new word pairs.
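
A minimal add-one (Laplace) smoothing sketch, continuing the toy bigram counts above (one of several possible smoothing methods):

    def p_laplace(prev, word, unigrams, bigrams, vocab_size):
        # Add-one smoothing: every possible bigram is treated as seen at least once,
        # so unseen pairs receive a small non-zero probability.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

    # An unseen pair such as ("love", "parsing") now gets a small probability instead of 0.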

 How do you say that your LM is good? Which metric is used to evaluate a language
model?
A good language model predicts text well. Perplexity is a common metric used to
evaluate LMs. Lower perplexity means the model is better at predicting the next
word.
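
As a rough sketch, the perplexity of a test sequence under a bigram model can be computed as follows (p is assumed to be a conditional probability function such as the one in the earlier sketch):

    import math

    def perplexity(words, p):
        # PP = exp( -(1/N) * sum_i log P(w_i | w_{i-1}) ), with N predicted words
        log_prob = sum(math.log(p(prev, w)) for prev, w in zip(words, words[1:]))
        return math.exp(-log_prob / (len(words) - 1))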

 What is Maximum Likelihood Estimation (MLE)?


MLE is a method to estimate the model parameters that maximize the likelihood of
the observed data. In language modeling, it involves choosing probabilities for words
that make the training text most likely.

Parts of Speech Tagging-POS tagging


 What is POS tagging?
POS (Part-of-Speech) tagging is the process of labeling each word in a sentence with
its appropriate part of speech, such as noun, verb, adjective, etc.

 Why do we need POS tagging?


POS tagging helps computers understand the structure of a sentence, enabling them
to process language for tasks like translation, sentiment analysis, and information
retrieval.

 How many tags are in usage in current times for the English language?
The number of POS tags depends on the tagging system used. Common tag sets like
the Penn Treebank use around 36 tags, while more detailed systems may have more.

 What is transition probability?


Transition probability (sometimes loosely called transmission probability) is the
likelihood of one POS tag following another in a sequence. It helps in predicting the
correct tags based on context.

 What is emission probability?


Emission probability is the likelihood of a specific word being associated with a
particular POS tag. It links words to their possible parts of speech.

 What is the purpose of the Viterbi algorithm?


The Viterbi algorithm is used to find the most likely sequence of POS tags for a
sentence based on transition and emission probabilities.

 What is the sole aim of the Viterbi algorithm?


Its sole aim is to identify the most probable sequence of hidden states (POS tags)
that could generate the observed data (the sentence).
 How is POS tagging related to Natural Language Processing (NLP)?
POS tagging is a fundamental task in NLP that helps in understanding the grammatical
structure of sentences, which is critical for various language processing tasks like
machine translation, speech recognition, and information extraction.
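
A minimal POS-tagging sketch using NLTK (assuming the punkt and averaged_perceptron_tagger resources have been downloaded):

    import nltk
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('quick', 'JJ'), ...]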

Text representations
 Why do we need text representations?
Text representations convert words or documents into numerical formats so that
machine learning models can process and analyze them.

 What is one-hot vectorization and how do we do that?


One-hot vectorization represents each word as a vector of binary values, where only one
element (the word's position) is 1, and all others are 0. It captures whether a word is
present but loses word relationships.

 What is Bag of Words (BoW) and why do we call it that?


BoW is a text representation that counts how many times each word appears in a
document. We call it "Bag of Words" because it treats a document as an unordered
collection of words without considering their order.

 What is count vectorization?


Count vectorization converts text into vectors based on word frequencies, where each
word is assigned a count of how often it appears in a document.
 What exactly does IDF tell about a word?
IDF tells how unique or rare a word is across a collection of documents. Higher IDF
means the word is less common and more significant.
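
A minimal sketch of count and TF-IDF vectorization with scikit-learn (the two example documents are illustrative assumptions):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["the movie was great", "the movie was terrible"]

    bow = CountVectorizer()            # Bag of Words / count vectorization
    print(bow.fit_transform(docs).toarray())
    print(bow.get_feature_names_out())

    tfidf = TfidfVectorizer()          # weights counts by IDF, so rarer words score higher
    print(tfidf.fit_transform(docs).toarray())

    # ngram_range=(1, 2) in either vectorizer would add bigram features for some word-order context.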

 What is N-gram based representation?


N-gram representation captures sequences of 'n' consecutive words instead of
treating words independently. For example, bigrams (n=2) capture two-word
combinations, providing some word order context.

 What are the drawbacks of Bag of Words representation?


 Ignores word order and context.
 Doesn't capture semantics (meaning).
 High dimensionality as vocabulary size increases.

 What do you mean by semantics?


Semantics refers to the meaning of words and how they relate to each other in
context.

 What do we want to achieve by representing text/documents in Bag of Words?


We aim to transform text into a numerical format for processing, while preserving
word frequency information, so that it can be used in machine learning models.
 What do you mean by dimensionality reduction?
Dimensionality reduction refers to techniques that reduce the number of features
(dimensions) in data, simplifying it while preserving important information.

 What is Latent Semantic Analysis (LSA) and how do we perform it?


LSA is a technique that reduces dimensionality by identifying relationships between
words and documents based on their co-occurrence patterns. It’s performed using
Singular Value Decomposition (SVD) on a term-document matrix.

 What does SVD stand for?


SVD stands for Singular Value Decomposition, a mathematical method used to
decompose a matrix into three matrices (U, Σ, and V transpose); keeping only the
largest singular values gives a lower-dimensional approximation, which helps in
dimensionality reduction.
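
A minimal LSA sketch with scikit-learn, applying truncated SVD to a TF-IDF term-document matrix (the toy documents and the choice of 2 components are illustrative assumptions):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["cats and dogs", "dogs chase cats", "stocks and bonds", "bond markets fell"]
    X = TfidfVectorizer().fit_transform(docs)   # document-term matrix

    lsa = TruncatedSVD(n_components=2)          # keep 2 latent dimensions
    doc_vectors = lsa.fit_transform(X)          # documents in the reduced space
    print(doc_vectors.shape)                    # (4, 2)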

Topic Modeling
 What do we want to achieve via topic modeling?
We aim to discover hidden themes or topics within a large collection of documents,
helping us organize and understand the data better.

 What is the outcome of topic modeling?


The outcome is a set of topics, where each topic is represented by a group of words,
and each document is associated with a mixture of these topics.

 What is LDA (Latent Dirichlet Allocation)?


LDA is a popular topic modeling algorithm that assumes documents are mixtures of
topics, and each topic is a mixture of words. It assigns topics to words in documents
based on word co-occurrences.

 What is a document-topic matrix?


The document-topic matrix shows how much each topic contributes to each
document. Rows represent documents, and columns represent topics.

 What is a word-topic matrix?


The word-topic matrix shows how strongly each word is associated with each topic.
Rows represent words, and columns represent topics.

 How do you decide the number of topics?


The number of topics is typically decided through experimentation or by using
techniques like cross-validation. You might also use domain knowledge to estimate
the ideal number of topics.

 What is the distribution of distributions under Dirichlet Allocation?


Dirichlet distribution is a probability distribution over distributions. In LDA, it helps
in defining a distribution of topics for each document and a distribution of words for
each topic. It controls the sparsity of these distributions.
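
A minimal LDA sketch with scikit-learn (the documents and the number of topics are illustrative assumptions):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["cats dogs pets", "dogs chase cats", "stocks bonds markets", "markets fell today"]
    X = CountVectorizer().fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)     # document-topic matrix (4 docs x 2 topics)
    word_topic = lda.components_         # topic-word weights (2 topics x vocabulary size)
    print(doc_topic.shape, word_topic.shape)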

Word sense disambiguation

 What do you mean by word sense disambiguation?


Word sense disambiguation (WSD) is the process of determining the correct
meaning (sense) of a word in a given context when the word has multiple meanings.

 What is the relation between a dictionary and sense disambiguation?


A dictionary provides the different meanings (senses) of a word, and word sense
disambiguation helps in selecting the right sense from the dictionary based on the
context in which the word is used.

 What is the Lesk algorithm?


The Lesk algorithm is a method for word sense disambiguation that assigns the
correct meaning to a word by comparing the dictionary definitions of the word's
senses with the context in which the word appears. It chooses the sense with the
most overlapping words between the definition and the surrounding context.
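
A minimal sketch using NLTK's built-in (simplified) Lesk implementation (assuming the wordnet and punkt resources have been downloaded):

    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk
    # nltk.download("wordnet"); nltk.download("punkt")

    context = word_tokenize("I went to the bank to deposit my money")
    print(lesk(context, "bank"))  # prints the WordNet synset chosen for "bank" in this context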

Lab based viva questions


 What is text preprocessing?
Text preprocessing involves cleaning and transforming raw text into a usable format
for analysis or modeling. It includes steps like tokenization, removing stopwords,
and converting to lowercase.

 What is stemming?
Stemming reduces words to their root form by cutting off prefixes and suffixes, e.g.,
"running" becomes "run."

 What is lemmatization?
Lemmatization reduces words to their base or dictionary form (lemma), considering
the word’s meaning, e.g., "running" becomes "run" but keeps the correct meaning.
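
A minimal sketch contrasting the two with NLTK (assuming the wordnet resource has been downloaded for the lemmatizer):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("studies"))                   # 'studi'  (crude suffix stripping)
    print(lemmatizer.lemmatize("studies", pos="v"))  # 'study'  (dictionary form)
    print(stemmer.stem("running"))                   # 'run'
    print(lemmatizer.lemmatize("running", pos="v"))  # 'run'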

 What does NLTK stand for?


NLTK stands for Natural Language Toolkit, a Python library for working with
human language data.

 What is spaCy?
spaCy is an advanced Python library for Natural Language Processing (NLP), providing
tools for tokenization, part-of-speech tagging, and more.

 What are stopwords?


Stopwords are common words (like "the", "is", "in") that are usually removed from text
during preprocessing because they don't carry much meaning.
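
A minimal stopword-removal sketch with NLTK (assuming the stopwords resource has been downloaded):

    from nltk.corpus import stopwords
    # nltk.download("stopwords")

    stop = set(stopwords.words("english"))
    tokens = ["the", "movie", "is", "surprisingly", "good"]
    print([t for t in tokens if t not in stop])  # ['movie', 'surprisingly', 'good']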

 What are X and Y in machine learning?


 X represents the input data (features).
 Y represents the output or target (labels).

 What does the IMDB review dataset contain?


The IMDB dataset contains movie reviews, labeled as positive or negative sentiments.
We use it to train models for sentiment analysis (positive/negative classification).

 What is data and what is a label?


 Data refers to the input features (e.g., text, images).
 Label refers to the correct output or target associated with the data (e.g.,
sentiment, category).

 What is the relation between data and label?


Data provides the input that the model uses, and the label is the expected outcome that
helps the model learn during training.

 What happens in the training phase?


In the training phase, a model learns patterns from the input data (X) and
corresponding labels (Y) to make predictions on unseen data.

 What do you mean by classification?


Classification is the process of predicting the category or label of new data, based on
patterns learned during training.

 How do you evaluate a model?


Models are evaluated using metrics like accuracy, precision, recall, F1 score, and
confusion matrix to measure how well they predict on test data.

 What is the difference between text representation and model?


 Text representation converts text into a format (e.g., vectors) that models can
understand.
 Model refers to the algorithm that learns patterns from the represented text.

 What is a confusion matrix?


A confusion matrix is a table used to evaluate a model's performance, showing the true
positives, true negatives, false positives, and false negatives.

 What are TP, TN, FP, FN?


 TP : True Positives (correct positive predictions)
 TN : True Negatives (correct negative predictions)
 FP : False Positives (incorrectly predicted positives)
 FN : False Negatives (incorrectly predicted negatives)
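
A minimal end-to-end sketch with scikit-learn tying these ideas together (the tiny labeled dataset is an illustrative assumption, standing in for something like the IMDB reviews):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix

    X_text = ["great movie", "loved it", "terrible film", "awful acting",
              "really great", "truly awful"]
    y = [1, 1, 0, 0, 1, 0]                     # labels: 1 = positive, 0 = negative

    vec = CountVectorizer()
    X = vec.fit_transform(X_text)              # text representation (Bag of Words)

    model = LogisticRegression().fit(X, y)     # training phase: learn from data (X) and labels (y)

    X_test = vec.transform(["great acting", "awful movie"])
    y_pred = model.predict(X_test)             # classification of unseen data
    y_true = [1, 0]

    print(accuracy_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))    # rows: true class, columns: predicted class
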
 What is the real-time application of text classification?
Text classification is used in spam detection, sentiment analysis, email
categorization, chatbot responses, and customer service automation.
