NLP - Viva - Que & Ans
What is NLP?
NLP (Natural Language Processing) enables computers to understand, interpret, and
generate human language.
Real-time Examples:
- NLU (Natural Language Understanding): Siri understanding a spoken question.
- NLG (Natural Language Generation): a chatbot generating a response.
What is Context?
Context refers to the surrounding text or situation that helps clarify meaning.
Steps in NLP:
1. Tokenization
2. POS Tagging
3. Lemmatization/Stemming
4. Named Entity Recognition
5. Parsing
6. Sentiment Analysis
7. Machine Translation
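A minimal sketch of steps 1, 2 and 4 using NLTK (the sample sentence is invented; one-time
nltk.download calls may be needed for the tokenizer, tagger and NE-chunker data):

    import nltk

    # One-time data downloads may be needed, e.g.:
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    # nltk.download('maxent_ne_chunker'); nltk.download('words')

    text = "Apple is opening a new office in Mumbai."

    tokens = nltk.word_tokenize(text)   # step 1: ['Apple', 'is', 'opening', ...]
    tags = nltk.pos_tag(tokens)         # step 2: [('Apple', 'NNP'), ('is', 'VBZ'), ...]
    tree = nltk.ne_chunk(tags)          # step 4: groups tokens into named-entity chunks
    print(tokens, tags, tree, sep="\n")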
What is Semantics?
Semantics is the study of meaning in language.
What is Syntax?
Syntax is the structure of sentences and the grammatical rules governing word order.
What is Discourse?
Discourse is how sentences relate to form meaningful paragraphs or conversations.
What is Pragmatics?
Pragmatics deals with language use in context, interpreting beyond literal meanings.
What is Grammar?
Grammar is the set of rules that governs sentence structure in a language.
Lexical Unit (Token):
A lexical unit, or token, is a basic unit of meaning in a language, such as:
- Keywords (e.g., if, while)
- Identifiers (e.g., variable names)
- Literals (e.g., numbers, strings)
- Operators (e.g., +, -, *)
- Symbols (e.g., parentheses, brackets)
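A toy illustration in Python (a hypothetical regex lexer, just to show how raw text is split
into classified tokens):

    import re

    # Token classes of a tiny made-up expression language, tried in order.
    TOKEN_SPEC = [
        ("KEYWORD",    r"\b(?:if|while|else)\b"),
        ("NUMBER",     r"\d+(?:\.\d+)?"),
        ("IDENTIFIER", r"[A-Za-z_]\w*"),
        ("OPERATOR",   r"[+\-*/=<>]"),
        ("SYMBOL",     r"[()\[\]{};,]"),
        ("SKIP",       r"\s+"),
    ]
    PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

    def tokenize(code):
        for match in PATTERN.finditer(code):
            if match.lastgroup != "SKIP":          # drop whitespace
                yield (match.lastgroup, match.group())

    print(list(tokenize("if (count > 10) total = total + 1;")))
    # [('KEYWORD', 'if'), ('SYMBOL', '('), ('IDENTIFIER', 'count'), ('OPERATOR', '>'), ('NUMBER', '10'), ...]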
Levenshtein Distance:
Levenshtein distance, also known as edit distance, measures the minimum number of
operations (insertions, deletions, substitutions) required to transform one string into
another.
Example:
- "kitten" → "sitting" (Levenshtein distance = 3)
- Substitute "k" with "s"
- Substitute "e" with "i"
- Insert "g" at the end
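A minimal dynamic-programming implementation sketch in Python:

    def levenshtein(a, b):
        # dp[i][j] = edit distance between a[:i] and b[:j]
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dp[i][0] = i                              # delete all of a[:i]
        for j in range(len(b) + 1):
            dp[0][j] = j                              # insert all of b[:j]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution (or match)
        return dp[len(a)][len(b)]

    print(levenshtein("kitten", "sitting"))  # 3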
Syntax analysis
What is checked in syntax analysis?
Syntax analysis checks if the sequence of tokens (words) generated from the lexical
analysis forms a valid structure as per the grammar of the language. It ensures that the
source code follows the language's rules, like matching brackets or correct order of
operators.
What are parse trees, and how many types are there?
- A parse tree is a tree structure that shows how a string (source code) is derived from
a grammar by breaking it down into terminals and non-terminals.
- Two common kinds are distinguished by their derivation order: a leftmost derivation
expands the leftmost non-terminal first at each step, while a rightmost derivation expands
the rightmost non-terminal first.
How can you tell that a string in a language can be derived using multiple parse
trees?
- If the grammar allows more than one derivation of the same string, generate all possible
parse trees for that string. A grammar that yields two or more distinct parse trees for the
same string is ambiguous.
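A small sketch using NLTK's chart parser on a deliberately ambiguous toy grammar (the
classic prepositional-phrase attachment case): if the parser returns more than one tree for
the same sentence, the grammar is ambiguous.

    import nltk

    # Toy grammar in which a PP can attach either to the verb or to the noun phrase.
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> 'I' | Det N | Det N PP
        VP -> V NP | V NP PP
        PP -> P NP
        Det -> 'the'
        N  -> 'man' | 'telescope'
        V  -> 'saw'
        P  -> 'with'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "I saw the man with the telescope".split()
    for tree in parser.parse(sentence):
        print(tree)   # two distinct parse trees are printed -> the grammar is ambiguous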
Language modelling
What is language modeling?
Language modeling is the task of predicting the next word or sequence of words in a
sentence based on the previous words. It helps in various NLP tasks like speech
recognition, translation, and text generation.
How do you know whether your LM is good? Which metric is used to evaluate a language
model?
A good language model predicts text well. Perplexity is a common metric used to
evaluate LMs. Lower perplexity means the model is better at predicting the next
word.
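An illustrative sketch of perplexity for a toy bigram model with add-one smoothing (the
corpus and test sentences are invented for the example):

    import math
    from collections import Counter

    corpus = ["the cat sat", "the dog sat", "the cat ran"]
    sents = [["<s>"] + s.split() + ["</s>"] for s in corpus]

    unigrams = Counter(w for sent in sents for w in sent)
    bigrams = Counter(pair for sent in sents for pair in zip(sent, sent[1:]))
    V = len(unigrams)                                  # vocabulary size

    def prob(prev, word):
        # P(word | prev) with Laplace (add-one) smoothing
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

    def perplexity(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        log_p = sum(math.log(prob(a, b)) for a, b in zip(words, words[1:]))
        return math.exp(-log_p / (len(words) - 1))

    print(perplexity("the cat sat"))   # ~3.3  (seen bigrams -> lower perplexity)
    print(perplexity("the dog ran"))   # ~4.5  (unseen bigram -> higher perplexity)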
How many POS tags are currently in use for the English language?
The number of POS tags depends on the tagging system used. Common tag sets like
the Penn Treebank use around 36 tags, while more detailed systems may have more.
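For reference, NLTK can print the full Penn Treebank tagset with descriptions (a one-time
nltk.download('tagsets') may be needed):

    import nltk

    # Lists each Penn Treebank tag (DT, JJ, NN, VBZ, ...) with a short description and examples.
    nltk.help.upenn_tagset()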
Text representations
Why do we need text representations?
Text representations convert words or documents into numerical formats so that
machine learning models can process and analyze them.
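A small bag-of-words sketch with scikit-learn's CountVectorizer, turning toy documents into
count vectors:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog sat on the log"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)            # sparse document-term matrix

    print(vectorizer.get_feature_names_out())     # learned vocabulary
    print(X.toarray())                            # one count vector per document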
Topic Modeling
What do we want to achieve via topic modeling?
We aim to discover hidden themes or topics within a large collection of documents,
helping us organize and understand the data better.
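A minimal LDA sketch using scikit-learn (the toy documents and the choice of two topics are
only for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the stock market and shares fell today",
        "investors sold shares as the market dropped",
        "the team won the football match",
        "the match ended with a late goal for the team",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    words = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(lda.components_):
        top = [words[i] for i in topic.argsort()[-4:]]   # 4 highest-weighted words per topic
        print(f"Topic {idx}: {top}")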
What is stemming?
Stemming reduces words to their root form by cutting off prefixes and suffixes, e.g.,
"running" becomes "run."
What is lemmatization?
Lemmatization reduces words to their base or dictionary form (lemma), taking the word's
part of speech and meaning into account, e.g., "running" becomes "run" and "studies"
becomes "study" (a valid dictionary word, unlike the stem "studi").
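A quick illustration with NLTK's WordNetLemmatizer (a one-time nltk.download('wordnet') may
be needed; supplying the part of speech matters):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("running", pos="v"))   # 'run'   (treated as a verb)
    print(lemmatizer.lemmatize("studies", pos="n"))   # 'study' (a real dictionary form)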
What is spaCy?
spaCy is an advanced Python library for Natural Language Processing (NLP), providing
tools for tokenization, part-of-speech tagging, and more.
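A minimal usage sketch (assumes the small English model was installed with:
python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is opening a new office in Mumbai.")

    for token in doc:
        print(token.text, token.pos_, token.lemma_)        # token, POS tag, lemma

    print([(ent.text, ent.label_) for ent in doc.ents])    # named entities found by the model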