Tsa Ut III Tsa Notes
INTRODUCTION
Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), concerned with building systems that automatically answer questions posed by humans in a natural language.
Question-Answering System
This is a very adaptable design, and it can handle a wide range of questions. Instead of choosing from a list of options for each question, systems must select the best answer from all potential spans in the passage, which means they must deal with a vast number of possibilities. Spans have the extra benefit of being simple to evaluate.
The span-based QA setting is extremely natural: for many user questions sent to search engines, open-domain QA systems can typically discover the right documents that hold the solution. The task is then to find the shortest fragment of text in the passage or document that answers the query, which is the final phase of answer extraction.
Problem Description for Question-Answering System
The purpose is to locate, for any new question, the span of text within the given context that answers it. This is a closed dataset, so the answer to a query is always a part of the context, and the answer is a continuous span of the context. The problem can therefore be divided into two pieces: retrieving the relevant context, and extracting the answer span from it.
INFORMATION RETRIEVAL
Information retrieval (IR) can be defined as a software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return the answers to their questions. Instead, it informs the user of the existence and location of documents that might contain the required information. The documents that satisfy the user’s requirement are called relevant documents. A perfect IR system would retrieve only relevant documents.
The process of information retrieval (IR) can be understood as follows −
A user who needs information formulates a request in the form of a query in natural language. The IR system then responds by retrieving the relevant output, in the form of documents, about the required information.
Classical Problem in Information Retrieval (IR) System
The main goal of IR research is to develop a model for retrieving information from the repositories of documents. Here,
we are going to discuss a classical problem, named ad-hoc retrieval problem, related to the IR system.
In ad-hoc retrieval, the user enters a query in natural language that describes the required information, and the IR system returns the documents related to it. For example, when we search for something on the Internet, the results include some pages that are exactly relevant to our requirement, but there can be some non-relevant pages too. Handling this is the ad-hoc retrieval problem.
Aspects of Ad-hoc Retrieval
The information retrieval model needs to provide the framework for the system to work and define the many aspects of the retrieval procedure of the retrieval engine:
The IR model has to specify how the documents in the collection and the user’s queries are transformed (represented).
The IR model also needs to ingrain the functionality for how the system identifies the relevance of documents based on the query provided by the user.
The system in the information retrieval model also needs to incorporate the logic for ranking the retrieved documents based on their relevance.
Information Retrieval (IR) Model
Mathematical models are used in many scientific areas with the objective of understanding some phenomenon in the real world. A model of information retrieval predicts and explains what a user will find relevant given a query. An IR model is basically a pattern that defines the above-mentioned aspects of the retrieval procedure and consists of the following −
A model for documents.
A model for queries.
Non-Classical IR Model
Examples of non-classical Information Retrieval models include Information Logic models, Situation Theory models, and Interaction models.
Alternative IR Model
It is an enhancement of the classical IR model that makes use of specific techniques from other fields. The cluster model, fuzzy model, and latent semantic indexing (LSI) model are examples of alternative IR models.
Design features of Information retrieval (IR) systems
Let us now learn about the design features of IR systems −
Inverted Index
The primary data structure of most IR systems is the inverted index. We can define an inverted index as a data structure that lists, for every word, all the documents that contain it and the frequency of its occurrences in each document. It makes it easy to search for ‘hits’ of a query word.
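A minimal sketch of this data structure in Python, assuming whitespace tokenization and integer document IDs (both simplifications):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every word to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

docs = ["the cat sat on the mat", "the dog chased the cat"]
index = build_inverted_index(docs)
print(index["cat"])  # {0: 1, 1: 1} -> 'cat' occurs once in each document
print(index["the"])  # {0: 2, 1: 2} -> 'the' occurs twice in each document
```

Looking up a query word is then a single dictionary access, which is what makes finding ‘hits’ cheap.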
Stop Word Elimination
Stop words are high-frequency words that are deemed unlikely to be useful for searching; they carry little semantic weight. All such words are kept in a list called a stop list. For example, articles such as “a”, “an”, “the” and prepositions such as “in”, “of”, “for”, “at” are stop words. The size of the inverted index can be significantly reduced by a stop list. As per Zipf’s law, a stop list covering a few dozen words reduces the size of the inverted index by almost half. On the other hand, eliminating a stop word may sometimes eliminate a term that is useful for searching. For example, if we eliminate the word “A” from “Vitamin A”, the remaining term has no significance.
Stemming
Stemming, the simplified form of morphological analysis, is the heuristic process of extracting the base form of words by
chopping off the ends of words. For example, the words laughing, laughs, laughed would be stemmed to the root word
laugh.
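To make both steps concrete, here is a toy preprocessing sketch in Python; the stop list and the suffix rules are illustrative placeholders, not a real stemmer such as Porter’s:

```python
# Toy stop list and stemmer for illustration only.
STOP_LIST = {"a", "an", "the", "in", "of", "for", "at"}

def crude_stem(word):
    """Heuristically chop common suffixes off a word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_LIST]

print(preprocess("The children laughed at the laughing clown"))
# ['children', 'laugh', 'laugh', 'clown']
```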
Some Important and Useful IR Models
The Boolean Model
It is the oldest information retrieval (IR) model. The model is based on set theory and the Boolean algebra, where
documents are sets of terms and queries are Boolean expressions on terms.
The Boolean model in information retrieval is based on the set theory and boolean algebra. We can pose any query in
the form of a Boolean expression of terms where the terms are logically combined using the Boolean operators AND,
OR, and NOT in the Boolean retrieval model.
Using the Boolean operators, the terms in the query and the concerned documents can be combined to form a
whole new set of documents.
o The Boolean AND of two logical statements x and y means that both x AND y must be satisfied, and it yields a set of documents that is smaller than or equal to either statement’s document set.
o The Boolean OR of the same two statements means that at least one of them must be satisfied, and it fetches a set of documents that is greater than or equal to either statement’s document set.
o Any number of logical statements can be combined using the three Boolean operators.
The queries are designed as Boolean expressions which have precise semantics, and the retrieval strategy is based on a binary decision criterion.
The Boolean model can also be explained well by mapping the terms in the query with a set of documents.
Google, the most famous web search engine in recent times, also ranks its web page result set using a two-stage system: in the first step, a simple Boolean retrieval model returns matching documents in no particular order, and in the next step, ranking is done according to some estimator of relevance.
Aspects of Boolean Information Retrieval Model
Indexing: Indexing is one of the core functionalities of the information retrieval models and the first step in building an
IR system assisting with the efficient retrieval of information.
Indexing is majorly an offline operation that collects data about which words occur in the text corpus, so that at search time we only have to access the pre-compiled index.
The boolean model builds the indices for the terms in the query considering that index terms are present or
absent in a document.
Term-Document Incidence matrix: This is one of the basic mathematical models to represent text data and can be
used to answer Boolean expression queries using the Boolean Retrieval Model. It can be used to answer any query as a
Boolean expression.
It views the document as the set of terms and creates the indexing required for the Boolean retrieval model.
The text data is represented in the form of a matrix where the rows represent the documents, the columns represent the terms to be analyzed and retrieved, and the values of the matrix indicate whether a term is present (1) or absent (0) in a document.
This model has good precision as the documents are retrieved if the condition is matched but, it doesn't scale
well with the size of the corpus, and an inverted index can be used as a good alternative method.
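A minimal sketch of such a matrix, assuming single-word terms and whitespace tokenization; a Boolean AND query becomes a bitwise AND of two term columns:

```python
import numpy as np

docs = {
    "d1": "peanut butter and jelly sandwich",
    "d2": "grape jelly recipe",
    "d3": "peanut butter cookies",
}
terms = sorted({w for text in docs.values() for w in text.split()})

# Rows = documents, columns = terms; 1 if the term occurs, else 0.
incidence = np.array(
    [[1 if t in text.split() else 0 for t in terms] for text in docs.values()]
)

col = {t: i for i, t in enumerate(terms)}
hits = incidence[:, col["peanut"]] & incidence[:, col["jelly"]]
print([d for d, h in zip(docs, hits) if h])  # ['d1']
```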
Processing the data for Boolean retrieval model
We should strip unwanted characters/markup such as HTML tags, punctuation marks, numbers, etc., before breaking the corpus into tokens/keywords on whitespace.
Stemming should be done, and then common stopwords are removed, depending on the application’s needs.
The term-document incidence matrix or inverted index (mapping each keyword to a list of docs containing it) is built.
Then common queries/phrases may be detected using a domain-specific dictionary if needed.
Example of Information Retrieval in Boolean Model
For example, the term Peanut Butter individually (or Jelly individually) defines all the documents that contain the term Peanut Butter (or Jelly) alone and indexes them.
If the information needed is based on Peanut Butter AND Jelly, the query with the keywords Peanut Butter AND Jelly will return the set of documents that contain both words.
Using OR, the search will return documents containing either Peanut Butter, or Jelly, or both.
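With an inverted index, each term maps to the set of documents containing it, and the Boolean operators become plain set operations; the document IDs below are hypothetical:

```python
# Hypothetical posting sets for the two terms.
peanut_butter = {1, 3, 5}
jelly = {2, 3, 5, 7}
all_docs = {1, 2, 3, 4, 5, 6, 7}

print(peanut_butter & jelly)               # AND      -> {3, 5}
print(peanut_butter | jelly)               # OR       -> {1, 2, 3, 5, 7}
print(peanut_butter - jelly)               # AND NOT  -> {1}
print(all_docs - (peanut_butter | jelly))  # NOT (either term) -> {4, 6}
```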
Vector Space Model
Consider the following important points to understand the Vector Space Model −
The index representations (documents) and the queries are considered as vectors embedded in a high-dimensional
Euclidean space.
The similarity measure of a document vector to a query vector is usually the cosine of the angle between them.
Cosine Similarity Measure Formula
Cosine is a normalized dot product, which can be calculated with the help of the following formula −
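Restated explicitly: for a query vector $\vec{q}$ and a document vector $\vec{d}$ over $n$ index terms,

$$\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{\lvert\vec{q}\rvert\,\lvert\vec{d}\rvert} = \frac{\sum_{i=1}^{n} q_i\, d_i}{\sqrt{\sum_{i=1}^{n} q_i^{2}}\,\sqrt{\sum_{i=1}^{n} d_i^{2}}}$$

A cosine of 1 means the query and document vectors point in the same direction, while 0 means they share no weighted terms.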
The top-ranked document in response to the terms car and insurance will be the document d2, because the angle between q and d2 is the smallest. The reason is that both the concepts car and insurance are salient in d2 and hence have high weights. On the other hand, d1 and d3 also mention both terms, but in each case, one of them is not a centrally important term in the document.
Index Creation for Terms in the Vector Space Model
The creation of the indices for the vector space model involves lexical scanning, morphological analysis, and term value
computation.
Lexical scanning identifies the significant terms in the individual documents, morphological analysis reduces different word forms to common stems, and the values of the terms are then computed on the basis of the stemmed words.
The terms of the query are also weighted to take into account their importance, and they are computed by using
the statistical distributions of the terms in the collection and in the documents.
The vector space model assigns a high ranking score to a document that contains only a few of the query terms if
these terms occur infrequently in the collection of the original corpus but frequently in the document.
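A minimal sketch of TF-IDF weighting and cosine ranking, using toy texts that mirror the car/insurance example above (the documents themselves are made up):

```python
import math
from collections import Counter

docs = {
    "d1": "car insurance for a new car",
    "d2": "cheap car insurance insurance quotes",
    "d3": "best car repair shop",
}

def tfidf_vector(text, docs):
    """Weight = term frequency * log(N / document frequency)."""
    tf = Counter(text.split())
    n = len(docs)
    vec = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs.values() if term in d.split())
        vec[term] = freq * math.log(n / df) if df else 0.0
    return vec

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

query = tfidf_vector("car insurance", docs)
doc_vecs = {d: tfidf_vector(text, docs) for d, text in docs.items()}
print(sorted(doc_vecs, key=lambda d: cosine(query, doc_vecs[d]), reverse=True))
# ['d2', 'd1', 'd3'] -> d2 ranks first, as in the discussion above
```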
Assumptions of the Vector Space Model
The more similar a document vector is to a query vector, the more likely it is that the document is relevant to
that query.
The words used to define the dimensions of the space are orthogonal or independent.
The similarity assumption is a reasonable approximation, whereas the assumption that words are pairwise independent does not hold true in realistic scenarios.
Disadvantages of Vector Space Model
Long documents are poorly represented because they have poor similarity values due to a small scalar product
and a large dimensionality of the terms in the model.
Search keywords must be precisely designed to match document terms and the word substrings might result in
a false positive match.
Semantic sensitivity: Documents with similar context but different term vocabulary won't be associated resulting
in false negative matches.
The order in which the terms appear in the document is lost in the vector space representation.
Weighting is intuitive but not represented formally in the model.
Issues with implementation: Due to the need to calculate the similarity metric, and in turn store the values of all vector components, incremental updates of the index are problematic.
o Adding a single new document changes the document frequencies of terms that occur in the document,
which changes the vector lengths of every document that contains one or more of these terms.
Probabilistic Model
Probabilistic models provide the foundation for reasoning under uncertainty in the realm of information retrieval.
Let us understand why there is uncertainty while retrieving documents and the basis for probability models in
information retrieval.
Uncertainty in retrieval models: The probabilistic models in information retrieval are built on the idea that the process
of retrieval is inherently uncertain from multiple standpoints:
There is uncertainty in the understanding of the user’s information needs − we cannot be sure that the user has mapped their needs into the query they have presented.
Even if the query represents the need well, there is uncertainty in the estimation of document relevance for the
query which stems from either the uncertainty from the selection of the document representation or
the uncertainty from matching the query and documents.
Basis of probabilistic retrieval model: Probabilistic model is based on the Probability Ranking Principle which states
that an information retrieval system is supposed to rank the documents based on their probability of relevance to the
query given all the other pieces of evidence available.
Probabilistic information retrieval models estimate how likely it is that a document is relevant for a query.
There may be a variety of sources of evidence that are used by the probabilistic retrieval methods and the most
common one is the statistical distribution of the terms in both the relevant and non-relevant documents.
Probabilistic information models are also among the oldest and best performing and most widely used IR
models.
Types of Probabilistic information retrieval models: The classic probabilistic models (BIM, Two Poisson, BM11,
BM25), The Language models for information retrieval, and the Bayesian networks-based models for information
retrieval.
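As one concrete instance, the BM25 scoring function from the classic family can be sketched as follows; the sample documents are made up, and k1 = 1.5, b = 0.75 are conventional defaults rather than values prescribed in these notes:

```python
import math
from collections import Counter

def bm25_score(query, doc_tokens, all_docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a query."""
    n = len(all_docs_tokens)
    avgdl = sum(len(d) for d in all_docs_tokens) / n
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query.split():
        df = sum(1 for d in all_docs_tokens if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

docs = ["cheap car insurance quotes", "car repair shop", "home insurance rates"]
tokens = [d.split() for d in docs]
for text, toks in zip(docs, tokens):
    print(round(bm25_score("car insurance", toks, tokens), 3), text)
```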
User Query Improvement
The primary goal of any information retrieval system must be accuracy − to produce relevant documents as per the user’s
requirement. However, the question that arises here is how can we improve the output by improving user’s query
formation style. Certainly, the output of any IR system is dependent on the user’s query and a well-formatted query will
produce more accurate results. The user can improve his/her query with the help of relevance feedback, an important
aspect of any IR model.
Relevance Feedback
Relevance feedback takes the output that is initially returned for a given query. This initial output can be used to gather information about whether the output is relevant, which is then used to perform a new query. Feedback can be classified as follows −
Explicit Feedback
It may be defined as the feedback that is obtained from the assessors of relevance. These assessors will also indicate the
relevance of a document retrieved from the query. In order to improve query retrieval performance, the relevance
feedback information needs to be interpolated with the original query.
Assessors or other users of the system may indicate the relevance explicitly by using the following relevance systems −
Binary relevance system − This relevance feedback system indicates that a document is either relevant (1)
or irrelevant (0) for a given query.
Graded relevance system − The graded relevance feedback system indicates the relevance of a document,
for a given query, on the basis of grading by using numbers, letters or descriptions. The description can be
like “not relevant”, “somewhat relevant”, “very relevant” or “relevant”.
Implicit Feedback
It is the feedback that is inferred from user behavior. The behavior includes the duration of time user spent viewing a
document, which document is selected for viewing and which is not, page browsing and scrolling actions, etc. One of the
best examples of implicit feedback is dwell time, which is a measure of how much time a user spends viewing the page
linked to in a search result.
Pseudo Feedback
It is also called Blind feedback. It provides a method for automatic local analysis. The manual part of relevance feedback
is automated with the help of Pseudo relevance feedback so that the user gets improved retrieval performance without an
extended interaction. The main advantage of this feedback system is that it does not require assessors like in an explicit
relevance feedback system.
Consider the following steps to implement this feedback −
Step 1 − First, the results returned by the initial query are taken as relevant; typically the top 10-50 results are assumed relevant.
Step 2 − Now, select the top 20-30 terms from these documents using, for instance, term frequency (tf)-inverse document frequency (idf) weights.
Step 3 − Add these terms to the query, match the returned documents, and finally return the most relevant documents.
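A minimal sketch of these steps, assuming whitespace tokenization and using raw term frequency in place of full tf-idf weighting to keep it short:

```python
from collections import Counter

def pseudo_feedback(query, top_results, n_terms=3):
    """Expand the query with frequent terms from the assumed-relevant results."""
    counts = Counter()
    for doc in top_results:              # blindly assumed relevant
        counts.update(doc.lower().split())
    for term in query.lower().split():   # don't re-add original query terms
        del counts[term]
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return query + " " + " ".join(expansion)

top_docs = ["solar panel installation cost guide",
            "guide to solar panel efficiency and cost"]
print(pseudo_feedback("solar panel", top_docs))
# solar panel cost guide installation
```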
IR-based QA System
Question Processing
The main goal of the question-processing phase is to extract the query: the keywords passed to the IR system to match
potential documents. Some systems additionally extract further information such as:
• answer type: the entity type (person, location, time, etc.) of the answer.
• focus: the string of words in the question that is likely to be replaced by the answer in any answer string found.
• question type: is this a definition question, a math question, or a list question?
For example, for the question “Which US state capital has the largest population?”, query processing might produce the query terms (US, state capital, largest, population), the answer type (CITY), and the focus (state capital).
Query formulation is the task of creating a query − a list of tokens − to send to an information retrieval system to retrieve documents that might contain answer strings. For question answering from the web, we can simply pass the entire question to the web search engine, at most perhaps leaving out the question word (where, when, etc.). When answering questions from smaller collections of documents, such as corporate information sites or Wikipedia, it is common practice to employ an information retrieval (IR) engine to index and search these articles, typically using the standard TF-IDF cosine matching technique. Query expansion is a necessary step in information retrieval, as the diverse nature of web content often leads to several variations of an answer to a given question. While the likelihood of finding a matching response to a question is higher on the web due to its vastness, smaller document sets may have just a single occurrence of the desired answer. Query expansion methods add query keywords with the aim of improving the likelihood of finding a relevant answer, for example by including morphological variations of the content words in the question or synonyms obtained from a dictionary.
For example, the question “When was the laser invented?” might be reformulated as “the laser was invented”; the question “Where is the Valley of the Kings?” as “the Valley of the Kings is located in”.
Question & Answer Type Detection: Some systems make use of question classification, the task of finding the answer type, the named-entity category of the answer. A question like “Who founded Virgin Airlines?” expects an answer of type PERSON. A question like “What Canadian city has the largest population?” expects an answer of type CITY. If we know that the answer type for a question is a person, we can avoid examining every sentence in the document collection, instead focusing on sentences mentioning people.
We can also use a larger hierarchical set of answer types called an answer type taxonomy. Such taxonomies can be built automatically from resources like WordNet, or they can be designed by hand. In this hierarchical tagset, each question can be labeled with a coarse-grained tag like HUMAN or a fine-grained tag like HUMAN:DESCRIPTION, HUMAN:GROUP, HUMAN:IND, and so on.
The HUMAN:DESCRIPTION type is often called a BIOGRAPHY question because the answer is required to give a brief biography of the person rather than just a name. Question classifiers can be built by hand-writing rules like the following rule for detecting the answer type BIOGRAPHY: who (is / was / are / were) PERSON.
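A sketch of such hand-written rules as a small pattern matcher; apart from the BIOGRAPHY rule quoted above, the patterns and type labels are illustrative assumptions:

```python
import re

RULES = [
    (r"^[Ww]ho (is|was|are|were) [A-Z]", "BIOGRAPHY"),  # who was PERSON
    (r"(?i)^who\b", "PERSON"),
    (r"(?i)^where\b", "LOCATION"),
    (r"(?i)^(when|what year)\b", "TIME"),
    (r"(?i)^how (far|tall|long)\b", "DISTANCE-QUANTITY"),
]

def answer_type(question):
    for pattern, label in RULES:         # first matching rule wins
        if re.search(pattern, question.strip()):
            return label
    return "UNKNOWN"

print(answer_type("Who was Marie Curie?"))               # BIOGRAPHY
print(answer_type("Who founded Virgin Airlines?"))       # PERSON
print(answer_type("Where is the Valley of the Kings?"))  # LOCATION
```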
A simple baseline answer-extraction algorithm is to run a named entity tagger on the candidate passage and return whatever span in the passage has the correct answer type. For instance, for HUMAN or DISTANCE-QUANTITY questions, the named entities of those types found in the passages would be extracted as the answers.
Unfortunately, the answers to many questions, such as DEFINITION questions, don’t tend to be of a particular named entity type. For this reason, modern work on answer extraction uses more sophisticated algorithms, generally based on supervised learning.
Feature-based Answer Extraction
Supervised learning approaches to answer extraction train classifiers to decide if a span or a sentence contains an answer. One obviously useful feature is the answer type feature of the above baseline algorithm. Other features in such classifiers include:
Answer type match: True if the candidate answer contains a phrase with the correct answer type.
Pattern match: The identity of a pattern that matches the candidate answer.
Number of matched question keywords: How many question keywords are contained in the candidate answer.
Keyword distance: The distance between the candidate answer and query keywords.
Novelty factor: True if at least one word in the candidate answer is novel, that is, not in the query.
Apposition features: True if the candidate answer is an appositive to a phrase containing many question terms. Can be approximated by the number of question terms separated from the candidate answer through at most three words and one comma.
Punctuation location: True if the candidate answer is immediately followed by a comma, period, quotation
marks, semicolon, or exclamation mark.
Sequences of question terms: The length of the longest sequence of question terms that occurs in the candidate
answer.
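A toy sketch computing a few of these features for one candidate answer; the tokenization and the punctuation check are deliberately crude:

```python
def extract_features(question, candidate, sentence):
    """Features for deciding whether `candidate` answers `question`."""
    q_words = set(question.lower().replace("?", "").split())
    c_words = set(candidate.lower().split())
    s_words = sentence.lower().split()
    after = sentence.split(candidate, 1)[-1]
    return {
        "num_matched_keywords": len(q_words & set(s_words)),
        "novelty_factor": any(w not in q_words for w in c_words),
        "followed_by_punct": after[:1] in {",", ".", ";", "!", "\""},
    }

q = "Who founded Virgin Airlines?"
sent = "Virgin Airlines was founded by Richard Branson, a British entrepreneur."
print(extract_features(q, "Richard Branson", sent))
# {'num_matched_keywords': 3, 'novelty_factor': True, 'followed_by_punct': True}
```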
KNOWLEDGE-BASED QUESTION ANSWERING
Knowledge based question answering (KBQA) is a complex task for natural language understanding. KBQA is the task of
finding answers to questions by processing a structured knowledge base.Like the textbasedparadigm for question
answering, this approach dates back to the earliest daysof natural language processing, with systems like BASEBALL that
answered questions from a structured database of baseball games and stats.Systems for mapping from a text string to any
logical form are called semantic parsers. Semantic parsers for question answering usually map either to someversion of
predicate calculus or a query language like SQL or SPARQL.
A knowledge base (KB) is a structured database that contains a collection of facts in the form <subject, relation, object>, where each fact can have attached properties called qualifiers.
For example, the sentence “Barack Obama got married to Michelle Obama on 3 October 1992 at Trinity United Church” can be represented by the tuple <Barack Obama, Spouse, Michelle Obama>, with the qualifiers start time = 3 October 1992 and place of marriage = Trinity United Church.
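Such a fact maps naturally onto a small data structure; a minimal Python sketch of the example above:

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    """One KB fact <subject, relation, object> with optional qualifiers."""
    subject: str
    relation: str
    object: str
    qualifiers: dict = field(default_factory=dict)

fact = Fact(
    subject="Barack Obama",
    relation="Spouse",
    object="Michelle Obama",
    qualifiers={"start time": "3 October 1992",
                "place of marriage": "Trinity United Church"},
)
print(f"<{fact.subject}, {fact.relation}, {fact.object}> {fact.qualifiers}")
```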
Recently, attention shifted to answering complex questions. Generally, complex questions involve multi-hop reasoning
over the KB, constrained relations, numerical operations, or some combination of the above.
Let’s see an example of complex KBQA with the question “Who is the first wife of the TV producer that was nominated
for The Jeff Probst Show?”. This question requires:
Constrained relations: We are looking for the TV producer that was nominated for The Jeff Probst Show; thus we are looking for an entity with a nominee link to The Jeff Probst Show that is also a TV producer.
Multi-hop reasoning: Once we find the TV producer, we need to find his wives.
Numerical operations: Once we find the TV producer's wives, we are looking for the first wife, thus we need to
compare numbers and generate a ranking.
Figure: An example of complex KBQA for the question “Who is the first wife of the TV producer that was nominated for The Jeff Probst Show?”, with the multi-hop reasoning, constrained relations, and numerical operation highlighted.
KBQA approaches
There are two mainstream approaches to complex KBQA. Both start by recognizing the subject of the question and linking it to an entity in the KB, which is called the topic entity. Then, they derive the answers within the KB neighborhood of the topic entity:
By executing a parsed logic form, typical of semantic parsing-based methods (SP-based methods). It follows
a parse-then-execute paradigm.
By reasoning in a question-specific graph extracted from the KB and ranking all the entities in the extracted graph
based on their relevance to the question, typical of information retrieval-based methods (IR-based methods). It
follows a retrieval-and-rank paradigm.
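A toy sketch of the retrieval side of the Jeff Probst example: a KB stored as adjacency lists, the constrained relation, and the multi-hop step (entity and relation names are illustrative, and the final numerical ranking step is omitted):

```python
# Toy KB: entity -> list of (relation, object) edges.
KB = {
    "The Jeff Probst Show": [("nominee", "Jeff Probst")],
    "Jeff Probst": [("occupation", "TV producer"),
                    ("spouse", "Wife A"),
                    ("spouse", "Wife B")],
}

def neighbors(entity, relation):
    """One hop from `entity` along `relation`."""
    return [obj for rel, obj in KB.get(entity, []) if rel == relation]

# Constrained relation: nominees of the show who are TV producers.
producers = [e for e in neighbors("The Jeff Probst Show", "nominee")
             if "TV producer" in neighbors(e, "occupation")]
# Multi-hop: from each producer to their spouses.
wives = [w for p in producers for w in neighbors(p, "spouse")]
print(wives)  # ['Wife A', 'Wife B']; picking the *first* wife needs dates
```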
LANGUAGE MODELS
Language-modeling algorithms are responsible for learning the rules of context in natural language. The models are prepared to predict words by learning the features and characteristics of a language. With this learning, the model prepares itself to understand phrases and predict the next words in sentences.
For training a language model, a number of probabilistic approaches are used. These approaches vary on the basis
of the purpose for which a language model is created. The amount of text data to be analyzed and the math
applied for analysis makes a difference in the approach followed for creating and training a language model.
For example, a language model used for predicting the next word in a search query will be absolutely different
from those used in predicting the next word in a long document (such as Google Docs). The approach followed to
train the model would be unique in both cases.
Language is significantly complex and keeps evolving. Therefore, the more capable the language model is, the better it will be at performing NLP tasks. Compared to the n-gram model, an exponential or continuous-space model proves to be a better option for NLP tasks because it is designed to handle ambiguity and language variation. Meanwhile, language models should be able to manage dependencies; for example, a model should be able to understand words derived from different languages.
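To make the contrast with continuous-space models concrete, a bigram (n-gram with n = 2) model can be estimated from raw counts; the corpus here is a toy assumption:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_probs(word):
    """P(next | word), estimated by relative frequency."""
    counts = bigrams[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(next_word_probs("sat"))  # {'on': 1.0}
```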
Language models like GPT-3 can be used to build powerful question answering systems. These systems take a question in natural language as input and generate a relevant and coherent answer, typically by presenting the question (and any supporting context) to the model as a text prompt.
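A sketch of the prompting approach; `generate` below stands in for whatever text-generation call a GPT-style model exposes and is purely hypothetical:

```python
def build_qa_prompt(question, context=None):
    """Frame a question (optionally with supporting context) as a prompt."""
    if context:
        return f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"

prompt = build_qa_prompt("Which US state capital has the largest population?")
# answer = generate(prompt)   # hypothetical call to a GPT-style model
print(prompt)
```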
GPT is a class of language models developed by OpenAI. It's based on the Transformer architecture, which is designed to
process sequences of data, making it particularly well-suited for natural language understanding and generation tasks.
GPT models are pre-trained on a vast amount of text data from the internet, which allows them to learn grammar, syntax,
semantics, and other language patterns.
QA (Question Answering):
Question answering is a task in natural language processing where a machine is given a question in natural language and
is expected to provide a relevant and accurate answer. QA models typically analyze the question and a given context (such
as a passage of text) to generate an answer that addresses the question.
BERT
BERT, which stands for "Bidirectional Encoder Representations from Transformers," is a groundbreaking natural
language processing (NLP) model introduced by researchers at Google in 2018. BERT is designed to understand and
represent the context of words in a sentence by considering both the left and right context, unlike previous models that
only looked at the left or right context.
[…] on their relationships within a sentence. The outputs from all layers are combined to create contextualized word representations.
7. Contextualized Embeddings: BERT produces contextualized word embeddings, which means the embeddings
are different for the same word depending on its context in a sentence. This enables BERT to capture nuances and
polysemy (multiple meanings) in language.
8. Applications: BERT's bidirectional nature and contextual embeddings make it highly effective for a wide range
of NLP tasks, including question answering, sentiment analysis, text classification, text generation, and more. By
fine-tuning BERT on specific tasks, it can achieve state-of-the-art performance on various benchmarks.
BERT has significantly advanced the field of NLP and has paved the way for many subsequent models and research
efforts. Its ability to capture bidirectional context has led to improved language understanding and generation capabilities
in a variety of applications.
BERT ARCHITECTURE
The BERT (Bidirectional Encoder Representations from Transformers) architecture is based on the Transformer
architecture. BERT builds upon this architecture with specific modifications to enable bidirectional context modeling.
Here's a detailed overview of the BERT architecture:
1. Input Encoding and Tokenization:
- BERT takes variable-length text input, which is tokenized into subword units using WordPiece tokenization.
- Special tokens are added to mark the start and end of sentences, as well as to distinguish between different sentences in a pair.
2. Embedding Layer:
- Each token is associated with an embedding vector that combines a word embedding, a positional embedding (to capture token position), and a segment embedding (to distinguish between sentence pairs).
3. Transformer Encoder Stack:
- BERT consists of multiple identical layers, each containing a self-attention mechanism and feedforward neural networks.
- The layers process the token embeddings in sequence.
4. Self-Attention Mechanism:
- Self-attention allows each token to consider the other tokens in the input sequence while calculating its representation.
- BERT uses multi-head self-attention, where the model learns multiple sets of attention weights to capture different types of relationships between words.
5. Position-wise Feedforward Networks:
- After self-attention, each token's representation passes through a position-wise feedforward neural network, which includes two fully connected layers.
- The feedforward network introduces non-linearity and further contextualizes token representations.
6. Layer Normalization and Residual Connections:
- Layer normalization is applied after each sub-layer (self-attention and feedforward) to stabilize training.
- Residual connections (skip connections) are used to ensure that original token embeddings are preserved and facilitate
gradient flow during training.
7. Output Pooling:
- For certain tasks (e.g., sentence classification), BERT employs a pooling layer to aggregate token representations into a fixed-size representation for the entire sequence.
- Common pooling strategies include max-pooling and mean-pooling.
8. Task-Specific Heads:
- BERT can be fine-tuned for various NLP tasks by adding task-specific layers on top of the BERT encoder.
- For example, for text classification tasks, a linear layer and softmax activation can be added to predict class labels.
The key innovation of BERT is its bidirectional approach, which allows it to capture contextual information from both the left and right contexts of a token. This contrasts with traditional models that only consider either the left or the right context. The bidirectional encoding enables BERT to better understand language nuances, relationships between words, and the broader context within sentences. BERT's architecture has served as a foundation for subsequent advancements in NLP, and its pre-trained representations have proven highly effective for a wide range of downstream tasks through fine-tuning. BERT reads the whole input text sequence at once, unlike directional models which read from one direction, such as left to right or right to left.
BERT can better understand longer, more conversational queries and, as a result, surface more appropriate results. BERT models are applied to both organic search results and featured snippets. While you can optimize for those queries, you cannot “optimize for BERT.”
To simplify: BERT helps the search engine understand the significance of small connecting words like ‘to’ and ‘for’ in the keywords used.
The two pieces of text are separated by the special [SEP] token.
BERT uses “Segment Embeddings” to differentiate the question from the reference text. These are simply two embeddings
(for segments “A” and “B”) that BERT learned, and which it adds to the token embeddings before feeding them into the
input layer.
Start & End Token Classifiers
In BERT-based extractive QA, two classifiers operate on the final token representations: one scores each token as the start of the answer span and the other scores each token as the end, and the highest-scoring valid (start, end) pair is returned as the answer.
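A minimal usage sketch with the Hugging Face transformers library; distilbert-base-cased-distilled-squad is one publicly available extractive-QA checkpoint, and the context passage is made up:

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(question="Who founded Virgin Airlines?",
            context="Virgin Atlantic, informally Virgin Airlines, was "
                    "founded by Richard Branson in 1984.")
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': 'Richard Branson'}
```

Under the hood, the start and end classifiers score every token, and the span with the best combined start/end score is returned as the answer.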
T5 (Text-to-Text Transfer Transformer)
1. Text-to-Text Framework:
T5 introduces a unified framework where all NLP tasks are cast as a text generation task. This means that
both the input and output are treated as text sequences, which enables T5 to handle tasks like
classification, translation, summarization, question answering, and more.
The input text includes a prefix indicating the specific task, and the model learns to generate the
appropriate output text.
2. Transformer Architecture:
T5 is built upon the Transformer architecture, which includes self-attention mechanisms and feedforward
neural networks.
The architecture allows T5 to capture contextual relationships between words and generate coherent and
contextually relevant output text.
3. Pre-training:
T5 undergoes a pre-training phase where it is trained on a large corpus of text data using a denoising
autoencoder objective. It learns to reconstruct masked-out tokens in corrupted sentences.
4. Fine-tuning:
During fine-tuning, the model learns to generate the appropriate output for each task while conditioning
on the provided input.
5. Task-Specific Prompts:
For each task, T5 is provided with a specific prompt that guides it to generate the desired output text.
6. Versatility:
T5's text-to-text framework makes it highly versatile. It can be fine-tuned for a wide range of tasks,
including text classification, translation, summarization, question answering, sentiment analysis, and
more.
By using a consistent text generation approach across tasks, T5 simplifies the process of adapting the
model to new tasks.
It has demonstrated strong performance even when fine-tuned on tasks for which it was not explicitly
trained, showcasing its ability to generalize across tasks.
T5's innovative text-to-text approach has demonstrated the potential for a unified framework that can handle diverse NLP
tasks. It offers a streamlined way to apply a single model to various tasks by framing them as text generation problems.
Text-to-Text Framework:
T5 uses the same model for all tasks; we tell the model which task to perform by prepending a task prefix, which is itself text.
For example, to use T5 for the classification task of predicting whether a sentence is grammatically correct, adding the prefix “cola sentence: ” takes care of it, and the model returns one of two texts as output: ‘acceptable’ or ‘not acceptable’.
Interestingly, T5 also performs the two-sentence similarity regression task in the text-to-text framework. It was posed as a classification problem with 21 classes (from 1-5 in 0.2 increments, e.g. ‘1.0’, ‘1.2’, ‘1.4’, …, ‘5.0’), the model was asked to predict a string, and T5 gave SOTA results on this task too.
Similarly, for other tasks: the prefix ‘summarize:’ returns a summary of the article, and for NMT the prefix is ‘translate English to German:’.
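A usage sketch of these task prefixes with the Hugging Face transformers library; t5-small is one publicly available checkpoint:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

for prompt in ["translate English to German: That is good.",
               "cola sentence: The course is jumping well."]:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```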
T5 Pretraining and Finetuning:
Q) What’s new in T5?
Ans) Nothing.
Yes, it’s true: T5 uses the vanilla Transformer architecture. Then how did they get SOTA results? The main motivation behind the T5 work is:
Given the current landscape of transfer learning for NLP, what works best, and how far can we push the tools we have? — Colin Raffel
T5-Base, with BERT-Base-sized encoder and decoder stacks and 220 million parameters, was used to experiment with a wide variety of NLP techniques during pretraining and fine-tuning.
Summing up the best outcomes from the T5 experiments:
1. Used a large dataset for pre-training: An important ingredient for transfer learning is the unlabeled dataset used for pre-training. T5 uses text extracted from Common Crawl (the C4 corpus), which results in about 750 GB of data after cleaning and deduplication. The cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content.
2. Architectures: They experimented with encoder-decoder models and decoder-only language models similar to GPT, and found that encoder-decoder models did well.
3. Unsupervised objectives: T5 uses masked language modeling (MLM) as its pretraining objective, which worked best; they also experimented with permutation language modeling, which XLNet uses as its unsupervised objective.
Finally,
Insights + Scale = State-of-the-Art
T5 further explores scaling the models up, with d_model = 1024, a 24-layer encoder and decoder, and d_kv = 128. The T5-3B variant uses d_ff = 16,384 and 32-headed attention, which results in around 2.8 billion parameters, while T5-11B has d_ff = 65,536 and 128-headed attention, producing a model with about 11 billion parameters.
The largest T5 model had 11 billion parameters and achieved SOTA on the GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks. One particularly exciting result was that T5 achieved a near-human score on the SuperGLUE natural language understanding benchmark, which was specifically designed to be difficult for machine learning models but easy for humans.
1.1. Unified Input & Output Format
T5 means “Text-to-Text Transfer Transformer”: Every task considered — including translation,
question answering, and classification — is cast as feeding the T5 model text as input and training it
to generate some target text.
Translation: Ask the model to translate the sentence “That is good.” from English to German, the
model would be fed the sequence “translate English to German: That is good.” and would be
trained to output “Das ist gut.”
Text classification: The model simply predicts a single word corresponding to the target label. For
example, on the MNLI benchmark, the goal is to predict whether a premise implies
(“entailment”), contradicts (“contradiction”), or neither (“neutral”) a hypothesis.
The input sequence becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.” The only possible labels are “entailment”, “neutral”, or “contradiction”; any other output is treated as a wrong prediction.
Regression: In STS-B, the goal is to predict a similarity score between 1 and 5. Increments of 0.2
are used as text prediction.
(The paper gives the details of how the format is unified for each task; please feel free to read it directly if interested.)
1.2. Encoder-Decoder Transformer Model
T5 uses an encoder-decoder Transformer implementation which closely follows the original Transformer, with the following exceptions:
A simplified layer normalization is used, in which the activations are only rescaled and no additive bias is applied. After layer normalization, a residual skip connection, originating from ResNet, adds each subcomponent’s input to its output.
Also, instead of using a fixed embedding for each position, relative position embeddings (Shaw
NAACL’18) produce a different learned embedding according to the offset (distance) between the
“key” and “query” being compared in the self-attention mechanism.
1.3. Training
A combination of model and data parallelism is used to train models on “slices” of Cloud TPU Pods. TPU Pods are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect, with supporting CPU host machines.
CHATBOTS
Chatbots are a relatively recent concept, despite the huge number of programs and NLP tools that support them. A natural language processing chatbot is a software program that can understand and respond to human speech. Bots powered by NLP allow people to communicate with computers in a way that feels natural and human-like, mimicking person-to-person conversations. These clever chatbots have a wide range of applications in the customer support sphere.
NLP chatbots: The first generation of virtual agents
NLP-powered virtual agents are bots that rely on intent systems and pre-built dialogue flows, with different pathways depending on the details a user provides, to resolve customer issues. A chatbot using NLP will keep track of information throughout the conversation and learn as it goes, becoming more accurate over time. Here are some of the most important elements of an NLP chatbot.
Key elements of NLP-powered bots
Dialogue management: This tracks the state of the conversation. The core components of dialogue management in AI chatbots include a context (saving and sharing data exchanged in the conversation) and a session (one conversation from start to finish); see the sketch after this list.
Human handoff: This refers to the seamless communication and execution of a handoff from the AI
chatbot to a human agent
Business logic integration: It’s important that your chatbot has been programmed with your company’s
unique business logic
Rapid iteration: You want your bot to provide a seamless experience for customers and to be easily
programmable. Rapid iteration refers to the fastest route to the right solution
Training and iteration: To ensure your NLP-powered chatbot doesn’t go awry, it’s necessary to
systematically train and send feedback to improve its understanding of customer intents using real-world
conversation data being generated across channels
Simplicity: To get the most out of your virtual agent, you’ll want it to be set up as simply as possible,
with all the functionality that you need — but no more than that. There is, of course, always the potential
to upgrade or add new features as you need later on
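A minimal sketch of the context and session ideas in code; the slot-filling logic is a made-up toy, not a production dialogue manager:

```python
class Session:
    """One conversation from start to finish, with a shared context."""
    def __init__(self):
        self.context = {}   # data saved and shared across turns
        self.turns = []

    def handle(self, user_text):
        self.turns.append(user_text)
        for token in user_text.split():   # toy slot filling: remember numbers
            if token.isdigit():
                self.context["order_number"] = token
        if "status" in user_text.lower():
            order = self.context.get("order_number")
            if order:
                return f"Checking the status of order {order}."
            return "Which order number is it?"
        return "How can I help you?"

s = Session()
print(s.handle("I have a problem with order 4521"))  # How can I help you?
print(s.handle("What's the status?"))  # Checking the status of order 4521.
```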
Benefits of bots
Bots allow you to communicate with your customers in a new way. Customers’ interests can be piqued
at the right time by using chatbots.
With the help of chatbots, your organization can better understand consumers’ problems and take steps
to address those issues.
A single operator can serve one customer at a time. On the other hand, a chatbot can answer thousands
of inquiries.
Chatbots are unique in that they operate inside predetermined frameworks and rely on a single source of
truth within the command catalog to respond to questions they are asked, which reduces the risk of
confusion and inconsistency in answers.
Types of Chatbots
From everyday experience, we can see that there are two types of chatbots around us: script bots and smart bots. They compare as follows:
Script bots:
1. Script bots are easy to make.
2. Script bots work around a script that is programmed into them.
3. Mostly they are free and easy to integrate into a messaging platform.
4. They have no or little language processing skills.
5. Limited functionality.
6. Example: the bots deployed in the customer care sections of various companies.

Smart bots:
1. Smart bots are flexible and powerful.
2. Smart bots work on bigger databases and other resources directly.
3. Smart bots learn with more data.
4. Coding is required to take this up on board.
5. Wide functionality.
6. Example: Google Assistant, Alexa, Cortana, Siri, etc.
DIALOG SYSTEMS