Chat Bot: Text Similarity Distances
Introduction to chat bot techniques
A chat bot typically relies on two main techniques to understand user input: pattern matching and intent classification.
Types of response:
Static response: The simplest approach is to have a static response, possibly with a list of variants, for each user input. These static responses can be templates, such as “John is located in <Location details>”, where <Location details> is a variable computed by the chat bot.
Dynamic response: A different approach is to use resources, such as a knowledge base, to get a list of potential responses and then score them to choose the best one. This is particularly appropriate if the chat bot acts mainly as a question-answering system.
Generated response: If you have a huge corpus of example conversations, you can use a deep learning technique (e.g. a recurrent neural network) to train a generative model that, given an input, generates the answer. You will need millions of examples to reach decent quality, and the results are sometimes unexpected, but it can be interesting.
Introduction to text similarity
1. Words can be similar in two ways: lexically and semantically.
2. Words are similar lexically if they have a similar character sequence.
3. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, or are used in the same context.
4. Lexical similarity is measured with string-based algorithms, while semantic similarity is measured with corpus-based and knowledge-based algorithms.
5. String-based measures operate on string sequences and character composition.
Text similarity approaches
1. Measuring the similarity between words, sentences, paragraphs and documents is an
important component in various tasks such as information retrieval, document clustering,
word-sense disambiguation, automatic essay scoring, short answer grading, machine
translation, topic tracking, question generation, question answering and text summarization.
2. Text similarity is partitioned into three approaches:
- String-based similarity
- Corpus-based similarity
- Knowledge-based similarity
String-based similarity measures
- Character-based: Jaro, Jaro-Winkler, LCS, Levenshtein, Smith-Waterman, Needleman-Wunsch, N-gram
- Term-based: Dice’s coefficient, Block distance, Cosine similarity, Euclidean distance, Matching coefficient, Jaccard similarity, Overlap coefficient
Term Based Example 1: Cosine similarity
Term-based distances:
Let us consider two example sentences to calculate term-based text similarity:
- Jenny loves burger more than Linda loves pizza
- Jane likes pizza more than Jenny loves burger
Cosine similarity:
Words = [burger, jenny, loves, linda, than, more, likes, jane, pizza]
Counting how often each word occurs in each sentence gives the term-count table below.
Now calculate the cosine similarity between a and b:
a = [1, 1, 2, 1, 1, 1, 0, 0, 1]
b = [1, 1, 1, 0, 1, 1, 1, 1, 1]
The cosine of the angle between the two vectors is their similarity:
cos α = (a · b) / (‖a‖ ‖b‖) = dot product of the vectors / product of the vector magnitudes
cos α = 7 / 8.944 ≈ 0.783, and the cosine dissimilarity = 1 − 0.783 ≈ 0.217
Term     a (Seq 1)   b (Seq 2)
burger       1           1
jenny        1           1
loves        2           1
linda        1           0
than         1           1
more         1           1
likes        0           1
jane         0           1
pizza        1           1
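As a minimal sketch of this calculation (plain Python; the helper name cosine_similarity and the whitespace tokenization are illustrative choices, not part of the original slides):

```python
import math
from collections import Counter

def cosine_similarity(s1: str, s2: str) -> float:
    """Cosine similarity between two sentences using raw term counts."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    vocab = set(c1) | set(c2)
    dot = sum(c1[w] * c2[w] for w in vocab)              # a . b
    norm1 = math.sqrt(sum(v * v for v in c1.values()))   # ||a||
    norm2 = math.sqrt(sum(v * v for v in c2.values()))   # ||b||
    return dot / (norm1 * norm2)

a = "Jenny loves burger more than Linda loves pizza"
b = "Jane likes pizza more than Jenny loves burger"
sim = cosine_similarity(a, b)
print(round(sim, 3), round(1 - sim, 3))  # ~0.783 similarity, ~0.217 dissimilarity
```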
Term Based Example 2: Jaccard similarity
Jaccard similarity:
J(a, b) = |a ∩ b| / |a ∪ b|
a = { jenny, loves, burger, more, than, pizza, linda }
b = { jane, likes, pizza, more, than, burger, loves, jenny }
a ∩ b = { jenny, loves, burger, more, pizza, than }
a ∪ b = { burger, jenny, loves, linda, than, more, likes, jane, pizza }
Jaccard similarity = 6 / 9 ≈ 0.667
After removing stop words during parsing:
Jaccard similarity = 5 / 8 = 0.625, and Jaccard dissimilarity = 1 − Jaccard similarity
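A minimal sketch of the same set-based calculation (plain Python; the helper name jaccard_similarity is illustrative):

```python
def jaccard_similarity(s1: str, s2: str) -> float:
    """Jaccard similarity between the word sets of two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)   # |intersection| / |union|

a = "Jenny loves burger more than Linda loves pizza"
b = "Jane likes pizza more than Jenny loves burger"
sim = jaccard_similarity(a, b)
print(round(sim, 3), round(1 - sim, 3))  # ~0.667 similarity, ~0.333 dissimilarity
```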
Examples of character-based distances
Hamming distance:
Measures the minimum number of substitutions required to change one string into the other; it is defined only for strings of equal length.
For example, the Hamming distance between “Dinesh” and “Ramesh” is 3, and between “John” and “Jack” is 3.
Levenshtein distance:
For comparing strings of different lengths, where not just substitutions but also insertions and deletions have to be expected, the Levenshtein distance is appropriate.
For example, the Levenshtein distance between “Artist” and “Artificial” is 6, computed with the usual dynamic-programming edit-distance matrix (see the sketch below).
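A minimal sketch of both character-based distances (plain Python; the function names and the row-by-row dynamic-programming layout are illustrative):

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn s1 into s2."""
    prev = list(range(len(s2) + 1))              # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                               # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution / match
        prev = curr
    return prev[-1]

print(hamming_distance("Dinesh", "Ramesh"))          # 3
print(levenshtein_distance("Artist", "Artificial"))  # 6
```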
Corpus-based similarity measures
- HAL, LSA, GLSA, ESA, CL-ESA, PMI-IR, SOC-PMI, NGD, DISCO (DISCO1, DISCO2)
Corpus-based similarity measures
1. Hyperspace Analogue to Language (HAL) creates a semantic space from word co-occurrences. A word-by-word matrix is formed, where each matrix element is the strength of association between the word represented by the row and the word represented by the column.
2. Latent Semantic Analysis (LSA) is the most popular corpus-based similarity technique. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows (a small sketch of this pipeline appears after this list).
Generalized Latent Semantic Analysis (GLSA) is a framework for computing semantically motivated term and document vectors. It extends the LSA approach by focusing on term vectors instead of the dual document-term representation.
3. Explicit Semantic Analysis (ESA) is a measure used to compute the semantic relatedness between two arbitrary texts.
Cross-Language Explicit Semantic Analysis (CL-ESA) is a multilingual generalization of ESA. CL-ESA exploits a document-aligned multilingual reference collection such as Wikipedia to represent a document as a language-independent concept vector.
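As a rough illustration of the LSA pipeline described in point 2 (a hedged sketch assuming scikit-learn is available; the toy documents and the choice of two latent components are made up for demonstration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Jenny loves burger more than Linda loves pizza",
    "Jane likes pizza more than Jenny loves burger",
    "The stock market fell sharply after the earnings report",
]

# Term-count matrix (rows = documents in this sketch, columns = vocabulary terms).
counts = CountVectorizer().fit_transform(docs)

# Truncated SVD reduces the dimensionality while keeping most of the similarity structure.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(counts)

# Documents about the same topic end up close together in the latent space.
print(cosine_similarity(doc_vectors))
```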
Corpus-based similarity measures
4. Pointwise Mutual Information – Information Retrieval (PMI-IR) is a method for computing the similarity between pairs of words. It uses AltaVista's Advanced Search query syntax to calculate probabilities: the more often two words co-occur near each other on a web page, the higher their PMI-IR similarity score.
Second-order co-occurrence pointwise mutual information (SOC-PMI) is a semantic similarity measure that uses pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus. The advantage of SOC-PMI is that it can calculate the similarity between two words that do not co-occur frequently, because they co-occur with the same neighboring words.
5. Normalized Google Distance (NGD) is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords:
NGD(x, y) = ( max{log f(x), log f(y)} − log f(x, y) ) / ( log M − min{log f(x), log f(y)} )
where M is the total number of web pages searched by Google; f(x) and f(y) are the numbers of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur. If the two search terms x and y never occur together on the same web page, but do occur separately, the normalized Google distance between them is infinite. If both terms always occur together, their NGD is zero.
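A minimal sketch of the NGD formula (plain Python; the hit counts in the example are invented purely to show the arithmetic, not real search-engine figures):

```python
import math

def normalized_google_distance(fx: float, fy: float, fxy: float, M: float) -> float:
    """NGD from single-term hit counts fx, fy, joint count fxy, and index size M."""
    if fxy == 0:
        return float("inf")  # terms never co-occur on the same page
    num = max(math.log(fx), math.log(fy)) - math.log(fxy)
    den = math.log(M) - min(math.log(fx), math.log(fy))
    return num / den

# Hypothetical counts: two terms that co-occur often have a small NGD.
print(normalized_google_distance(fx=9_000_000, fy=8_000_000, fxy=6_000_000, M=25_000_000_000))
```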
Corpus-based similarity measures
6. Extracting DIStributionally similar words using CO-occurrences (DISCO). Distributional similarity between words assumes that words with similar meanings occur in similar contexts. Large text collections are statistically analyzed to obtain the distributional similarity. DISCO computes distributional similarity between words using a simple context window of size ±3 words for counting co-occurrences. When two words are compared for exact similarity, DISCO simply retrieves their word vectors from the indexed data and computes the similarity according to the Lin measure. If the most distributionally similar words are required, DISCO returns the second-order word vector for the given word. DISCO has two main similarity measures, DISCO1 and DISCO2:
DISCO1 computes the first-order similarity between two input words based on their collocation sets.
DISCO2 computes the second-order similarity between two input words based on their sets of distributionally similar words.
Knowledge-based similarity measures
Knowledge-based measures fall into two groups:
- Similarity measures: information-content based (res, lin, jcn) and path-length based (lch, wup, path)
- Relatedness measures: lesk, vector, hso
Knowledge-based similarity measures
1. Knowledge-based similarity measures identify the degree of similarity between words using information derived from semantic networks.
2. WordNet is the most popular semantic network for measuring knowledge-based similarity between words.
3. WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
4. Knowledge-based similarity measures can be divided roughly into two groups: measures of semantic similarity and measures of semantic relatedness.
5. Semantically similar concepts are deemed to be related on the basis of their likeness. Semantic relatedness, on the other hand, is a more general notion, not specifically tied to the shape or form of the concept.
6. In other words, semantic similarity is one kind of relatedness between two words, while semantic relatedness covers a broader range of relationships between concepts, including relations such as is-a-kind-of, is-a-specific-example-of, is-a-part-of and is-the-opposite-of.
Knowledge-based similarity measures
The Resnik (res) measure is equal to the information content (IC) of the Least Common Subsumer (the most informative subsumer). This means the value is always greater than or equal to zero. The upper bound on the value is generally quite large and varies depending on the size of the corpus used to determine information content values.
The Lin and Jiang & Conrath (lin and jcn) measures augment the information content of the Least Common Subsumer with the sum of the information content of the two concepts themselves. The lin measure scales the information content of the Least Common Subsumer by this sum, while jcn takes the difference of this sum and the information content of the Least Common Subsumer.
The Leacock & Chodorow (lch) measure returns a score denoting how similar two word senses are, based on the shortest path that connects the senses and the maximum depth of the taxonomy in which the senses occur.
The Wu & Palmer (wup) measure returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer.
The path measure returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.
Knowledge-based similarity measures
The Hirst & St-Onge (hso) measure works by finding lexical chains linking the two word senses. Three classes of relations are considered: extra-strong, strong and medium-strong. The maximum relatedness score is 16.
The Lesk (lesk) measure works by finding overlaps in the glosses of the two synsets. The relatedness score is the sum of the squares of the overlap lengths.
The vector pairs (vector) measure creates a co-occurrence matrix for each word used in the WordNet glosses from a given corpus, and then represents each gloss/concept with a vector that is the average of these co-occurrence vectors.
The most popular packages that cover knowledge-based similarity measures are WordNet::Similarity and the Natural Language Toolkit (NLTK).
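A minimal sketch of several of these measures using NLTK's WordNet interface (assuming the wordnet and wordnet_ic corpora have been downloaded; the dog/cat synset pair is just an illustrative example):

```python
import nltk  # one-time setup: nltk.download("wordnet"); nltk.download("wordnet_ic")
from nltk.corpus import wordnet as wn, wordnet_ic

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
brown_ic = wordnet_ic.ic("ic-brown.dat")   # information content from the Brown corpus

print("path:", dog.path_similarity(cat))            # shortest is-a path
print("lch: ", dog.lch_similarity(cat))             # Leacock & Chodorow
print("wup: ", dog.wup_similarity(cat))             # Wu & Palmer
print("res: ", dog.res_similarity(cat, brown_ic))   # Resnik (needs an IC corpus)
print("lin: ", dog.lin_similarity(cat, brown_ic))   # Lin
print("jcn: ", dog.jcn_similarity(cat, brown_ic))   # Jiang & Conrath
```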