
Building chatbots from large scale domain-specific knowledge bases: challenges and opportunities


Walid Shalaby, Adriano Arantes, Teresa GonzalezDiaz, Chetan Gupta
Hitachi America Ltd.
Santa Clara, CA, USA
{walid.shalaby, adriano.arantes, teresa.gonzalezdiaz, chetan.gupta}@hal.hitachi.com

ABSTRACT

Popular conversational agent frameworks such as the Alexa Skills Kit (ASK) and Google Actions (gActions) offer unprecedented opportunities for facilitating the development and deployment of voice-enabled AI solutions in various verticals. Nevertheless, understanding user utterances with high accuracy remains a challenging task with these frameworks, particularly when building chatbots with a large volume of domain-specific entities. In this paper, we describe the challenges and lessons learned from building a large scale virtual assistant for understanding and responding to equipment-related complaints. In the process, we describe an alternative scalable framework for: 1) extracting the knowledge about equipment components and their associated problem entities from short texts, and 2) learning to identify such entities in user utterances. We show through evaluation on a real dataset that the proposed framework, compared to off-the-shelf popular ones, scales better with a large volume of entities, being up to 30% more accurate, and is more effective in understanding user utterances with domain-specific entities.

KEYWORDS

KB extraction from text, natural language understanding, slot tagging, chatbots

1 Introduction

Virtual assistants are transforming how consumers and businesses interact with Artificial Intelligence (AI) technologies. They provide users with a natural language interface to backend AI services, allowing them to ask questions or issue commands using their voice. Nowadays, voice-activated AI services are ubiquitously available across desktops, smart home devices, mobile devices, wearables, and much more, helping people to search for information, organize their schedules, manage their activities, and accomplish many other day-to-day tasks.

The use of digital assistant technology, though widely adopted in the consumer space, has not seen the same degree of interest in industrial and enterprise scenarios. One key challenge for the success of these voice assistants is the so-called conversational Language Understanding (LU), which refers to the agent's ability to understand user speech through precisely identifying her intent and extracting elements of interest (entities) in her utterances. Building high quality LU systems requires: 1) lexical knowledge, meaning access to general, common-sense, as well as domain-specific knowledge related to the task and involved entities, and 2) syntactic/semantic knowledge, meaning "ideally" complete coverage of all the different ways users might utter their commands and questions.

Several conversational agent frameworks were introduced in recent years to facilitate and standardize building and deploying voice-enabled personal assistants. Examples include the Alexa Skills Kit (ASK) [1,2] (Kumar, et al., 2017), Google Actions (gActions) [3] powered by DialogFlow [4] for LU, Cortana Skills [5], the Facebook Messenger Platform [6], and others (López, Quesada, & Guerrero, 2017). Each of these frameworks comes with a developer-friendly web interface which allows developers to define their skills [7] and conversation flow. Speech recognition, LU, built-in intents, and predefined ready-to-use general entity types such as city names and airports come at no cost and require no integration effort.

However, developers are still required to: 1) provide sample utterances either as raw sentences or as templates with slots representing entity placeholders, 2) provide a dictionary of entity values for each domain-specific entity type, and 3) customize responses and interaction flows. Obviously, obtaining such domain-specific knowledge about entities of interest, and defining all possible structures of user utterances, remain two key challenges. Moreover, developers have no control over the LU engine or its outcome, making interpretability and error analysis cumbersome. It is also unclear how the performance of these LU engines, in terms of accuracy, will scale when the task involves a large volume of (tens of) thousands of domain-specific entities, or how generalizable these engines are when user utterances involve new entities or new utterance structures.

[1] https://developer.amazon.com/alexa-skills-kit
[2] https://amzn.to/2qDjNcJ
[3] https://developers.google.com/actions/
[4] https://dialogflow.com/
[5] https://developer.microsoft.com/en-us/cortana
[6] https://developers.facebook.com/docs/messenger-platform/
[7] The task to be accomplished.

Figure 1 High level architecture of our chatbot for responding to vehicle-related complaints
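As a reading aid for Figure 1, the sketch below traces one conversation turn through the four modules. It is a minimal sketch: the module interfaces (asr, lu, dialog_manager, nlg) are illustrative assumptions, not the system's actual API.

    # A minimal sketch of the Figure 1 turn loop; module interfaces are assumptions.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ParsedUtterance:
        text: str
        component: Optional[str] = None   # e.g., "oil pan"
        problem: Optional[str] = None     # e.g., "leaking"

    def handle_turn(audio, asr, lu, dialog_manager, nlg):
        """One turn: speech -> text -> (component, problem) -> repair -> speech."""
        text = asr.transcribe(audio)                      # Automatic Speech Recognition
        parsed = lu.identify_entities(text)               # LU returns a ParsedUtterance
        response = dialog_manager.next_response(parsed)   # e.g., a recommended repair
        return nlg.utter(response)                        # language generation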

In this paper, we describe the challenges and lessons learned from deploying a virtual assistant for suggesting repairs for equipment-related complaints. We demonstrate on two popular frameworks, namely ASK and gActions. Here, we focus on understanding and responding to vehicle-related problems, as an example equipment, which could be initiated by a driver or a technician. Throughout the paper, we try to answer three questions: 1) how can we facilitate the acquisition of domain-specific knowledge about entities related to equipment problems? 2) how much knowledge can off-the-shelf frameworks digest effectively? and 3) how accurately can these frameworks' built-in LU engines identify entities in user utterances?

Due to the scalability and accuracy limitations we experienced with ASK and gActions, we describe an alternative scalable pipeline for: 1) extracting the knowledge about equipment components and their associated problem entities, and 2) learning to identify such entities in user utterances. We show through evaluation on a real dataset that the proposed framework's understanding accuracy scales better with a large volume of domain-specific entities, being up to 30% more accurate.

2 Background and Related Work

Figure 1 shows the main components of our chatbot. In a nutshell, the user utterance is first transcribed into text using the Automatic Speech Recognition (ASR) module. Then, the LU module identifies the entities (component, problem) in the input. Afterwards, the parsed input is passed to the dialog manager, which keeps track of the conversation state and decides the next response (e.g., a recommended repair), which is finally uttered back to the user using the language generation module.

As we mentioned earlier, we focus here on the cognitive-intensive task of creating the Knowledge Base (KB) of target entities on which the LU engine will be trained.

KB construction from text aims at converting unstructured noisy textual data into structured, task-specific, actionable knowledge that captures entities (elements of interest (EOI)), their attributes, and their relationships (Pujara & Singh, 2018). KBs are key components for many AI and knowledge-driven tasks such as question answering (Hao, et al., 2017), decision support systems (Dikaleh, Pape, Mistry, Felix, & Sheikh, 2018), recommender systems (Zhang, Yuan, Lian, Xie, & Ma, 2016), and others. KB construction has been an attractive research topic for decades, resulting in many general KBs such as DBPedia (Auer, et al., 2007), Freebase (Bollacker, Evans, Paritosh, Sturge, & Taylor, 2008), Google Knowledge Vault (Dong, et al., 2014), ConceptNet (Speer & Havasi, 2013), NELL (Carlson, et al., 2010), and YAGO (Hoffart, Suchanek, Berberich, & Weikum, 2013), as well as domain-specific KBs such as the Amazon Product Graph and the Microsoft Academic Graph (Sinha, et al., 2015).

The first step toward building such KBs is to extract information about target entities, attributes, and the relationships between them. Several information extraction frameworks have been proposed in the literature, including OpenIE (Banko, Cafarella, Soderland, Broadhead, & Etzioni, 2007), DeepDive (Niu, Zhang, Ré, & Shavlik, 2012), Fonduer (Wu, et al., 2018), Microsoft QnA Maker (Shaikh, 2019), and others. Most current information extraction systems utilize Natural Language Processing (NLP) techniques such as Part of Speech (POS) tags, shallow parsing, and dependency parse trees to extract linguistic features for recognizing entities.

Despite the extensive focus in academic and industrial labs on constructing general purpose KBs, identifying component names and their associated problems in text has been only lightly studied in the literature.

Figure 2 The Knowledge base construction framework is a pipeline of five main stages

Table 1 Sample vehicle complaint utterances (problems and components)

Complaint Utterance
low oil pressure
fuel filter is dirty
leak at oil pan
coolant reservoir cracked
pan leaking water
coolant tank is leaking

The closest work to ours is (Niraula, Whyatt, & Kao, 2018), who proposed an approach to identify component names in service and maintenance logs using a combination of linguistic analysis and machine learning. The authors start with seed head nouns representing high level part names (e.g., valve, switch), then extract all n-grams ending with these head nouns. Afterwards, the extracted n-grams are purified using heuristics. Finally, the purified part names are used to create annotated training data for a Conditional Random Fields (CRF) model (Lafferty, McCallum, & Pereira, 2001) that extracts part names from raw sentences.

Similarly, (Chandramouli, et al., 2013) introduced a simple approach using n-gram extraction from service logs. Given a seed of part types, the authors extract all n-grams, with a maximum of three tokens, which end with these part types. Candidate n-grams are then scored using a mutual information metric and purified using POS tagging.

Our framework automatically constructs a KB of equipment components and their problem entities with "component <has-a> problem" relationships. Unlike previous work, we go one step further by extracting not only components and part names, but also their associated problems. Unlike (Niraula, Whyatt, & Kao, 2018), we start with syntactic rules rather than seed head nouns. The rules require less domain knowledge and should yield higher coverage. We then expand the constructed KB through two steps: 1) reorganizing the extracted vocabulary of components into a hierarchy using a simple traversal mechanism, introducing <is-a> relationships (e.g., stop light <is-a> light), and 2) aggregating all the problems associated with subtype components in the hierarchy and associating them with supertype components, introducing more <has-a> relationships (e.g., coolant gauge <has-a> not reading → gauge <has-a> not reading). Unsupervised curation and purification of extracted entities is another key differentiator of our framework compared to prior work. The proposed framework utilizes state-of-the-art deep learning for sequence tagging to annotate raw sentences with component(s) and problem(s).

3 A Pipeline for KB Extraction

Existing chatbot development frameworks require knowledge about target entities [8] which would appear in user utterances. For each entity type (e.g., component, problem, etc.), an extensive vocabulary of possible values of such entities should be provided by the virtual assistant developer. These vocabularies are then used to train the underlying LU engine to identify entities in user utterances.

We propose a pipeline for creating a KB of entities related to vehicle complaint understanding from short texts, specifically posts in public Questions and Answers (QA) forums. Nevertheless, the design of the proposed framework is flexible and generic enough to be applied to several other maintenance scenarios for different equipment, given a corpus with mentions of the same target entities. Table 1 shows sample complaint utterances from QA posts. As we can notice, most of these utterances are short sentences composed of a component along with an ongoing problem.

As shown in Figure 2, the proposed KB construction system is organized as a pipeline. We start with a domain-specific corpus that contains our target entities. We then process the corpus through five main stages, including preprocessing, candidate generation using POS-based syntactic rules, embedding-based filtration and curation, and enrichment through training a sequence-to-sequence (seq2seq) slot tagging model. Our pipeline produces two outputs:

- A KB of three types of entities, including car options (car, truck, vehicle, etc.), components, and their associated problems. These entities can be used to populate the vocabulary needed to build the voice-based agent in both ASK and DialogFlow.
- A tagging model which we call Sequence-to-Sequence Tagger (S2STagger). Besides its value in enriching the KB with new entities, S2STagger can also be used as a standalone LU system that is able to extract target entities from raw user utterances.

In the following sub-sections, we describe in more detail each of the stages presented in Figure 2.

3.1 Preprocessing

Dealing with noisy text is challenging. In the case of equipment troubleshooting, service and repair records and QA posts include complaint, diagnosis, and correction text, which represent highly rich resources of components and the problems that might arise with each of them. Nevertheless, these records are typically written by technicians and operators who have time constraints and may lack language proficiency. Consequently, the text will be full of typos, spelling mistakes, inconsistent use of vocabulary, and domain-specific jargon and abbreviations. For these reasons, cautious use of preprocessing is required to reduce such inconsistencies and avoid inappropriate corrections. We perform the following preprocessing steps:

- Lowercase.
- Soft normalization: removing punctuation characters separating single characters (e.g., a/c, a.c, a.c. → ac).
- Hard normalization: collecting all frequent tokens that are prefixes of a larger token and manually replacing them with their normalized version (e.g., temp → temperature, eng → engine, diag → diagnose, etc.).
- Dictionary-based normalization: we create a dictionary of frequent abbreviations and use it to normalize tokens in the original text (e.g., chk, ch, ck → check).
- Manual tagging: we manually tag terms such as vehicle, car, truck, etc. as a car-option entity.

3.2 Candidate Generation

To extract candidate entities, we define a set of syntactic rules based on the POS tags of complaint utterances. First, all sentences are extracted and parsed using the Stanford CoreNLP library (Manning, et al., 2014). Second, we employ linguistic heuristics to define chunks of tokens corresponding to component and problem entities based on their POS tags. Specifically, we define the rules considering only the most frequent POS patterns in our dataset.

Table 2 shows the rules defined for the six most frequent POS patterns. For example, whenever a sentence POS pattern matches an adjective followed by a sequence of nouns of arbitrary length (JJ (NN\S*\s?)+$) (e.g., "low air pressure"), the adjective chunk is considered a candidate problem entity ("low") and the noun sequence chunk is considered a candidate component entity ("air pressure"). It is worth mentioning that the defined heuristics are designed to capture components with long multi-term names, which are common in our corpus (e.g., "intake manifold air pressure sensor"). We also discard irrelevant tokens in the extracted chunk, such as determiners (a, an, the) preceding noun sequences.

Table 2 POS-based syntactic rules for candidate entity generation (problems and components)

Utterance                 | POS                           | Rule
replace water pump        | VB (NN\S*\s?)+                | (NN\S*\s?)+ → component
low oil pressure          | JJ (NN\S*\s?)+                | JJ → problem, (NN\S*\s?)+ → component
fuel filter is dirty      | (NN\S*\s?)+ VBZ JJ            | (NN\S*\s?)+ → component, JJ → problem
coolant reservoir cracked | (NN\S*\s?)+ VBD               | (NN\S*\s?)+ → component, VBD → problem
pan leaking water         | (NN\S*\s?)+ VBG (NN\S*\s?)+   | (NN\S*\s?)+ → component, VBG (NN\S*\s?)+ → problem
coolant tank is leaking   | (NN\S*\s?)+ VBZ VBG           | (NN\S*\s?)+ → component, VBG → problem

[8] Slots in ASK terminology.
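To make the rule matching concrete, the following sketch applies Table 2-style patterns to a space-joined string of POS tags and maps matched groups back to token spans. It is a minimal illustration under stated assumptions: NLTK's tagger stands in for the Stanford CoreNLP pipeline used in the paper, and only two of the six rules are shown.

    import re
    from nltk import pos_tag, word_tokenize  # stand-in for Stanford CoreNLP
    # requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    # Each rule is a regex over the space-joined POS tags, plus the entity type
    # of each capture group, mirroring two of the Table 2 patterns.
    RULES = [
        (r'^(JJ) ((?:NN\S*\s?)+)$', [('problem', 1), ('component', 2)]),      # "low oil pressure"
        (r'^((?:NN\S*\s?)+) VBZ (JJ)$', [('component', 1), ('problem', 2)]),  # "fuel filter is dirty"
    ]

    def extract_candidates(utterance):
        tokens = word_tokenize(utterance.lower())
        tag_str = ' '.join(tag for _, tag in pos_tag(tokens))
        for pattern, groups in RULES:
            match = re.match(pattern, tag_str)
            if not match:
                continue
            entities = {}
            for entity_type, g in groups:
                start = tag_str[:match.start(g)].count(' ')          # first token of the chunk
                end = start + match.group(g).strip().count(' ') + 1  # one past the last token
                entities[entity_type] = ' '.join(tokens[start:end])
            return entities
        return {}

    print(extract_candidates('low oil pressure'))
    # {'problem': 'low', 'component': 'oil pressure'}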

Figure 3 S2STagger utilizes an LSTM encoder-decoder to generate the IOB tags of the input utterance. An attention layer is used to learn to softly align the input/output sequences

Figure 4 Component hierarchy construction through backward traversal. Left – traversal through "engine oil pressure gauge" resulting in three higher level components ("gauge", "pressure gauge", "oil pressure gauge"). Right – example hierarchy with "sensor" as the root supertype component
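The backward traversal that Figure 4 depicts, together with the problem aggregation described later in Section 3.5, can be sketched in a few lines. The data structures below are illustrative assumptions; the paper aggregates level by level up the hierarchy, while this sketch takes the direct union per subtype, which yields the same final associations.

    from collections import defaultdict

    def backward_traverse(component):
        """'engine oil pressure gauge' -> ['gauge', 'pressure gauge', 'oil pressure gauge']."""
        tokens = component.split()
        return [' '.join(tokens[i:]) for i in range(len(tokens) - 1, 0, -1)]

    def aggregate_problems(has_a):
        """Roll subtype problems up to supertypes, introducing new
        'subtype <is-a> supertype' and 'supertype <has-a> problem' relations."""
        enriched = defaultdict(set, {c: set(p) for c, p in has_a.items()})
        for component, problems in has_a.items():
            for supertype in backward_traverse(component):
                enriched[supertype] |= set(problems)
        return dict(enriched)

    kb = {'oil pressure sensor': {'not reading'}, 'tire pressure sensor': {'leaking'}}
    print(aggregate_problems(kb))
    # 'pressure sensor' and 'sensor' inherit both problems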

3.3 Curation

In this stage, we prune incorrect and noisy candidate entities using weak supervision. We found that most of the wrong extractions were due to incorrect annotations from the POS tagger, caused by the noisy nature of the text. For example, "clean" in "clean tank" was incorrectly tagged as an adjective rather than a verb, causing "clean" to be added to the candidate problems pool. As another example, "squeals" in "belt squeals" was tagged as a plural noun rather than a verb, causing "belt squeals" to be added to the candidate components pool. To alleviate these issues, we employ different weak supervision methods to prune incorrectly extracted entities, as follows:

- Statistical-based pruning: a simple pruning rule is to eliminate candidates that rarely appear in our corpus, with frequency less than F.
- Linguistic-based pruning: these rules focus on the number and structure of tokens in the candidate entity. For example, a candidate entity cannot exceed T terms, must have terms with a minimum of L letters each, and cannot have alphanumeric tokens.
- Embedding-based pruning: fixed-length distributed representation models (aka embeddings) have proven effective for representing words and entities in many NLP tasks (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) (Shalaby, Zadrozny, & Jin, 2019). We exploit the fact that these models can effectively capture similarity and relatedness relationships between pairs of words and entities using their embeddings. To this end, we employ the model proposed by (Shalaby, Zadrozny, & Jin, 2019) to obtain the vector representations of all candidates. Then, we normalize all vectors and compute the similarity score between pairs of candidates using the dot product of their corresponding vectors. Afterwards, we prune all candidate problems that do not have at least P other problem entities with a minimum similarity score of Sp, and prune all components that do not have at least C other component entities with a minimum similarity score of Sc.
- Sentiment-based pruning: utterances that express problems and issues usually have negative sentiment. With this assumption, we prune all candidate problem entities that are not semantically similar to at least one word from the list of negative sentiment words created by (Hu & Liu, 2004). Here, we measure the similarity score using the embeddings of candidate problem entities and the sentiment words, as in embedding-based pruning. Sentiment-based pruning helps discard wrong extractions such as "clean" in "clean tank", where "clean" is tagged incorrectly as an adjective.

3.4 Slot tagging (S2STagger)

A desideratum of any information extraction system is to be lexical-agnostic; i.e., able to generalize well and identify unknown entities that have no mentions in the original dataset. Another desideratum is to be structure-agnostic; i.e., able to generalize well and identify seen or new entities in utterances with structures different from those in the original dataset. Rule-based candidate extraction typically yields highly precise extractions. However, depending solely on rules limits the system's recognition capacity to mentions in structures that match these predefined rules. Moreover, it is infeasible to handcraft rules that cover all possible complaint structures, limiting the system's recall. It is also expected that new components and problems will emerge, especially in highly dynamic domains, and running the rules on an updated snapshot of the corpus would be an expensive solution.

A more practical and efficient solution is to build a machine learning model to tag raw sentences and identify chunks of tokens that correspond to our target entities. To this end, we adopt a neural attention-based seq2seq model, called S2STagger, to tag raw sentences and extract target entities from them. To train S2STagger, we create a dataset from utterances that match our syntactic rules and label terms in these utterances using the inside-outside-beginning (IOB) notation (Ramshaw & Marcus, 1999). For example, "the car air pressure is low" would be tagged as "<O> <car-options> <B-component> <I-component> <O> <B-problem>". As the extractions from the syntactic rules followed by curation are highly accurate, we expect to have highly accurate training data for our tagging model. It is worth mentioning that we only use utterances with mentions of entities not pruned during the curation phase.

As shown in Figure 3, S2STagger utilizes an encoder-decoder Recurrent Neural Network (RNN) architecture with Long Short-Term Memory (LSTM) cells (Gers, Schmidhuber, & Cummins, 1999). During encoding, the raw terms in each sentence are processed sequentially through an RNN and encoded into a fixed-length vector that captures all the semantic and syntactic structures in the sentence. Then, a decoder RNN takes this vector and produces a sequence of IOB tags, one for each term in the input sentence. Because each tag might depend on one or more terms in the input but not the others, we utilize an attention mechanism so that the network learns which terms in the input are more relevant for each tag in the output (Bahdanau, Cho, & Bengio, 2014) (Luong, Pham, & Manning, 2015).
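A compact PyTorch sketch of an attention-based encoder-decoder tagger in the spirit of Figure 3 is shown below. The hyperparameters, the choice to feed the encoder state at each position as the decoder input, and the toy training step are illustrative assumptions, not the paper's exact setup.

    import torch
    import torch.nn as nn

    class Seq2SeqTagger(nn.Module):
        """LSTM encoder-decoder with Luong-style attention; emits one IOB tag per token."""
        def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.decoder = nn.LSTMCell(hidden, hidden)
            self.attn_out = nn.Linear(2 * hidden, hidden)
            self.classify = nn.Linear(hidden, num_tags)

        def forward(self, token_ids):                        # (batch, seq_len)
            enc_states, (h, c) = self.encoder(self.embed(token_ids))
            h, c = h.squeeze(0), c.squeeze(0)                # decoder starts from encoder's final state
            logits = []
            for t in range(token_ids.size(1)):               # one decoding step per input token
                h, c = self.decoder(enc_states[:, t], (h, c))
                # attention: score every encoder state against the decoder state
                scores = torch.bmm(enc_states, h.unsqueeze(2)).squeeze(2)
                context = torch.bmm(torch.softmax(scores, 1).unsqueeze(1), enc_states).squeeze(1)
                combined = torch.tanh(self.attn_out(torch.cat([h, context], dim=1)))
                logits.append(self.classify(combined))
            return torch.stack(logits, dim=1)                # (batch, seq_len, num_tags)

    # Tags follow the IOB scheme of Section 3.4, e.g. "the car air pressure is low"
    # -> <O> <car-options> <B-component> <I-component> <O> <B-problem>
    TAGS = ['O', 'car-options', 'B-component', 'I-component', 'B-problem', 'I-problem']
    model = Seq2SeqTagger(vocab_size=10000, num_tags=len(TAGS))
    logits = model(torch.randint(0, 10000, (2, 6)))          # two toy utterances, six tokens each
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, len(TAGS)),
                                 torch.randint(0, len(TAGS), (2, 6)).reshape(-1))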

Table 3 Dataset statistics (Mechanics StackExchange + Yahoo QA)

Sample utterances: "rough start"; "battery drains when AC is used"
# of utterances: 574,432
# of utterances matching syntactic rules: 11,619 (~2%)
# of extracted components: 5,972
# of extracted problems: 2,455

3.5 KB Consolidation and Enrichment

At this stage, we enrich the KB with new entities not explicitly mentioned in the training utterances. These new entities are obtained from three different sources:

- S2STagger: after training S2STagger, we use it to tag the remaining utterances in our dataset which do not match our syntactic rules, resulting in a new set of entities. Importantly, the trained S2STagger model can be used to tag newly unseen utterances, allowing the proposed KB framework to scale efficiently whenever new utterances are collected.
- Component backward traversal: we propose using a simple traversal method to create a hierarchy of supertype components from the extracted components vocabulary after curation and tagging using S2STagger. As shown in Figure 4, we consider each extracted component (subtype) and traverse backward through its tokens one token at a time. At each step, we append the new token to the component identified in the previous traversal step (supertype). For example, traversing "engine oil pressure gauge" will result in "gauge", "pressure gauge", and "oil pressure gauge", in order. As we can notice, components at the top of the hierarchy represent high level and generic ones (supertypes) which can be common across domains (e.g., sensor, switch, pump, etc.). The hierarchy allows introducing "subtype <is-a> supertype" relationships between components, enriching the KB with more supertype components.
- Problem aggregation: the new components identified through backward traversal will initially have no problems associated with them. We propose using a simple aggregation method to automatically associate supertype components with the problems of their subtypes. First, we start with the leaf subtype components in the hierarchy. Second, we navigate through the hierarchy upward one level at a time. At each step, we combine all the problems from the previous level and associate them with the supertype component at the current level. For example, all problems associated with "oil pressure sensor", "tire pressure sensor", etc. will be aggregated and associated with "pressure sensor". Then the problems of "pressure sensor", "o2 sensor", "abs sensor", etc. will be aggregated and associated with "sensor". This simple aggregation method allows introducing new "supertype <has-a> problem" relationships into the constructed KB.

We consolidate the entities from the curation stage along with the new entities discovered at each of the three steps to create our KB of component and problem entities, as shown in Figure 2.

4 Data and Model Evaluation

4.1 Dataset

We experiment with our framework on two datasets in the automotive sector: first, a dataset of Questions and Answers (QA) from the public Mechanics Stack Exchange [9] QA forum; second, a subset of questions related to car maintenance from the Yahoo QA dataset. Table 3 shows some example utterances and statistics from the datasets. As we can see, the coverage of the syntactic rules is noticeably low. This demonstrates the need for a learning-based entity extractor, such as our proposed S2STagger model, to harness the knowledge from utterances not matching the predefined rules.

4.2 Augmentation and Model Training

We create a labeled dataset for S2STagger using the question utterances. After candidate extraction, we curate the extracted entities using F=1; T=6 and 5 for components and problems, respectively; L=2 and 2 for components and problems, respectively; P=1, Sp=0.5, C=5, Sc=0.5. Then, we only tag utterances with entity mentions found in the curated entities pool.

4.3 Quantitative Evaluation

One of the main pain points in creating voice-enabled agents is the definition of the vocabulary users will use to communicate with the system. This vocabulary is typically composed of a large number of entities which are very costly and time consuming to define manually. Our framework greatly facilitates this step through the construction of KBs with target entities which can be readily used to build voice-enabled agents. To assess the effectiveness of the extracted knowledge, we create three tagging models. Two models use off-the-shelf NLU technologies, including: 1) a skill (AlexaSkill) using the Amazon Alexa Skills Kit [10], and 2) an agent (DiagFlow) using Google DialogFlow [11]. With both models, we define utterance structures equivalent to the syntactic rule structures.

[9] https://mechanics.stackexchange.com/
[10] https://developer.amazon.com/alexa
[11] https://dialogflow.com/

Table 4 Evaluation dataset of vehicle-related complaints

Characteristic                        | Utterances
Same structure / Same entities        | 119
Same structure / Different entities   | 75
Different structure / Same entities   | 20
All                                   | 214

Table 5 Accuracy on the vehicle-related complaints dataset (exact IOB match; correct count and percentage)

                                         | AlexaSkill | DiagFlow | S2STagger
Same structure / Same entities (119)     | (83) 70%   | (47) 39% | (111) 93%
Same structure / Different entities (75) | (40) 53%   | (7) 9%   | (67) 89%
All (194)                                | (123) 63%  | (54) 28% | (178) 92%

Table 6 Accuracy on utterances with different structure / same entities (exact IOB match)

           | Whole utterance (20) | Component entities (20) | Problem entities (19)
AlexaSkill | 0%                   | 0%                      | 0%
DiagFlow   | 0%                   | 4 (45%)                 | 0%
S2STagger  | 2 (10%)              | 9 (45%)                 | 3 (16%)

We also feed the same curated entities in our KB to both models, as slot values and entities for AlexaSkill and DiagFlow, respectively. The third model is S2STagger, trained on all the tagged utterances in the QA dataset. It is important to emphasize that the training utterances for S2STagger are the same ones from which the KB entities and utterance structures are extracted and fed to both AlexaSkill and DiagFlow. Due to model size limitations imposed by these frameworks, we couldn't feed the raw utterances to both agents as we did with S2STagger.

We create an evaluation dataset of manually tagged utterances. The dataset describes vehicle-related complaints and is shown in Table 4. The utterances are chosen such that three aspects of the model are assessed. Specifically, we would like to quantitatively measure model accuracy on utterances with: 1) the same syntactic structures and same entities as in the training utterances (119 in total), 2) the same syntactic structures but different entities from the training utterances (75 in total), and 3) different syntactic structures but the same entities as in the training utterances (20 in total). It is worth mentioning that, to alleviate the out-of-vocabulary (OOV) problem, the different entities are created from terms in the model vocabulary. This way, incorrect tagging can only be attributed to the model's inability to generalize to entities tagged differently from the training ones.

Table 5 shows the accuracy of S2STagger compared to the other models on the car complaints evaluation dataset. We report the exact match accuracy, in which each and every term in the utterance must be tagged the same as in the ground truth for the utterance to be considered correctly tagged. As we can notice, S2STagger gives the best exact accuracy, outperforming the other models significantly. Interestingly, on the utterances that have the same structure but different entities, S2STagger's tagging accuracy is close to that on same structure / same entities utterances. This indicates that our model is more lexical-agnostic and can generalize better than the other two models. The AlexaSkill model comes second, while the DiagFlow model can only tag a few utterances correctly, indicating its heavy dependence on exact entity matching and limited generalization ability.

On the other hand, the three models seem more sensitive to variation in utterance syntactic structure than to lexical variation. As we can notice in Table 6, the three models fail to correctly tag almost all the utterances with different structure (S2STagger tags 2 of 20 correctly). Even when we measure accuracy at the entity extraction level rather than the whole utterance, both the AlexaSkill and DiagFlow models still struggle with understanding the different structure utterances. S2STagger, on the other hand, can tag 9 component entities and 4 problem entities correctly, which is still lower than its accuracy on the utterances with the same structure.
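The exact-match metric reported in Tables 5 and 6 is strict: an utterance counts as correct only if every term's tag matches the ground truth. A minimal sketch of that computation, with hypothetical tag sequences for illustration:

    def exact_iob_accuracy(predicted, gold):
        """predicted, gold: parallel lists of per-utterance IOB tag sequences."""
        assert len(predicted) == len(gold)
        exact = sum(1 for p, g in zip(predicted, gold) if p == g)
        return exact / len(gold)

    gold = [['O', 'car-option', 'B-component', 'I-component', 'B-problem'],
            ['car-option', 'O', 'B-problem', 'B-component']]
    pred = [['O', 'car-option', 'B-component', 'I-component', 'B-problem'],
            ['car-option', 'O', 'B-problem', 'O']]   # one wrong tag -> whole utterance wrong
    print(exact_iob_accuracy(pred, gold))            # 0.5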

Table 7 Success and failure tagging of the three models (* indicates incorrect tagging)

Same structure / Same entities

"my car steering wheel wobbles"
  Ground Truth: <O> <car-option> <B-component> <I-component> <B-problem>
  AlexaSkill:   <O> <car-option> <B-component> <I-component> <B-problem>
  DiagFlow:     <O> <car-option> <B-component> <I-component> <B-problem>
  S2STagger:    <O> <car-option> <B-component> <I-component> <B-problem>

"car has low coolant"
  Ground Truth: <car-option> <O> <B-problem> <B-component>
  AlexaSkill:   <car-option> <O> <B-problem> <B-component>
  DiagFlow:     fail
  S2STagger:    <car-option> <O> <B-problem> <B-component>

Same structure / Different entities

"clutch pedal is hard to push"
  Ground Truth: <B-component> <I-component> <O> <B-problem> <I-problem> <I-problem>
  AlexaSkill:   <B-component> <I-component> <O> <B-problem> <I-problem> <I-problem>
  DiagFlow:     fail
  S2STagger:    <B-component> <I-component> <O> <B-problem> <I-problem> <I-problem>

"wrapped brake rotor"
  Ground Truth: <B-problem> <B-component> <I-component>
  AlexaSkill:   fail
  DiagFlow:     fail
  S2STagger:    <B-problem> <B-component> <I-component>

Different structure / Same entities

"the steering wheel in my car wobbles"
  Ground Truth: <O> <B-component> <I-component> <O> <O> <car-option> <B-problem>
  AlexaSkill:   fail
  DiagFlow:     fail
  S2STagger:    <O> <B-problem>* <B-component>* <I-component>* <O> <car-option> <B-problem>

"low car coolant"
  Ground Truth: <B-problem> <car-option> <B-component>
  AlexaSkill:   <B-problem> <O>* <B-component>
  DiagFlow:     fail
  S2STagger:    <B-problem> <car-option> <B-component>

4.4 Qualitative Evaluation

To better understand these results, we show some examples of success and failure cases for the three models in Table 7. As we can notice, the models work well on what they have seen before (same structure and same entities). On the other hand, when the same entities appear in a paraphrased version of a training utterance (e.g., "car has low coolant" vs. "low car coolant"), the models generally fail to recognize them. When it comes to generalizing to entity mentions different from training in utterances with the same structures, such as "hard to push" and "brake rotor", S2STagger generalizes better than the two other models, though "hard" and "brake" were already labeled as entities in the training data.

More importantly, even though S2STagger can successfully tag new entities ("hard to push" and "brake rotor") when they appear in structures similar to training, it fails to recognize the same entities when they appear in a slightly different structure (e.g., "the steering wheel in my car wobbles" vs. "my car steering wheel wobbles"). These examples demonstrate the ability of S2STagger to generalize to unseen entities better than to unseen structures. They also demonstrate how the other two models seem to depend heavily on lexical matching, causing them to fail to recognize mentions of new entities.

5 Conclusion

The proposed pipeline serves our goal of automatically constructing the knowledge required to understand equipment-related complaints in an arbitrary target domain (e.g., vehicles, appliances, etc.). By creating such knowledge about components and the problems associated with them, it is possible to identify what the user is complaining about.

One of the benefits of the proposed KB construction framework is facilitating the development and deployment of intelligent conversational assistants for various industrial AI scenarios (e.g., maintenance & repair, operations, etc.) through better understanding of user utterances. As we demonstrated in Section 4, the KB constructed from QA forums facilitated developing two voice-enabled assistants using ASK and DialogFlow without any finetuning or adaptation to either vendor. Thus, the proposed pipeline is an important tool to potentially automate the deployment process of voice-enabled AI solutions, making it easy to use NLU systems from any vendor. In addition, S2STagger provides a scalable and efficient mechanism to extend the constructed KB beyond the handcrafted candidate extraction rules. Another benefit of this research is to improve existing maintenance & repair solutions through better processing of user complaint text.

In other words, identifying the component(s) and problem(s) in the noisy user complaint text allows focusing on these entities only while predicting the repair.

The results demonstrate the superior performance of the proposed knowledge construction pipeline, including S2STagger, the slot tagging model, over popular systems such as ASK and DialogFlow in understanding vehicle-related complaints. One important and must-have feature is to increase the effectiveness of S2STagger and off-the-shelf NLU systems in handling utterances with structures different from training. We think augmenting the training data with carrier phrases is one approach. Additionally, training the model to paraphrase and tag jointly could be a more genuine approach, as it does not require manually defining the paraphrasing or carrier phrase patterns.

There were also some issues that impacted the development of this research, for example, the limited scalability of off-the-shelf NLU systems: the ASK model size cannot exceed 1.5MB, while DialogFlow agents cannot contain more than 10K different entities. Deployment of the constructed KB on any of these platforms would be limited to a subset of the extracted knowledge. Therefore, it seems mandatory for businesses and R&D labs to develop in-house NLU technologies to bypass such limitations.

REFERENCES

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). DBpedia: A nucleus for a web of open data. Springer.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. IJCAI, 7, pp. 2670-2676.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, (pp. 1247-1250).
Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr, E. R., & Mitchell, T. M. (2010). Toward an architecture for never-ending language learning. AAAI, 5, p. 3.
Chandramouli, A., Subramanian, G., Bal, D., Ao, S. I., Douglas, C., Grundfest, W. S., & Burgstone, J. (2013). Unsupervised extraction of part names from service logs. Proceedings of the World Congress on Engineering and Computer Science, 2.
Dikaleh, S., Pape, D., Mistry, D., Felix, C., & Sheikh, O. (2018). Refine, restructure and make sense of data visually, using IBM Watson Studio. Proceedings of the 28th Annual International Conference on Computer Science and Software Engineering, (pp. 344-346).
Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., . . . Zhang, W. (2014). Knowledge vault: A web-scale approach to probabilistic knowledge fusion. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, (pp. 601-610).
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999). Learning to forget: Continual prediction with LSTM.
Hao, Y., Zhang, Y., Liu, K., He, S., Liu, Z., Wu, H., & Zhao, J. (2017). An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1, pp. 221-231.
Hoffart, J., Suchanek, F. M., Berberich, K., & Weikum, G. (2013). YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194, pp. 28-61.
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, (pp. 168-177).
Kumar, A., Gupta, A., Chan, J., Tucker, S., Hoffmeister, B., Dreyer, M., . . . others. (2017). Just ASK: building an architecture for extensible self-service spoken language understanding. arXiv preprint arXiv:1711.00549.
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
López, G., Quesada, L., & Guerrero, L. A. (2017). Alexa vs. Siri vs. Cortana vs. Google Assistant: a comparison of speech-based natural user interfaces. International Conference on Applied Human Factors and Ergonomics, (pp. 241-250).
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd annual meeting of the association for computational linguistics: system demonstrations, (pp. 55-60).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, (pp. 3111-3119).
Niraula, N. B., Whyatt, D., & Kao, A. (2018). A Novel Approach to Part Name Discovery in Noisy Text. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), 3, pp. 170-176.
Niu, F., Zhang, C., Ré, C., & Shavlik, J. W. (2012). DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference. VLDS, 12, pp. 25-28.
Pujara, J., & Singh, S. (2018). Mining Knowledge Graphs From Text. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, (pp. 789-790).
Ramshaw, L. A., & Marcus, M. P. (1999). Text chunking using transformation-based learning. Springer.
Shaikh, K. (2019). Creating the FAQ Bot Backend from Scratch. Springer.
Shalaby, W., Zadrozny, W., & Jin, H. (2019). Beyond word embeddings: learning entity and concept representations from large scale knowledge bases. Information Retrieval Journal, 22(6), pp. 525-542.
Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-j. P., & Wang, K. (2015). An overview of Microsoft Academic Service (MAS) and applications. Proceedings of the 24th international conference on world wide web, (pp. 243-246).
Speer, R., & Havasi, C. (2013). ConceptNet 5: A large semantic network for relational knowledge. Springer.
Wu, S., Hsiao, L., Cheng, X., Hancock, B., Rekatsinas, T., Levis, P., & Ré, C. (2018). Fonduer: Knowledge base construction from richly formatted data. Proceedings of the 2018 International Conference on Management of Data, (pp. 1301-1316).
Zhang, F., Yuan, N. J., Lian, D., Xie, X., & Ma, W.-Y. (2016). Collaborative knowledge base embedding for recommender systems. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, (pp. 353-362).
