
SoutheastCon 2024

A General Approach to Website Question Answering with Large Language Models

Yilang Ding, Jiawei Nie, Di Wu, Chang Liu
Emory University
[email protected], [email protected], [email protected], [email protected]

SoutheastCon 2024 | 979-8-3503-1710-7/24/$31.00 ©2024 IEEE | DOI: 10.1109/SOUTHEASTCON52093.2024.10500166

Abstract—Language Models (LMs), in their most basic form, perform just like any other machine learning model: they produce interpolations and extrapolations based on their training distribution. Although recent models such as OpenAI's GPT-4 have demonstrated unprecedented capabilities in absorbing the copious volumes of information in their training data, their ability to consistently reproduce factual information remains unproven. Additionally, LMs on their own lack the ability to keep up to date with real-life data without frequent fine-tuning. These drawbacks effectively render base LMs unserviceable in Question Answering scenarios where they must respond to queries regarding volatile information. Retrieval Augmented Generation (RAG) and Tool Learning [1] were proposed as solutions to these problems, and with the development and usage of associated libraries, the aforementioned problems can be

greatly mitigated. In this paper, we present a general approach to website Question Answering that integrates the zero-shot decision-making capabilities of LMs with the RAG capabilities of LangChain and can be kept up to date with dynamic information without the need for constant fine-tuning.

I. INTRODUCTION

In the realm of artificial intelligence, large language models (LLMs) have emerged as powerful tools capable of understanding and generating human-like text. One compelling use case for these models is question answering (QA). QA involves the ability of these models to comprehend natural language queries and provide accurate and contextually relevant responses. However, due to the known and fundamentally unaddressable issue of hallucinations, in which LLMs generate factually incorrect information arising from a variety of factors including, but not limited to, source-reference divergence, bias, and erroneous decoding [2], LLMs are unfit for the QA scenario without assistance. Additionally, any change to the knowledge base necessitates some form of retraining in order for the LLM to retain the ability to answer questions accurately, which can quickly become costly and impractical for larger models and dynamic knowledge bases like school websites. However, with the advent of technologies like tool learning and Retrieval Augmented Generation (RAG), these issues can be greatly mitigated. With this in mind, we propose an end-to-end system for the web QA scenario that utilizes tool learning, RAG, and the logic capabilities of LLMs to address the two aforementioned problems.
II. SYSTEM SCHEMATIC

The system we propose can be divided into four parts, separated into pretraining and runtime processes: (1) as pretraining, collect and store knowledge in a semi-structured way; (2) while deployed, search for and select the pieces of information most relevant to a user's query; (3) incorporate the information obtained from (2) and the capabilities of the LLM to answer the user's query in natural language; (4) should the knowledge base change, identify the discrepancy and swap out the information.

A. Pretraining Stage

In the first stage, we acquire the sitemap of a target domain. This yields an understanding of the structure of the domain and helps locate any information that may be used during QA. Once all relevant URLs are identified, we go through the documents and employ an LLM to process each one and generate a summary tag: a short string of text that acts as a summarizing label for a resource. For example, on a college website, a course requirement list for a specific undergraduate major may result in the tag "X major checklist". Although it isn't strictly necessary, we limit the knowledge base to textual information contained in web pages and PDF files for simplicity. Once all documents have been processed, we regroup the tagged documents into similar groupings; continuing the example, all resources relating to undergraduate academic requirements may be grouped together under the grouping Undergraduate Academic Requirements. The tagged documents are then broken up into smaller strings of text and stored with their group in a database. After this process, all remaining information in the knowledge base will have been restructured and stored as fragmented pieces across many databases (Fig. 1). In our experiments, we employed a breadth-first-search-based web crawler to find the URLs and the structure of the domain, used GPT-3.5-turbo as our LLM of choice, used the Python library LangChain to process and segment the documents, and stored the tagged document slices in MySQL databases as strings.

Fig. 1: Example Database after First Stage
B. Runtime Stages

The second stage locates information that will help answer a user's query once the system is deployed. By utilizing the information bank that we have built through the first stage, we can bypass the problem of hallucination, since we essentially transform the task that the LLM has to perform from a task of knowledge generation to a task of summarization, which is much less prone to hallucination.


Authorized licensed use limited to: Universidade Federal Rural do Semiárido - Campus Mossoró. Downloaded on June 21,2024 at 16:08:15 UTC from IEEE Xplore. Restrictions apply.
Fig. 2: Runtime Flowchart

To do this, we query
the LLM and ask it to select a database for us to search from. We then search the database for relevant pieces of information and insert them into a LangChain vector store: a data structure that stores information in a vector space and enables vector similarity searches via distance metrics such as cosine similarity and Euclidean distance. Combined with semantic embedding models like Sentence Transformers' all-MiniLM-L6-v2, which transform natural language strings into high-dimensional vectors in a semantic space, we can perform a semantic search to locate the most relevant piece of information and answer a user's question.
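The retrieval step just described reduces to a nearest-neighbor search under cosine similarity. A self-contained sketch follows, with the embedding model abstracted behind an `embed` parameter (the paper uses Sentence Transformers' all-MiniLM-L6-v2 through LangChain; this function is illustrative, not the authors' code):

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_search(query, docs, embed):
    """Return (best_doc, score) for the document most similar to the query.

    `embed` maps a string to a vector; in the paper this role is played by
    all-MiniLM-L6-v2, but any embedding function with nonzero outputs works.
    """
    qv = embed(query)
    best_score, best_doc = max((cosine_similarity(qv, embed(d)), d) for d in docs)
    return best_doc, best_score
```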
For example, following the example from the first stage, if we had the databases School Culture, Campus Life, Academic Requirements, and Research, and a user asked "What courses do I need to take for the AMS major at Emory?", the LLM would return "Academic Requirements". The system would then give the LLM a list of document tags present in Academic Requirements and ask it to pick out the document with the greatest relevance; in our case, it would return "Academic checklist for Applied Mathematics & Statistics major". It would then read over the segments of the document and execute the third stage by answering the user's question: "For an Applied Mathematics & Statistics (AMS) major at Emory, you would need to take the following courses: QTM Courses: QTM 110: Introduction to Scientific Methods (3 credits)..."
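The two selection calls in this example (database first, then tag) amount to a small routing layer. A sketch, with `ask_llm` standing in for a GPT-3.5-turbo chat call and with illustrative prompts that are not the authors' actual wording:

```python
def route(query, databases, tags_for, ask_llm):
    """Two-step tag routing: ask the LLM for a database, then for the most
    relevant document tag inside it.

    `ask_llm(prompt) -> str` is a placeholder for a chat-model call;
    `tags_for(db) -> list[str]` returns the tags stored under a database.
    """
    db = ask_llm(
        "Which database best answers the question below?\n"
        f"Databases: {', '.join(databases)}\n"
        f"Question: {query}"
    )
    tag = ask_llm(
        "Pick the most relevant document tag for the question below.\n"
        f"Tags: {', '.join(tags_for(db))}\n"
        f"Question: {query}"
    )
    return db, tag
```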
In the case that none of the retrieved documents match the user's query, we have added a subsystem that performs a Google search instead. By comparing the vector similarity obtained from the vector store search to a threshold hyperparameter, we can determine whether documents are potentially irrelevant to a given query. Should all documents fail to pass this threshold, we instead employ the Google search API and use the top results in a similar process to answer the question. With this addition, we can guarantee that the system's answers are no worse than Google's.
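The fallback logic above condenses to a few lines. In this sketch, `search_vector_store` and `google_search` are placeholder callables, and the default threshold is arbitrary; the paper treats the threshold as a tuned hyperparameter and does not report a value.

```python
def answer_context(query, search_vector_store, google_search, threshold=0.5):
    """Return ("database", texts) when any hit clears the similarity
    threshold, otherwise ("google", results) from the fallback search.

    `search_vector_store(query)` -> [(similarity, text), ...]
    `google_search(query)` -> list of result snippets
    (both are stand-ins for the real LangChain / Google API calls).
    """
    hits = search_vector_store(query)
    relevant = [text for sim, text in hits if sim >= threshold]
    if relevant:
        return "database", relevant
    return "google", google_search(query)
```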
The last stage deals with the hot-swapping of information for the system. When a web page is changed, we can quickly locate the information that needs to be modified in our databases by a URL or metadata search. Once the knowledge discrepancy is discovered, we simply remove the old document from the databases, reprocess the new document, and reinsert the split parts. This bypasses the need for retraining.
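The hot-swap step can be sketched as a delete-then-reinsert, with a dict of lists standing in for the MySQL tables and with hypothetical `split` and `tag` helpers for re-chunking and re-tagging the new page:

```python
def hot_swap(db, url, new_text, split, tag):
    """Replace a changed page's fragments in place.

    `db` maps group name -> list of row dicts (a stand-in for the MySQL
    tables in the paper); `split(text)` re-chunks a document and
    `tag(text)` produces its summary tag (both hypothetical helpers).
    """
    # drop every stored slice that came from the stale page
    for group in db.values():
        group[:] = [row for row in group if row["url"] != url]
    # reprocess the new document and reinsert its slices
    t = tag(new_text)
    for piece in split(new_text):
        db.setdefault(t, []).append({"url": url, "tag": t, "text": piece})
```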
III. CONCLUSION AND FUTURE DIRECTION

On a small scale, with a subset of Emory University's domain as a test subject, our system has shown some promise. It was able to accurately and quickly answer simple questions following both the database and the Google answering routines. Also, the hot swapping of information was relatively easy. However, as we enlarged the scale of our knowledge base (to a maximum of five levels of depth following from Emory University's homepage, limited to internal pages and excluding non-web-page and non-PDF resources), the accuracy of the process greatly decreased.

During pretraining, the problems manifested in the summary tags of the documents. Specifically, as the number of documents increased, GPT-3.5-turbo ceased to be able to produce tags that differentiated the documents enough to be used in QA. For example, of the 1221 resources that we extracted from the Nursing School's subdomain, 55 (4.5%) had the same exact tag of "Nursing School Information". Although this doesn't affect the sorting of information into databases, the information becomes effectively useless under the current configuration of the system because the second stage uses a tag search to identify the relevant document. One of the main reasons we originally chose a tagging system was to speed up the document matching process for QA, but evidently doing so has hurt the accuracy greatly.

Additionally, compared to the manual regrouping of tagged documents, there may be more optimal regrouping methods via clustering algorithms. The structure of the website itself could also act as a guideline for grouping. During testing, we found that controlling the number of entries in each vector store was essential to ensuring the generation of an accurate and relevant answer, but we have yet to ascertain the exact figure or range of document volume per database for maximum search efficiency in terms of time and accuracy. As such, further research should aim to find the optimal balance of database size and accuracy.

REFERENCES

[1] Y. Qin et al., "Tool Learning with Foundation Models," arXiv preprint arXiv:2304.08354, Jun. 2023. [Online]. Available: https://arxiv.org/abs/2304.08354
[2] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of Hallucination in Natural Language Generation," ACM Computing Surveys, vol. 55, no. 12, Article 248, pp. 1-38, 2023.
