A General Approach To Website Question Answering With Large Language Models
SoutheastCon 2024
Yilang Ding, Jiawei Nie, Di Wu, Chang Liu
Emory University
[email protected], [email protected], [email protected], [email protected]
Fig. 2: Runtime Flowchart
Abstract—Language Models (LMs), in their most basic form, perform just like any other machine learning model: they produce interpolations and extrapolations based on their training distribution. Although recent models such as OpenAI's GPT-4 have demonstrated unprecedented capabilities in absorbing the copious volumes of information in their training data, their ability to consistently reproduce factual information remains unproven. Additionally, LMs on their own lack the ability to keep up to date with real-life data without frequent fine-tuning. These drawbacks effectively render LMs unserviceable in Question Answering (QA) scenarios where they must respond to queries regarding volatile information. Retrieval Augmented Generation (RAG) and Tool Learning [1] were proposed as solutions to these problems, and with the development and usage of associated libraries, the aforementioned problems can be […]

[…] (1) as pretraining, collect and store knowledge in a semi-structured way; (2) while deployed, search for and select the pieces of information most relevant to a user's query; (3) incorporate the information obtained from (2) and the capabilities of the LLM to answer the user's query in natural language; (4) should the knowledge base change, identify the discrepancy and swap out the information.

A. Pretraining Stage

In the first stage, we acquire the sitemap of a target domain. This yields an understanding of the domain's structure and helps locate any information that may be used during QA. Once all relevant URLs are identified, we go […]
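As a minimal sketch of this acquisition step, assuming the target site publishes a standard sitemap.xml (the example URLs, the file-type filter, and the function name are illustrative assumptions, not the authors' implementation):

```python
from xml.etree import ElementTree

# XML namespace used by the standard sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(sitemap_xml: str, domain: str) -> list[str]:
    """Pull every <loc> entry out of a sitemap, keeping only internal
    URLs and dropping resources that are neither web pages nor PDFs."""
    root = ElementTree.fromstring(sitemap_xml)
    urls = []
    for loc in root.iterfind(".//sm:loc", NS):
        url = loc.text.strip()
        if domain not in url:
            continue  # external link: skip
        if url.endswith((".jpg", ".png", ".zip")):
            continue  # non-page, non-PDF resource: skip
        urls.append(url)
    return urls

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.emory.edu/academics.html</loc></url>
  <url><loc>https://www.emory.edu/checklist.pdf</loc></url>
  <url><loc>https://www.emory.edu/logo.png</loc></url>
  <url><loc>https://other.org/page.html</loc></url>
</urlset>"""

print(extract_urls(sitemap, "emory.edu"))
```

In practice the sitemap would be fetched over HTTP and the crawl bounded by a depth limit, as the evaluation in Section III describes.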
SoutheastCon 2024 | 979-8-3503-1710-7/24/$31.00 ©2024 IEEE | DOI: 10.1109/SOUTHEASTCON52093.2024.10500166
Fig. 1: Example Database after First Stage

[…] the LLM and ask it to select a database for us to search from. We then search the database for relevant pieces of information and insert them into a LangChain vector store: a data structure that stores information in a vector space and enables vector similarity searches via distance metrics such as cosine similarity and Euclidean distance. Combined with semantic embedding models such as Sentence Transformers' all-MiniLM-L6-v2, which transform natural language strings into high-dimensional vectors in a semantic space, we can perform semantic search to locate the most relevant piece of information and answer a user's question. For example, following the example from the first stage, if we had the databases School Culture, Campus Life, Academic Requirements, and Research, and a user asked "What courses do I need to take for the AMS major at Emory?", the LLM would return "Academic Requirements". The system would then give the LLM a list of document tags present in Academic Requirements and ask it to pick out the document with the greatest relevance; in our case, it would return "Academic checklist for Applied Mathematics & Statistics major". It would then read over the segments of the document and execute the third stage by answering the user's question: "For an Applied Mathematics & Statistics (AMS) major at Emory, you would need to take the following courses: QTM Courses: - QTM 110: Introduction to Scientific Methods (3 credits)..."

In the case that none of the retrieved documents match the user's query, we have added a subsystem that performs a Google search instead. By comparing the vector similarity obtained from the vector store search against a threshold hyperparameter, we can determine whether documents are potentially irrelevant to a given query. Should all documents fail to pass this threshold, we instead employ the Google Search API and use the top results in a similar process to answer the question. With this addition, we can guarantee that the system's answers are no worse than Google's.

The last stage deals with the hot-swapping of information for the system. When a web page is changed, we can quickly locate the information that needs to be modified in our databases by URL or metadata search. Once the knowledge discrepancy is discovered, we can simply remove the old document from the databases, reprocess the new document, and reinsert the split parts. This bypasses the need for retraining.

III. CONCLUSION AND FUTURE DIRECTION

On a small scale, with a subset of Emory University's domain as a test subject, our system has shown some promise. It was able to accurately and quickly answer simple questions following both the database and the Google answering routines. The hot swapping of information was also relatively easy. However, as we enlarged the scale of our knowledge base (to a maximum of five levels of depth starting from Emory University's homepage, limited to internal pages and excluding non-web-page and non-PDF resources), the accuracy of the process greatly decreased.

During pretraining, the problems manifested in the summary tags of the documents. Specifically, as the number of documents increased, GPT-3.5-turbo ceased to be able to produce tags that differentiated the documents enough to be used in QA. For example, of the 1221 resources that we extracted from the Nursing School's subdomain, 55 (4.5%) had the exact same tag of "Nursing School Information". Although this does not affect the sorting of information into databases, the information becomes effectively useless under the current configuration of the system because the second stage uses a tag search to identify the relevant document. One of the main reasons we originally chose a tagging system was to speed up the document matching process for QA, but doing so has evidently hurt accuracy greatly.

Additionally, compared to the manual regrouping of tagged documents, there may be more optimal regrouping methods via clustering algorithms. The structure of the website itself could also act as a guideline for grouping. During testing, we found that controlling the number of entries in each vector store was essential to ensuring the generation of an accurate and relevant answer, but we have yet to ascertain the exact figure or range of document volume per database for maximum search efficiency in terms of time and accuracy. As such, further research should aim to find the optimal balance of database size and accuracy.
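The second-stage retrieval and its Google fallback reduce to a cosine-similarity search compared against a threshold. A minimal sketch, using toy bag-of-words vectors in place of the all-MiniLM-L6-v2 embeddings; the threshold value (0.3) and the function names are illustrative assumptions, not the paper's implementation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the paper uses Sentence
    Transformers' all-MiniLM-L6-v2 semantic embeddings instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_route(query: str, documents: list[str], threshold: float = 0.3):
    """Return the best-matching document if it passes the similarity
    threshold; otherwise signal a fallback to Google search."""
    q = embed(query)
    best = max(documents, key=lambda d: cosine(q, embed(d)))
    if cosine(q, embed(best)) < threshold:
        return ("google_search", None)  # all documents deemed irrelevant
    return ("vector_store", best)

docs = [
    "academic checklist for applied mathematics and statistics major",
    "campus life housing and dining information",
]
print(answer_route("courses for the applied mathematics major", docs))
print(answer_route("weather in atlanta tomorrow", docs))
```

In the real system the vector store performs this search and returns the scores directly; only the threshold comparison that decides between the database and Google routes carries over from this sketch.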
REFERENCES
[1] Y. Qin et al., "Tool Learning with Foundation Models," arXiv preprint arXiv:2304.08354, Jun. 2023. [Online]. Available: https://arxiv.org/abs/2304.08354
[2] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of Hallucination in Natural Language Generation," ACM Computing Surveys, vol. 55, no. 12, pp. 1-38, Article 248, 2023.