Vectorstores
Table of Contents
If you are just looking for a short tutorial that explains how to build a simple LLM
application, you can skip to section "6. Creating a Vector store", where you will find all the
code snippets you need to build a minimalistic LLM app with a vector store, a prompt
template, and an LLM call.
Intro
Step-by-Step Tutorial
Build your own chatbot with context injection — Image by the author
2 of 49 18/07/2023, 10:42
All You Need to Know to Build Your First LLM App | by Dominik Polze... https://towardsdatascience.com/all-you-need-to-know-to-build-your-firs...
Initiatives undertaken over the past two decades to digitize information and
processes have often focused on accumulating more and more data in relational
databases. This approach enables traditional analytical machine learning
algorithms to process and understand our data.
Fine-Tuning
Fine-tuning refers to training an existing language model with additional data to
optimise it for a specific task.
A team from Stanford University took the LLM LLaMA and fine-tuned it using
50,000 examples of what a user/model interaction could look like. The result is a
chatbot that interacts with a user and answers queries. This fine-tuning step changed
the way the model interacts with the end user.
So, when you fine-tune the model, it might occasionally provide correct answers,
but it will often fail because it heavily relies on the information it learned during
pre-training, which might not be accurate or relevant to your specific task. In other
words, fine-tuning helps the model adapt to HOW it communicates, but not
necessarily to WHAT it communicates.
So we need to think about how to provide the prompt with the right information. In
the figure below, you can see schematically how the whole thing works. We need a
process that is able to identify the most relevant data. To do this, we need to enable
our computer to compare text snippets with each other.
This can be done with embeddings. With embeddings, we translate text into vectors,
allowing us to represent text in a multidimensional embedding space. Points that
are closer to each other in space are often used in the same context. To prevent this
similarity search from taking forever, we store our vectors in a vector database and
index them.
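The "closeness" between two embedded snippets is typically measured with cosine similarity. Here is a minimal sketch in plain Python; the three-dimensional vectors are invented for illustration, since real embedding models return hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings", invented for illustration
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
car = [0.1, 0.9, 0.4]

print(cosine_similarity(cat, kitten))  # close to 1.0 -> similar context
print(cosine_similarity(cat, car))     # noticeably smaller
```

A vector database performs essentially this comparison, but with an index structure that avoids comparing the query against every stored vector.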
Microsoft is showing us how this could work with Bing Chat. Bing combines the ability of
LLMs to understand language and context with the efficiency of traditional web search.
6 of 49 18/07/2023, 10:42
All You Need to Know to Build Your First LLM App | by Dominik Polze... https://towardsdatascience.com/all-you-need-to-know-to-build-your-firs...
In this article, I will show you a straightforward solution that allows us to analyse our own texts and documents, and
then incorporate the insights gained from them into the answers our solution
returns to the user. I will describe all the steps and components you need to implement
an end-to-end solution.
So how can we use the capabilities of LLMs to meet our needs? Let’s go through it
step by step.
We can also guide the chatbot to exclusively answer questions based on the data we
provide. This way, we can ensure that the chatbot remains focused on the data at
hand and provides accurate and relevant responses.
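Restricting the model to the provided data usually comes down to the wording of the prompt. A sketch of such a template in plain Python; the exact wording is my own, not taken from any library:

```python
# Hypothetical prompt template that injects retrieved context and
# instructs the model to refuse questions it cannot answer from it.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not contained in the context, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    # Fill the placeholders with the retrieved snippets and the user query
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    context="GPT-4 was released in March 2023.",
    question="When was GPT-4 released?",
)
print(prompt)
```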
What is LangChain?
“LangChain is a framework for developing applications powered by language
models.” (Langchain, 2023)
Thus, LangChain is a Python framework that was designed to support the creation
of various LLM applications such as chatbots, summary tools, and basically any tool
you want to create to leverage the power of LLMs. The library combines various
components we will need. We can connect these components in so-called chains.
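To illustrate the chain idea without calling a real model, here is a dependency-free sketch in which each step's output feeds the next; the fake_llm function stands in for an actual LLM call and is invented for this example:

```python
# Conceptual sketch of a "chain": prompt template -> LLM -> post-processing.
def format_prompt(question: str) -> str:
    return f"Answer briefly: {question}"

def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g. via an API)
    return f"[model answer to: {prompt}]"

def postprocess(answer: str) -> str:
    return answer.strip()

def run_chain(question: str, steps) -> str:
    # Pass the value through each step in sequence
    value = question
    for step in steps:
        value = step(value)
    return value

result = run_chain("What is LangChain?", [format_prompt, fake_llm, postprocess])
print(result)
```

LangChain's chain classes wrap exactly this kind of sequential composition, together with the bookkeeping around prompts and model calls.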
1. Models: Various model types and model integrations
2. Prompts: Prompt management, prompt optimisation, and serialisation
3. Indexes: Document loaders, text splitters, vector stores — enable faster and
more efficient access to the data
4. Chains: Chains go beyond a single LLM call; they allow us to set up sequences of
calls
In the image below, you can see where these components come into play. We load
and process our own unstructured data using the document loaders and text
splitters from the indexes module. The prompts module allows us to inject the
found content into our prompt template, and finally, we send the prompt to
our model using the models module.
Components you need for your LLM app — Image by the author
5. Agents: Agents are entities that use LLMs to make choices regarding which
actions to take. After taking an action, they observe the outcome of that action and
repeat the process until their task is completed.
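The decide-act-observe loop described above can be sketched without any framework; here the decide function plays the role of the LLM and is invented for illustration:

```python
# Minimal agent loop: decide on an action, execute it, observe the result,
# and repeat until the task is completed.
def decide(observation: str) -> str:
    # Stand-in for an LLM choosing the next action based on what it has seen
    if "result" in observation:
        return "finish"
    return "search"

def execute(action: str) -> str:
    # Stand-in for a tool the agent can use (e.g. a search engine)
    return "result: found relevant documents"

def run_agent(task: str, max_steps: int = 5) -> str:
    observation = task
    for _ in range(max_steps):
        action = decide(observation)
        if action == "finish":
            return observation
        observation = execute(action)
    return observation

final = run_agent("find information about GPT-4")
print(final)
```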
Agents decide autonomously how to perform a particular task — Image by the author
We use LangChain in the first step to load documents, analyse them, and make them
efficiently searchable. Once we have indexed the text, it becomes much more
efficient to identify the text snippets that are relevant for answering the user's
questions.
What we need for our simple application is, of course, an LLM. We will use GPT-3.5
via the OpenAI API. Then we need a vector store that allows us to feed the LLM with
our own data. And if we want to perform different actions for different queries, we
need an agent that decides what should happen for each query.
Let’s start from the beginning. We first need to import our own documents.
The following section describes what modules are included in LangChain’s Loader
Module to load different types of documents from different sources.
For our simple example, we use data that was probably not included in the training
data of GPT-3.5. I use the Wikipedia article about GPT-4 because I assume that GPT-3.5
has limited knowledge about GPT-4.
For this minimal example, I'm not using any of the LangChain loaders; I'm just
scraping the text directly from Wikipedia [License: CC BY-SA 3.0] using
BeautifulSoup.
Please note that scraping websites should only be done in accordance with the website’s
terms of use and the copyright/license status of the text and data you wish to use.
```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/GPT-4"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# find all the text on the page
text = soup.get_text()

# find the content div
content_div = soup.find('div', {'class': 'mw-parser-output'})

# remove unwanted elements from div
unwanted_tags = ['sup', 'span', 'table', 'ul', 'ol']
for tag in unwanted_tags:
    for match in content_div.findAll(tag):
        match.extract()

print(content_div.get_text())
article_text = content_div.get_text()
```

2. Split our document into text fragments

Next, we must divide the text into smaller sections called text chunks. Each text
chunk represents a data point in the embedding space, allowing the computer to
determine the similarity between these chunks.

The following text snippet utilizes the text splitter module from LangChain. In
this particular case, we specify a chunk size of 100 and a chunk overlap of 20. It's
common to use larger text chunks, but you can experiment a bit to find the optimal
size for your use case. You just need to remember that every LLM has a token limit
(4,000 tokens for GPT-3.5). Since we are inserting the text blocks into our prompt, we
need to make sure that the entire prompt is no larger than 4,000 tokens.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap = 20,
    length_function = len,
)

texts = text_splitter.create_documents([article_text])

print(texts[0])
print(texts[1])
```
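To make chunk_size and chunk_overlap concrete, here is a dependency-free character-based splitter. It is a simplification: LangChain's RecursiveCharacterTextSplitter additionally tries to split at paragraph, sentence, and word boundaries rather than at fixed character positions.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int):
    # Slide a window of chunk_size characters, stepping forward by
    # chunk_size - chunk_overlap so neighbouring chunks share context.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "a" * 250
chunks = split_text(sample, chunk_size=100, chunk_overlap=20)
print([len(c) for c in chunks])  # [100, 100, 90]
```

The overlap means the last 20 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen in one piece somewhere.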