
All You Need to Know to Build Your First LLM App | by Dominik Polze... https://towardsdatascience.com/all-you-need-to-know-to-build-your-firs...

Table of Contents

If you are just looking for a short tutorial that explains how to build a simple LLM
application, you can skip to section "6. Creating a Vector Store", which contains all the
code snippets you need to build a minimalistic LLM app with a vector store, a prompt
template and an LLM call.

Intro
Why we need LLMs
Fine-Tuning vs. Context Injection
What is LangChain?
Step-by-Step Tutorial
1. Load documents using LangChain
2. Split our Documents into Text Chunks
3. From Text Chunks to Embeddings
4. Define the LLM you want to use
5. Define our Prompt Template
6. Creating a Vector Store

Build your own chatbot with context injection — Image by the author


Why we need LLMs


The evolution of language has brought us humans incredibly far. It
enables us to efficiently share knowledge and collaborate in the form we know
today. Consequently, most of our collective knowledge continues to be preserved
and communicated through unorganized written texts.


Initiatives undertaken over the past two decades to digitize information and
processes have often focused on accumulating more and more data in relational
databases. This approach enables traditional analytical machine learning
algorithms to process and understand our data.

However, despite our extensive efforts to store an increasing amount of data in a
structured manner, we are still unable to capture and process the entirety of our
knowledge.

About 80% of all data in companies is unstructured,
like work descriptions, resumes, emails, text
documents, PowerPoint slides, voice recordings,
videos and social media

Distribution of data in companies — Image by the author

The developments leading up to GPT3.5 signify a major milestone, as they empower us
to effectively interpret and analyze diverse datasets, regardless of their
structure or lack thereof. Nowadays, we have models that can comprehend and
generate various forms of content, including text, images, and audio files.


So how can we leverage their capabilities for our needs and data?

Fine-Tuning vs. Context Injection


In general, we have two fundamentally different approaches to enable large
language models to answer questions that they cannot answer from their training
data alone: model fine-tuning and context injection.

Fine-Tuning
Fine-tuning refers to training an existing language model with additional data to
optimise it for a specific task.

Instead of training a language model from scratch, a pre-trained model such as
BERT or LLaMA is used and then adapted to the needs of a specific task by adding
use-case-specific training data.

A team from Stanford University took the LLM LLaMA and fine-tuned it using
50,000 examples of what a user/model interaction could look like. The result is a
chatbot that interacts with a user and answers queries. This fine-tuning step changed
the way the model interacts with the end user.

→ Misconceptions around fine-tuning

Fine-tuning of PLLMs (Pre-trained Language Models) is a way to adjust the model
for a specific task, but it doesn't really allow you to inject your own domain
knowledge into the model. This is because the model has already been trained on a
massive amount of general language data, and your specific domain data is usually
not enough to override what the model has already learned.

So, when you fine-tune the model, it might occasionally provide correct answers,
but it will often fail because it heavily relies on the information it learned during
pre-training, which might not be accurate or relevant to your specific task. In other
words, fine-tuning helps the model adapt to HOW it communicates, but not
necessarily WHAT it communicates. (Porsche AG, 2023)

This is where context injection comes into play.

In-context learning / Context Injection


When using context injection, we do not modify the LLM; we focus on the
prompt itself and inject the relevant context into the prompt.

So we need to think about how to provide the prompt with the right information. In
the figure below, you can see schematically how the whole thing works. We need a
process that is able to identify the most relevant data. To do this, we need to enable
our computer to compare text snippets with each other.

Similarity search in our unstructured data — Image by the author

This can be done with embeddings. With embeddings, we translate text into vectors,
allowing us to represent text in a multidimensional embedding space. Points that
are closer to each other in that space were often used in similar contexts. To prevent
this similarity search from taking forever, we store our vectors in a vector database
and index them.
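To make the idea of "closeness" in the embedding space concrete, here is a minimal, dependency-free sketch of cosine similarity, the metric most vector stores use to rank text chunks. The three-dimensional vectors below are made up purely for illustration; real embedding models return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embedded text chunks (illustrative values only)
query     = [0.9, 0.1, 0.0]
chunk_gpt = [0.8, 0.2, 0.1]   # chunk about a similar topic
chunk_cat = [0.0, 0.1, 0.9]   # chunk about something unrelated

print(cosine_similarity(query, chunk_gpt))  # high -> similar context
print(cosine_similarity(query, chunk_cat))  # low  -> dissimilar context
```

A vector database indexes these vectors (e.g. with approximate nearest-neighbor structures) so that it does not have to compare the query against every stored chunk one by one.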

Microsoft is showing us how this could work with Bing Chat. Bing combines the ability of
LLMs to understand language and context with the efficiency of traditional web search.

The objective of the article is to demonstrate the process of creating a
straightforward solution that allows us to analyse our own texts and documents, and
then incorporate the insights gained from them into the answers our solution
returns to the user. I will describe all the steps and components you need to implement
an end-to-end solution.

So how can we use the capabilities of LLMs to meet our needs? Let’s go through it
step by step.

Step-by-Step Tutorial — Your First LLM App


In the following, we want to utilize LLMs to respond to inquiries about our personal
data. To accomplish this, I begin by transferring the content of our personal data
into a vector database. This step is crucial as it enables us to efficiently search for
relevant sections within the text. We will use this information from our data, together
with the LLM's ability to interpret text, to answer the user's question.

We can also guide the chatbot to exclusively answer questions based on the data we
provide. This way, we can ensure that the chatbot remains focused on the data at
hand and provides accurate and relevant responses.

To implement our use case, we will rely heavily on LangChain.

What is LangChain?
“LangChain is a framework for developing applications powered by language
models.” (LangChain, 2023)

Thus, LangChain is a Python framework that was designed to support the creation
of various LLM applications such as chatbots, summary tools, and basically any tool
you want to create to leverage the power of LLMs. The library combines various
components we will need. We can connect these components in so-called chains.

The most important modules of LangChain are (LangChain, 2023):

1. Models: Interfaces to various model types

2. Prompts: Prompt management, prompt optimization, and prompt serialization


3. Indexes: Document loaders, text splitters, vector stores — Enable faster and
more efficient access to the data

4. Chains: Chains go beyond a single LLM call, they allow us to set up sequences of
calls

In the image below, you can see where these components come into play. We load
and process our own unstructured data using the document loaders and text
splitters from the indexes module. The prompts module allows us to inject the
found content into our prompt template, and finally, we send the prompt to
our model using the models module.
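As a plain-Python sketch of what the prompts module does conceptually — independent of LangChain's actual API — context injection boils down to filling retrieved text chunks into a template string. The template wording and variable names here are my own illustration, not LangChain's:

```python
# A minimal prompt template; the placeholder names are illustrative
PROMPT_TEMPLATE = """Answer the question based only on the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, relevant_chunks):
    # Join the chunks found via similarity search into one context block
    context = "\n---\n".join(relevant_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

chunks = ["GPT-4 was released in March 2023.",
          "GPT-4 accepts image and text input."]
print(build_prompt("When was GPT-4 released?", chunks))
```

LangChain's PromptTemplate wraps exactly this kind of string substitution, plus serialization and validation of the input variables.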


Components you need for your LLM app — Image by the author

5. Agents: Agents are entities that use LLMs to make choices regarding which
actions to take. After taking an action, they observe the outcome of that action and
repeat the process until their task is completed.
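The act-observe-repeat loop described above can be sketched in a few lines of plain Python. The stubbed fake_llm function stands in for a real LLM call, and the tool names are invented for illustration:

```python
def run_agent(task, decide, tools, max_steps=5):
    """Repeatedly ask the LLM (here: `decide`) which action to take,
    run it, feed the observation back, and stop when it says 'finish'."""
    observations = []
    for _ in range(max_steps):
        action, arg = decide(task, observations)    # LLM picks next action
        if action == "finish":
            return arg                              # final answer
        result = tools[action](arg)                 # take the action
        observations.append((action, arg, result))  # observe the outcome
    return None

# Stub standing in for an LLM: search once, then answer from the observation
def fake_llm(task, observations):
    if not observations:
        return ("search", task)
    return ("finish", observations[-1][2])

tools = {"search": lambda q: f"Top result for '{q}'"}
print(run_agent("GPT-4 release date", fake_llm, tools))
```

Real LangChain agents follow the same shape, but the decision step is a genuine LLM call that reasons over the tool descriptions and previous observations.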


Agents decide autonomously how to perform a particular task — Image by the author

We use LangChain in the first step to load documents, analyse them and make them
efficiently searchable. Once we have indexed the text, it becomes much more
efficient to recognize the text snippets that are relevant for answering the user's
questions.


What we need for our simple application is, of course, an LLM. We will use GPT3.5
via the OpenAI API. Then we need a vector store that allows us to feed the LLM with
our own data. And if we want to perform different actions for different queries, we
need an agent that decides what should happen for each query.

Let’s start from the beginning. We first need to import our own documents.

The following section describes which modules LangChain's loader module includes
to load different types of documents from different sources.

1. Load documents using LangChain


LangChain is able to load a number of documents from a wide variety of sources.
You can find a list of possible document loaders in the LangChain documentation.
Among them are loaders for HTML pages, S3 buckets, PDFs, Notion, Google Drive
and many more.

For our simple example, we use data that was probably not included in the training
data of GPT3.5. I use the Wikipedia article about GPT4 because I assume that GPT3.5
has limited knowledge about GPT4.

For this minimal example, I'm not using any of the LangChain loaders; I'm just
scraping the text directly from Wikipedia [License: CC BY-SA 3.0] using
BeautifulSoup.

Please note that scraping websites should only be done in accordance with the website’s
terms of use and the copyright/license status of the text and data you wish to use.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/GPT-4"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# find all the text on the page
text = soup.get_text()

# find the content div
content_div = soup.find('div', {'class': 'mw-parser-output'})

# remove unwanted elements from div
unwanted_tags = ['sup', 'span', 'table', 'ul', 'ol']
for tag in unwanted_tags:
    for match in content_div.findAll(tag):
        match.extract()

print(content_div.get_text())

2. Split our Documents into Text Chunks

Next, we must divide the text into smaller sections called text chunks. Each text
chunk represents a data point in the embedding space, allowing the computer to
determine the similarity between these chunks.

The following text snippet utilizes the text splitter module from LangChain. In
this particular case, we specify a chunk size of 100 and a chunk overlap of 20. It's
common to use larger text chunks, but you can experiment a bit to find the optimal
size for your use case. You just need to remember that every LLM has a token limit
(4,000 tokens for GPT3.5). Since we are inserting the text blocks into our prompt, we
need to make sure that the entire prompt is no larger than 4,000 tokens.

from langchain.text_splitter import RecursiveCharacterTextSplitter

article_text = content_div.get_text()

text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size = 100,
chunk_overlap = 20,
length_function = len,
)

texts = text_splitter.create_documents([article_text])
print(texts[0])
print(texts[1])

This splits our entire text as follows:
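To see mechanically what chunk_size and chunk_overlap mean, here is a simplified, character-based sketch. LangChain's RecursiveCharacterTextSplitter is smarter — it tries to split on paragraph and sentence boundaries first — so treat this only as an illustration of the sliding window:

```python
def naive_split(text, chunk_size=100, chunk_overlap=20):
    # Slide a window of `chunk_size` characters over the text,
    # stepping forward by (chunk_size - chunk_overlap) each time,
    # so consecutive chunks share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = naive_split("a" * 250, chunk_size=100, chunk_overlap=20)
print(len(chunks))      # → 4 (starts at 0, 80, 160, 240)
print(len(chunks[0]))   # → 100
```

The overlap ensures that a sentence cut at a chunk boundary still appears intact in at least one chunk, which improves recall at retrieval time.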
