How To Build Your Own Custom ChatGPT Bot With Custom Knowledge Base - Better Programming
Published in Better Programming

ChatGPT has become an integral tool that most people use daily to automate various tasks. If you have used ChatGPT for any length of time, you will have realized that it can provide wrong answers and has limited to zero context on some niche topics. This brings up the question of how we can tap into ChatGPT to bridge the gap and allow ChatGPT to have more custom data.

A wealth of knowledge is distributed across the platforms we interact with daily, i.e., Confluence wiki pages at work, Slack groups, company knowledge bases, Reddit, Stack Overflow, books, newsletters, and Google documents shared by colleagues. Keeping up with all these information sources is a full-time job in itself.
By Timothy Mugayi
Custom Knowledge Base ChatGPT

Step-by-step guide on how to feed your ChatGPT bot with custom data sources

Let's explore how we can extend ChatGPT manually and what the issues are. The conventional approach to extending ChatGPT is via prompt engineering. This is quite simple to do since ChatGPT is context-aware: we interact with ChatGPT by appending the original document content before the actual questions.
The issue with this approach is the model has a limited context; it can only accept
approximately 4,097 tokens for GPT-3. You will soon run into a wall with this
approach as it's also quite a manual, tedious process to always have to paste in the
content.
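To make the limitation concrete, here is a minimal sketch of the copy-paste approach. The helper names and the words-per-token heuristic are illustrative, not from the article; use a proper tokenizer such as tiktoken for exact counts.

```python
def build_prompt(document: str, question: str) -> str:
    # Prepend the source document so ChatGPT can answer from it
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{document}\n\n"
        f"Question: {question}"
    )

def rough_token_count(text: str) -> int:
    # Crude heuristic: one token is roughly 0.75 English words
    return int(len(text.split()) / 0.75)

def fits_gpt3_context(prompt: str, limit: int = 4097) -> bool:
    return rough_token_count(prompt) <= limit

doc = "The VPN password rotates every 90 days."
prompt = build_prompt(doc, "How often does the VPN password rotate?")
print(fits_gpt3_context(prompt))            # a one-line doc fits easily
print(fits_gpt3_context("word " * 50_000))  # a book-sized doc does not
```

Once your pasted content approaches the 4,097-token ceiling, this approach simply stops working, which is the wall described above.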
Photo by Christian Wiediger on Unsplash

Imagine having hundreds of PDF documents you wanted to inject into ChatGPT. You will soon run into paywall issues. You might be thinking GPT-4 is the successor to GPT-3. It was just launched on March 14, 2023, and it can process 25,000 words (about eight times as many as GPT-3), process images, and handle much more nuanced instructions than GPT-3.5. GPT-4 still has the same fundamental problem of data input limitation. How do we go about bypassing some of these limitations? We can leverage a Python library called LlamaIndex.
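The core idea behind an index-based workaround is retrieval: embed your document chunks once, then at query time fetch only the chunks most similar to the question instead of pasting everything into the prompt. A toy illustration of retrieval by cosine similarity (the vectors and chunk names are made up; real embeddings come from an embedding model):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings" standing in for real model output
chunks = {
    "vpn policy": [0.9, 0.1, 0.0],
    "holiday schedule": [0.1, 0.8, 0.3],
    "expense process": [0.0, 0.2, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of "how do I set up the VPN?"

# Pick the chunk closest to the query; only this chunk would be
# sent to the model alongside the question
best = max(chunks, key=lambda name: cosine(chunks[name], query_vec))
print(best)  # vpn policy
```

Only the retrieved chunk plus the question needs to fit in the context window, which is how the token ceiling is sidestepped.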
Next, we'll import the libraries in Python and set up your OpenAI API key in a new main.py file.

import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from llama_index import GPTSimpleVectorIndex, GoogleDocsReader

os.environ['OPENAI_API_KEY'] = 'SET-YOUR-OPEN-AI-API-KEY'

In the above snippet, we are explicitly setting the environment variable for clarity, as the LlamaIndex package implicitly requires access to OpenAI. In a typical production environment, you can put your keys in environment variables, a vault, or whatever secrets management service your infra can access.

Let's construct a function to help us authenticate against our Google account to discover Google Docs.
def authorize_gdocs():
    google_oauth2_scopes = [
        "https://www.googleapis.com/auth/documents.readonly"
    ]
    cred = None
    # Reuse the cached token from a previous run if it exists
    if os.path.exists("token.pickle"):
        with open("token.pickle", 'rb') as token:
            cred = pickle.load(token)
    if not cred or not cred.valid:
        if cred and cred.expired and cred.refresh_token:
            cred.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                "credentials.json", google_oauth2_scopes
            )
            cred = flow.run_local_server(port=0)
        # Cache the credential for subsequent runs
        with open("token.pickle", 'wb') as token:
            pickle.dump(cred, token)
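The token.pickle handling above is a generic cache-or-refresh pattern that is worth understanding in isolation. A stripped-down, Google-free sketch (the FakeCredential class is purely illustrative and not part of any Google library):

```python
import os
import pickle
import tempfile

class FakeCredential:
    # Stand-in for a google.oauth2 credential object
    def __init__(self, token: str, valid: bool = True):
        self.token = token
        self.valid = valid

def load_or_create(path: str) -> FakeCredential:
    # Reuse the pickled credential when present and still valid,
    # mirroring the token.pickle logic in authorize_gdocs
    if os.path.exists(path):
        with open(path, "rb") as fh:
            cred = pickle.load(fh)
        if cred.valid:
            return cred
    cred = FakeCredential(token="fresh-token")  # real code runs the OAuth flow here
    with open(path, "wb") as fh:
        pickle.dump(cred, fh)
    return cred

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "token.pickle")
    first = load_or_create(path)   # runs the "flow" and writes the cache
    second = load_or_create(path)  # served straight from the cache
    print(first.token, second.token)  # fresh-token fresh-token
```

The practical consequence: you only see the browser consent screen on the first run; subsequent runs are silent until the token expires.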
To enable the Google Docs API and fetch the credentials in the Google Console, you can follow these steps:

1. Go to the Google Cloud Console and sign in with your Google account.
2. Create a new project if you haven't already. You can do this by clicking on the "Select a project" dropdown menu in the top navigation bar and selecting "New Project."
3. Once your project is created, select it from the dropdown menu in the top navigation bar.
4. Go to the "APIs & Services" section from the left-hand menu and click on the "+ ENABLE APIS AND SERVICES" button at the top of the page.
5. Search for "Google Docs API" in the search bar and select it from the results list.
6. Click the "Enable" button to enable the API for your project.
7. Click on the OAuth consent screen menu, create and give your app a name, e.g., "mychatbot," then enter the support email, save, and add scopes.

You must also add test users since this Google app will not be approved yet. This can be your own email.

You will then need to set up credentials for your project to use the API. To do this, go to the "Credentials" section from the left-hand menu and click "Create Credentials." Select "OAuth client ID" and follow the prompts to set up your credentials.

Example folder structure with google credentials in root

Once you have set up your credentials, you can access the Google Docs API from your Python project.

Go to your Google Docs, open up a few of them, and get the unique ID that can be seen in your browser URL bar, as illustrated below:
Gdoc ID
Copy out the gdoc IDs and paste them into your code below. You can have N number
of gdocs that you can index so ChatGPT has context access to your custom
knowledge base. We will use the GoogleDocsReader plugin from the LlamaIndex
library to load your documents.
# ids of the Google Docs you want indexed (placeholders)
gdoc_ids = ['google-doc-id-1', 'google-doc-id-2']

loader = GoogleDocsReader()
# load gdocs and index them
documents = loader.load_data(document_ids=gdoc_ids)
index = GPTSimpleVectorIndex(documents)
We will interact directly with vanilla ChatGPT first to see what output it generates without injecting a custom data source. That was a little disappointing! Let's try again.

If you wish to save and load the index on the fly, you can use the following function calls. This will speed up the process of fetching from pre-saved indexes instead of making API calls to external sources.
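The actual calls were lost in formatting. At the time of writing, the llama_index API exposed disk persistence along these lines; verify the method names against your installed version, as later releases reworked this API:

```
# persist the index so later runs skip re-embedding
index.save_to_disk('index.json')

# restore it without calling external APIs again
index = GPTSimpleVectorIndex.load_from_disk('index.json')
```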
Querying the index and getting a response can be achieved by running the code below. The code can easily be extended into a REST API that connects to a UI where you can interact with your custom data sources via the GPT interface.

# Querying the index
while True:
    prompt = input("Type prompt...")
    response = index.query(prompt)
    print(response)

INFO:google_auth_oauthlib.flow:"GET /?state=oz9XY8CE3LaLLsTxIz4sDgrHha4fEJ&code
INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2cl
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 175 token
Type prompt...who is timothy mugayi hint he is a writer on medium

INFO:root:> [query] Total LLM token usage: 300 tokens
INFO:root:> [query] Total embedding token usage: 14 tokens
Timothy Mugayi is an Engineering Manager at OVO (PT Visionet Internasional), a
last_token_usage=300
Type prompt...

Given we have a Google Doc with details about me, the answer draws on information that's readily available if you publicly search on Google.
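As a hedged sketch of the REST extension mentioned above, here is a standard-library-only endpoint with a stub standing in for the real index (swap StubIndex for the GPTSimpleVectorIndex built earlier; all names here are illustrative):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import urlopen

class StubIndex:
    # Stand-in for the real index so the sketch runs offline
    def query(self, prompt: str) -> str:
        return f"echo: {prompt}"

index = StubIndex()

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read ?prompt=... from the query string and answer as JSON
        prompt = parse_qs(urlparse(self.path).query).get("prompt", [""])[0]
        body = json.dumps({"response": index.query(prompt)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), QueryHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/?" + urlencode({"prompt": "hello"})
reply = json.loads(urlopen(url).read())["response"]
print(reply)  # echo: hello
server.shutdown()
```

A production version would use a proper framework, authentication, and POST bodies, but the shape is the same: one handler that forwards the prompt to index.query.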
Type prompt...Given you know who timothy mugayi is write an interesting introdu

Dear [Hiring Manager],

I am writing to apply for the Python project to build a custom ChatGPT bot with
I am confident that I can deliver a high-quality product that meets the require

Thank you for your time and consideration.

Sincerely,
Timothy Mugayi

last_token_usage=436
Type prompt...

LlamaIndex will internally accept your prompt, search the index for pertinent chunks, and then pass both your prompt and the pertinent chunks to the ChatGPT model. The procedures above demonstrate a fundamental first use of LlamaIndex and GPT for answering questions. Yet, there is much more you can do. You are only limited by your creativity when configuring LlamaIndex to utilize a different large language model (LLM), using a different type of index for various activities, or updating old indices with a new index programmatically.

Here is an example of changing the LLM model explicitly. This time, we tap into another Python package that comes bundled with LlamaIndex, called langchain.

from langchain import OpenAI
from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper
...

index = GPTSimpleVectorIndex(
    documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
)

Creating an index, inserting into an index, and querying an index will all use tokens. Hence, it's always important to ensure you output token usage for tracking purposes when building your custom bots.

last_token_usage = index.llm_predictor.last_token_usage
print(f"last_token_usage={last_token_usage}")

If you want to keep tabs on your OpenAI free or paid credits, you can navigate to the OpenAI dashboard and check how much credit is left.

Final Thoughts

ChatGPT combined with LlamaIndex can help build a customized chatbot that can infer knowledge based on its own document sources. While ChatGPT and other LLMs are pretty powerful, extending the LLM provides a much more refined experience and opens up the possibility of building a conversational-style chatbot that can serve real business use cases like customer support assistance or even spam classifiers. Given we can feed it real-time data, we can mitigate some of the limitations of ChatGPT models being trained only up to a certain period.
For the complete source code, you can refer to this GitHub repo.
If you are looking to build custom ChatGPT bots that understand your domain, drop a
message in the comments section and let's connect.