Running Llama 2 on CPU Inference Locally for Document Q&A
Kenneth Leung · Towards Data Science · Jul 2023
When we host open-source models ourselves, whether on-premise or in the cloud, the dedicated compute capacity becomes a key consideration. While GPU instances may seem the most convenient choice, the costs can easily spiral out of control.
In this easy-to-follow guide, we will discover how to run quantized versions of open-source LLMs locally on CPU for retrieval-augmented generation (aka document Q&A) in Python. In particular, we will leverage the latest, highly performant Llama 2 chat model in this project.
The accompanying GitHub repo for this article can be found here.
A common method for shrinking model size is quantization, where model weights are converted from their original 16-bit floating-point values to lower-precision ones, such as 8-bit integers.
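As a rough illustration of the idea, the toy Python sketch below maps a tensor of 16-bit floats onto 8-bit integers using a single per-tensor scale factor. This is a simplified scheme for intuition only, not the exact quantization method GGML uses.

```python
import numpy as np

# Toy illustration of 8-bit quantization (simplified; not the exact GGML scheme):
# map 16-bit float weights onto 256 integer levels using one per-tensor scale.
weights_fp16 = np.random.randn(4096).astype(np.float16)

scale = float(np.abs(weights_fp16).max()) / 127                  # per-tensor scale factor
weights_int8 = np.round(weights_fp16 / scale).astype(np.int8)    # 2 bytes -> 1 byte per weight
weights_approx = weights_int8.astype(np.float32) * scale         # dequantized approximation

max_err = np.abs(weights_fp16.astype(np.float32) - weights_approx).max()
print(f"Max absolute reconstruction error: {max_err:.4f}")
```

Each weight now takes one byte instead of two, at the cost of a small reconstruction error.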
The file we will run the document Q&A on is the public 177-page 2022 annual report
of Manchester United Football Club.
Data Source: Manchester United Plc (2022). 2022 Annual Report on Form 20-F.
https://ir.manutd.com/~/media/Files/M/Manutd-IR/documents/manu-20f-2022-09-24.pdf
(CC0: Public Domain, as SEC content is public domain and free to use)
The local machine for this project has an AMD Ryzen 5 5600X 6-Core Processor
coupled with 16GB RAM (DDR4 3600). While it also has an RTX 3060 Ti GPU (8GB VRAM), it will not be used in this project since we will focus on CPU usage.
Let us now explore the software tools we will leverage in building this backend
application:
(i) LangChain
LangChain is a popular framework for developing applications powered by language
models. It provides an extensive set of integrations and data connectors, allowing us
to chain and orchestrate different modules to create advanced use cases like
chatbots, data analysis, and document Q&A.
(ii) C Transformers
C Transformers is a Python library that provides bindings for transformer models
implemented in C/C++ using the GGML library. At this point, let us first understand
what GGML is about.
Built by the team at ggml.ai, the GGML library is a tensor library designed for
machine learning, where it enables large models to be run on consumer hardware
with high performance. This is achieved through integer quantization support and
built-in optimization algorithms.
LLMs (and corresponding model type name) supported on C Transformers | Image by author
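To make this concrete, here is a minimal sketch of loading a GGML model directly with C Transformers and generating text on the CPU. The model path is an assumption based on the file we download later in this guide.

```python
from ctransformers import AutoModelForCausalLM

# Load a local GGML Llama model on CPU via the C Transformers bindings.
# The file path is an assumption (see Step 3 for the actual download).
llm = AutoModelForCausalLM.from_pretrained(
    "models/llama-2-7b-chat.ggmlv3.q8_0.bin",
    model_type="llama",   # model type name as listed in the image above
)

print(llm("What is retrieval-augmented generation?", max_new_tokens=64))
```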
(iii) Sentence-Transformers
Sentence-Transformers is a Python framework for computing dense vector representations (embeddings) of sentences and paragraphs. It enables users to compute embeddings for more than 100 languages, which can then be compared to find sentences with similar meanings.
We will use the open-source all-MiniLM-L6-v2 model for this project because it offers a good balance of speed and general-purpose embedding quality.
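For instance, a minimal sketch of embedding two sentences with this model and comparing them (the example sentences are purely illustrative):

```python
from sentence_transformers import SentenceTransformer, util

# Embed two illustrative sentences with all-MiniLM-L6-v2 (384-dimensional vectors)
# and compare them with cosine similarity.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(
    ["Revenue from sponsorship agreements", "Income from commercial partners"]
)

print(embeddings.shape)                                    # (2, 384)
print(util.cos_sim(embeddings[0], embeddings[1]).item())   # similarity score
```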
(iv) FAISS
Facebook AI Similarity Search (FAISS) is a library designed for efficient similarity
search and clustering of dense vectors.
Given a set of embeddings, we can use FAISS to index them and then leverage its
powerful semantic search algorithms to search for the most similar vectors within
the index.
Although it is not a full-fledged vector store in the traditional sense (like a database
management system), it handles the storage of vectors in a way optimized for
efficient nearest-neighbor searches.
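As a standalone sketch (with random vectors standing in for real embeddings), indexing and querying with FAISS looks like this:

```python
import numpy as np
import faiss

# Index a set of embeddings with FAISS and retrieve the nearest neighbours
# of a query vector using an exact (non-quantized) L2 index.
dim = 384                                                   # all-MiniLM-L6-v2 embedding size
embeddings = np.random.rand(1000, dim).astype("float32")    # stand-in embeddings

index = faiss.IndexFlatL2(dim)
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 3)                     # top-3 most similar vectors
print(ids, distances)
```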
(v) Poetry
Poetry is used for setting up the virtual environment and handling Python package
management in this project because of its ease of use and consistency.
I chose the latest open-source Llama-2-7B-Chat model (GGML 8-bit) for this project based on the following considerations:
Llama 2 is currently the top performer across multiple metrics, based on the Open LLM Leaderboard rankings (as of July 2023).
The Llama-2-7B-Chat model is the ideal candidate for our use case since it is designed for conversation and Q&A.
The model is licensed (partially) for commercial use. This is because the fine-tuned Llama-2-Chat model leverages publicly available instruction datasets and over 1 million human annotations.
The original unquantized 16-bit model requires ~15 GB of memory, which comes too close to the 16GB RAM limit.
Other smaller quantized formats (i.e., 4-bit and 5-bit) are available, but they come at the expense of accuracy and response quality.
The accompanying codes for this guide can be found in this GitHub repo, and all
the dependencies can be found in the requirements.txt file.
Note: Since many tutorials are already out there, we will not be deep diving into the
intricacies and details of the general document Q&A components (e.g., text chunking,
vector store setup). We will instead focus on the open-source LLM and CPU inference
aspects in this article.
Step 1 — Process data and build vector store
In this step, three sub-tasks will be performed: (1) load and parse the PDF document, (2) split the extracted text into smaller chunks, and (3) embed the chunks with the sentence-transformers model and store them in a FAISS vector store.
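A minimal sketch of these three sub-tasks using LangChain is shown below. The file paths, chunk size, and overlap are illustrative assumptions rather than the exact values in the repo.

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# (1) Load and parse the annual report PDF
docs = PyPDFLoader("data/manu-20f-2022-09-24.pdf").load()

# (2) Split the text into smaller, overlapping chunks
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
).split_documents(docs)

# (3) Embed the chunks on CPU and store them in a FAISS vector store
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("vectorstore/db_faiss")
```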
Step 2 — Set up the prompt template
Different LLMs expect their prompts in different formats. For example, OpenAI's GPT chat models are designed to take a conversation in and return a message out, meaning input templates are expected to be in a chat-like transcript format (e.g., separate system and user messages).
However, those templates would not work here because our Llama 2 model is not
specifically optimized for that kind of conversational interface. Instead, a classic
prompt template like the one below would be preferred.
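For illustration, a template along these lines can be defined with LangChain's PromptTemplate. The exact wording used in the repo may differ slightly from this sketch.

```python
from langchain.prompts import PromptTemplate

# An illustrative instruction-style template with context and question slots.
qa_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know; don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

prompt = PromptTemplate(template=qa_template, input_variables=["context", "question"])
```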
Note: The relatively smaller LLMs, like the 7B model, appear particularly sensitive to
formatting. For instance, I got slightly different outputs when I altered the
whitespaces and indentation of the prompt template.
Step 3 — Download the Llama-2-7B-Chat GGML binary file
Since we will be running the LLM locally, we need to download the binary file of the quantized Llama-2-7B-Chat model.
We can do so by visiting TheBloke's Llama-2-7B-Chat GGML page hosted on Hugging Face and then downloading the GGML 8-bit quantized file named llama-2-7b-chat.ggmlv3.q8_0.bin.
The downloaded .bin file for the 8-bit quantized model can be saved in a suitable project subfolder like /models.
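If we prefer to script the download instead of clicking through the model page, a sketch using the huggingface_hub client could look like this (the repo ID is an assumption based on TheBloke's page):

```python
from huggingface_hub import hf_hub_download

# Download the 8-bit GGML binary into the local /models subfolder.
hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q8_0.bin",
    local_dir="models",
)
```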
The model card page also displays more information and details for each quantized
format:
Different quantized formats with details | Image by author
Note: To download other GGML quantized models supported by C Transformers, visit the
main TheBloke page on HuggingFace to search for your desired model and look for the links
with names that end with ‘-GGML’.
Note: I set the temperature as 0.01 instead of 0 because I got odd responses (e.g., a
long repeated string of the letter E) when the temperature was exactly zero.
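For reference, here is a minimal sketch of wrapping the quantized binary as a LangChain LLM with this near-zero temperature. The model path and token limit are assumptions rather than the repo's exact settings.

```python
from langchain.llms import CTransformers

# Wrap the local GGML binary as a LangChain LLM running on CPU.
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
    model_type="llama",
    config={"max_new_tokens": 256, "temperature": 0.01},
)
```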
Given that we will return source documents, additional code is appended to process
the document chunks for a better visual display.
To evaluate the speed of CPU inference, the timeit module is also utilized.
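Putting the earlier pieces together, a sketch of such a retrieval QA chain with source documents and a timeit measurement could look like this. It reuses the llm, prompt, and vectorstore objects from the sketches above, and the retriever settings are assumptions.

```python
import timeit
from langchain.chains import RetrievalQA

# Wire the LLM, prompt, and FAISS retriever into a RetrievalQA chain that
# also returns the retrieved source chunks, then time a single query.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

start = timeit.default_timer()
response = qa_chain({"query": "How much is the minimum guarantee payable by adidas?"})
print(response["result"])
print(f"Time to retrieve response: {timeit.default_timer() - start:.2f} s")
```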
Step 7 — Running a sample query
It is now time to put our application to the test. With the project's virtual environment activated, we can run a command in the command line interface (CLI) that contains our user query.
For example, we can ask about the value of the minimum guarantee payable by
Adidas (Manchester United’s global technical sponsor) with the following command:
poetry run python main.py "How much is the minimum guarantee payable by adidas?"
Note: If we are not using Poetry, we can omit the prepended poetry run.
Results
Output from user query passed into document Q&A application | Image by author
The output shows that we successfully obtained the correct response for our user
query (i.e., £750 million), along with the relevant document chunks that are
semantically similar to the query.
The total time of 31 seconds for launching the application and generating a response
is pretty good, given that we are running it locally on an AMD Ryzen 5600X (which is
a good CPU but by no means the best in the market currently).
The result is even more impressive given that running LLM inference on GPUs (e.g.,
directly on HuggingFace) can also take double-digit seconds.
Next Steps
Build a frontend chat interface with Streamlit, especially since it has made two major announcements recently: the integration of Streamlit with LangChain, and the launch of Streamlit ChatUI to build powerful chatbot interfaces easily.
Dockerize and deploy the application on a cloud CPU instance. While we have
explored local inference, the application can easily be ported to the cloud. We
can also leverage more powerful CPU instances on the cloud to speed up
inference (e.g., compute-optimized AWS EC2 instances like c5.4xlarge).
Experiment with slightly larger LLMs like the Llama 13B Chat model. Since we
have worked with 7B models, assessing the performance of slightly larger ones
is a good idea since they should theoretically be more accurate and still fit
within memory.
Experiment with smaller quantized formats like the 4-bit and 5-bit (including
those with the new k-quant method) to objectively evaluate the differences in
inference speed and response quality.
Leverage local GPU to speed up inference. If we want to test the use of GPUs with the C Transformers models, we can do so by running some of the model layers on the GPU. This is worth exploring because Llama is currently the only model type in C Transformers with GPU support.
I will work on articles and projects addressing the above ideas in the upcoming
weeks, so stay tuned for more insightful generative AI content!
Before you go
I welcome you to join me on a data science learning journey! Follow this Medium
page and check out my GitHub to stay in the loop of more exciting practical data
science content. Meanwhile, have fun running open-source LLMs on CPU inference!