Running Llama 2 on CPU Inference Locally for Document Q&A
Kenneth Leung · Towards Data Science · Jul 2023
When we host open-source models ourselves, whether on-premise or in the cloud, the dedicated compute capacity becomes a key consideration. While GPU instances may seem the most convenient choice, the costs can easily spiral out of control.
In this easy-to-follow guide, we will discover how to run quantized versions of open-source LLMs locally on CPU for retrieval-augmented generation (aka document Q&A) in Python. In particular, we will leverage the latest, highly performant Llama 2 chat model in this project.
The accompanying GitHub repo for this article can be found here.
A common method for shrinking model size is quantization, where model weights are converted from their original 16-bit floating-point values to lower-precision ones, such as 8-bit integers.
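As a rough illustration of the idea, the toy Python sketch below maps a tensor of 16-bit floats onto 8-bit integers using a single per-tensor scale factor. This is a simplified scheme for intuition only, not the exact quantization method GGML uses.

```python
import numpy as np

# Toy illustration of 8-bit quantization (simplified; not the exact GGML scheme):
# map 16-bit float weights onto 256 integer levels using one per-tensor scale.
weights_fp16 = np.random.randn(4096).astype(np.float16)

scale = float(np.abs(weights_fp16).max()) / 127                  # per-tensor scale factor
weights_int8 = np.round(weights_fp16 / scale).astype(np.int8)    # 2 bytes -> 1 byte per weight
weights_approx = weights_int8.astype(np.float32) * scale         # dequantized approximation

max_err = np.abs(weights_fp16.astype(np.float32) - weights_approx).max()
print(f"Max absolute reconstruction error: {max_err:.4f}")
```

Each weight now takes one byte instead of two, at the cost of a small reconstruction error.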
The file we will run the document Q&A on is the public 177-page 2022 annual report
of Manchester United Football Club.
Data Source: Manchester United Plc (2022). 2022 Annual Report on Form 20-F.
https://ir.manutd.com/~/media/Files/M/Manutd-IR/documents/manu-20f-2022-09-24.pdf
(CC0: Public Domain, as SEC content is public domain and free to use)
The local machine for this project has an AMD Ryzen 5 5600X 6-Core Processor
coupled with 16GB RAM (DDR4 3600). While it also has an RTX 3060 Ti GPU (8GB VRAM), it will not be used in this project since we will focus on CPU usage.
Let us now explore the software tools we will leverage in building this backend
application:
(i) LangChain
LangChain is a popular framework for developing applications powered by language
models. It provides an extensive set of integrations and data connectors, allowing us
to chain and orchestrate different modules to create advanced use cases like
chatbots, data analysis, and document Q&A.
(ii) C Transformers
C Transformers is a Python library that provides bindings for transformer models
implemented in C/C++ using the GGML library. At this point, let us first understand
what GGML is about.
Built by the team at ggml.ai, the GGML library is a tensor library designed for
machine learning, where it enables large models to be run on consumer hardware
with high performance. This is achieved through integer quantization support and
built-in optimization algorithms.
LLMs (and corresponding model type name) supported on C Transformers | Image by author
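To make this concrete, here is a minimal sketch of loading a GGML model directly with C Transformers and generating text on the CPU. The model path is an assumption based on the file we download later in this guide.

```python
from ctransformers import AutoModelForCausalLM

# Load a local GGML Llama model on CPU via the C Transformers bindings.
# The file path is an assumption (see Step 3 for the actual download).
llm = AutoModelForCausalLM.from_pretrained(
    "models/llama-2-7b-chat.ggmlv3.q8_0.bin",
    model_type="llama",   # model type name as listed in the image above
)

print(llm("What is retrieval-augmented generation?", max_new_tokens=64))
```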
(iii) Sentence-Transformers
Sentence-Transformers is a Python framework for computing dense vector representations (embeddings) of sentences and paragraphs. It enables users to compute embeddings for more than 100 languages, which can then be compared to find sentences with similar meanings.
We will use the open-source all-MiniLM-L6-v2 model for this project because it offers a good balance of speed and general-purpose embedding quality.
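For instance, a minimal sketch of embedding two sentences with this model and comparing them (the example sentences are purely illustrative):

```python
from sentence_transformers import SentenceTransformer, util

# Embed two illustrative sentences with all-MiniLM-L6-v2 (384-dimensional vectors)
# and compare them with cosine similarity.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(
    ["Revenue from sponsorship agreements", "Income from commercial partners"]
)

print(embeddings.shape)                                    # (2, 384)
print(util.cos_sim(embeddings[0], embeddings[1]).item())   # similarity score
```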
(iv) FAISS
Facebook AI Similarity Search (FAISS) is a library designed for efficient similarity
search and clustering of dense vectors.
Given a set of embeddings, we can use FAISS to index them and then leverage its
powerful semantic search algorithms to search for the most similar vectors within
the index.
Although it is not a full-fledged vector store in the traditional sense (like a database
management system), it handles the storage of vectors in a way optimized for
efficient nearest-neighbor searches.
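As a standalone sketch (with random vectors standing in for real embeddings), indexing and querying with FAISS looks like this:

```python
import numpy as np
import faiss

# Index a set of embeddings with FAISS and retrieve the nearest neighbours
# of a query vector using an exact (non-quantized) L2 index.
dim = 384                                                   # all-MiniLM-L6-v2 embedding size
embeddings = np.random.rand(1000, dim).astype("float32")    # stand-in embeddings

index = faiss.IndexFlatL2(dim)
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 3)                     # top-3 most similar vectors
print(ids, distances)
```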
(v) Poetry
Poetry is used for setting up the virtual environment and handling Python package
management in this project because of its ease of use and consistency.
I chose the latest open-source Llama-2-7B-Chat model (GGML 8-bit) for this project based on the following considerations:
Llama 2 is currently the top performer across multiple metrics, based on the Open LLM Leaderboard rankings (as of July 2023).
The Llama-2-7B-Chat model is the ideal candidate for our use case since it is designed for conversation and Q&A.
The model is licensed (partially) for commercial use. This is because the fine-tuned Llama-2-Chat model leverages publicly available instruction datasets and over 1 million human annotations.
The original unquantized 16-bit model requires ~15 GB of memory, which comes too close to the 16GB RAM limit.
Other smaller quantized formats (i.e., 4-bit and 5-bit) are available, but they come at the expense of accuracy and response quality.
The accompanying codes for this guide can be found in this GitHub repo, and all
the dependencies can be found in the requirements.txt file.
Note: Since many tutorials are already out there, we will not be deep diving into the
intricacies and details of the general document Q&A components (e.g., text chunking,
vector store setup). We will instead focus on the open-source LLM and CPU inference
aspects in this article.
Step 1 — Process data and build vector store
In this step, three sub-tasks will be performed: (1) load and parse the PDF document, (2) split the extracted text into smaller chunks, and (3) embed the chunks with the sentence-transformers model and store them in a FAISS vector store.
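A minimal sketch of these three sub-tasks using LangChain is shown below. The file paths, chunk size, and overlap are illustrative assumptions rather than the exact values in the repo.

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# (1) Load and parse the annual report PDF
docs = PyPDFLoader("data/manu-20f-2022-09-24.pdf").load()

# (2) Split the text into smaller, overlapping chunks
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
).split_documents(docs)

# (3) Embed the chunks on CPU and store them in a FAISS vector store
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("vectorstore/db_faiss")
```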
Step 2 — Set up the prompt template
Different LLMs expect their prompts in different formats. For example, OpenAI's GPT chat models are designed to take a conversation in and return a message out, meaning input templates are expected to be in a chat-like transcript format (e.g., separate system and user messages).
However, those templates would not work here because our Llama 2 model is not
specifically optimized for that kind of conversational interface. Instead, a classic
prompt template like the one below would be preferred.
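For illustration, a template along these lines can be defined with LangChain's PromptTemplate. The exact wording used in the repo may differ slightly from this sketch.

```python
from langchain.prompts import PromptTemplate

# An illustrative instruction-style template with context and question slots.
qa_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know; don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

prompt = PromptTemplate(template=qa_template, input_variables=["context", "question"])
```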
Note: The relatively smaller LLMs, like the 7B model, appear particularly sensitive to
formatting. For instance, I got slightly different outputs when I altered the
whitespaces and indentation of the prompt template.
Step 3 — Download the Llama-2-7B-Chat GGML binary file
Since we will be running the LLM locally, we need to download the binary file of the quantized Llama-2-7B-Chat model.
We can do so by visiting TheBloke's Llama-2-7B-Chat GGML page hosted on Hugging Face and then downloading the GGML 8-bit quantized file named llama-2-7b-chat.ggmlv3.q8_0.bin.
The downloaded .bin file for the 8-bit quantized model can be saved in a suitable project subfolder like /models.
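If we prefer to script the download instead of clicking through the model page, a sketch using the huggingface_hub client could look like this (the repo ID is an assumption based on TheBloke's page):

```python
from huggingface_hub import hf_hub_download

# Download the 8-bit GGML binary into the local /models subfolder.
hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q8_0.bin",
    local_dir="models",
)
```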
The model card page also displays more information and details for each quantized
format:
Different quantized formats with details | Image by author
Note: To download other GGML quantized models supported by C Transformers, visit the
main TheBloke page on HuggingFace to search for your desired model and look for the links
with names that end with ‘-GGML’.
Note: I set the temperature as 0.01 instead of 0 because I got odd responses (e.g., a
long repeated string of the letter E) when the temperature was exactly zero.
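For reference, here is a minimal sketch of wrapping the quantized binary as a LangChain LLM with this near-zero temperature. The model path and token limit are assumptions rather than the repo's exact settings.

```python
from langchain.llms import CTransformers

# Wrap the local GGML binary as a LangChain LLM running on CPU.
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
    model_type="llama",
    config={"max_new_tokens": 256, "temperature": 0.01},
)
```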
Given that we will return source documents, additional code is appended to process
the document chunks for a better visual display.
To evaluate the speed of CPU inference, the timeit module is also utilized.
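Putting the earlier pieces together, a sketch of such a retrieval QA chain with source documents and a timeit measurement could look like this. It reuses the llm, prompt, and vectorstore objects from the sketches above, and the retriever settings are assumptions.

```python
import timeit
from langchain.chains import RetrievalQA

# Wire the LLM, prompt, and FAISS retriever into a RetrievalQA chain that
# also returns the retrieved source chunks, then time a single query.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

start = timeit.default_timer()
response = qa_chain({"query": "How much is the minimum guarantee payable by adidas?"})
print(response["result"])
print(f"Time to retrieve response: {timeit.default_timer() - start:.2f} s")
```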
Step 7 — Running a sample query
It is now time to put our application to the test. With the project's virtual environment activated, we can run a command in the command line interface (CLI) that contains our user query.
For example, we can ask about the value of the minimum guarantee payable by
Adidas (Manchester United’s global technical sponsor) with the following command:
poetry run python main.py "How much is the minimum guarantee payable by adidas?"
Note: If we are not using Poetry, we can omit the prepended poetry run.
Results
Output from user query passed into document Q&A application | Image by author
The output shows that we successfully obtained the correct response for our user
query (i.e., £750 million), along with the relevant document chunks that are
semantically similar to the query.
The total time of 31 seconds for launching the application and generating a response
is pretty good, given that we are running it locally on an AMD Ryzen 5600X (which is
a good CPU but by no means the best in the market currently).
The result is even more impressive given that running LLM inference on GPUs (e.g.,
directly on HuggingFace) can also take double-digit seconds.
Next Steps
Build a frontend chat interface with Streamlit, especially since it has made two major announcements recently: the integration of Streamlit with LangChain, and the launch of Streamlit ChatUI to build powerful chatbot interfaces easily.
Dockerize and deploy the application on a cloud CPU instance. While we have
explored local inference, the application can easily be ported to the cloud. We
can also leverage more powerful CPU instances on the cloud to speed up
inference (e.g., compute-optimized AWS EC2 instances like c5.4xlarge).
Experiment with slightly larger LLMs like the Llama 13B Chat model. Since we
have worked with 7B models, assessing the performance of slightly larger ones
is a good idea since they should theoretically be more accurate and still fit
within memory.
Experiment with smaller quantized formats like the 4-bit and 5-bit (including
those with the new k-quant method) to objectively evaluate the differences in
inference speed and response quality.
Leverage local GPU to speed up inference. If we want to test the use of GPUs with the C Transformers models, we can do so by running some of the model layers on the GPU. This is worth exploring because Llama is currently the only model type in C Transformers with GPU support.
I will work on articles and projects addressing the above ideas in the upcoming
weeks, so stay tuned for more insightful generative AI content!
Before you go
I welcome you to join me on a data science learning journey! Follow this Medium
page and check out my GitHub to stay in the loop of more exciting practical data
science content. Meanwhile, have fun running open-source LLMs on CPU inference!