LLM Application Through Production
Large Language Models: Application through Production
2. Primer on NLP
Matei Zaharia
Co-founder & CTO of Databricks
Associate Professor of Computer Science
at Stanford University
"Chegg shares drop more than 40% after company says ChatGPT is killing its business" (05/02/2023, Link)

"[...] ask GitHub Copilot to explain a piece of code. Bump into an error? Have GitHub Copilot fix it. It'll even generate unit tests so you can get back to building what's next." (03/22/2023*, Link)

*Announcement date instead of article date
LLMs are not that new
Why should I care now?
Decision criteria
2. Primer on NLP
Summarization
• Clinical decision support.
• News article sentiments.
• Legal proceeding summary.

Translation
"I like this book." → "Me gusta este libro."

Question answering: chatbots
"What's the best sci-fi book ever?" → "It really depends on your preferences. Some of the top-rated ones include…"

Text classification
• Customer review sentiments.
• Genre/topic classification.

Image captioning
Source: Show and Tell: A Neural Image Caption Generator
Text interpretation is challenging
“The ball hit the table and it broke.” “What’s the best sci-fi book ever?”
Large Language Models: what makes them "larger" than other language models?
Categories:
• Generative: find the most likely next word
• Classification: find the most likely classification/answer
Tokenization - Words

The moon, Earth's only natural satellite, has been a subject of fascination and wonder for thousands of years.

Corpus of training data used to build our vocabulary.

Build index (dictionary of tokens = words):
{a: 0, The: 1, is: 2, what: 3, I: 4, and: 5, …}

Map tokens to indices:
{The → [1], moon, → [45600], Earth's → [8097], only → [43], natural → [1323], satellite → [754], …}

Pros: Intuitive.
Cons: Big vocabularies. Complications such as handling misspellings and other out-of-vocabulary words.
Tokenization - Characters (this vocab is too small!)

The moon, Earth's only natural satellite, has been a subject of fascination and wonder for thousands of years.

Corpus of training data used to build our vocabulary.

Build index (alphabet; dictionary of tokens = letters/characters):
{a: 0, b: 1, c: 2, d: 3, e: 4, f: 5, …}

Map tokens to indices:
{t → 19, h → 7, e → 4, m → 12, o → 14, o → 14, n → 13, … → …}

Pros: Small vocabulary. No out-of-vocabulary words.
Cons: Loss of context within words. Much longer sequences for a given input.
Tokenization - Sub-words

The moon, Earth's only natural satellite, has been a subject of fascination and wonder for thousands of years.

Corpus of training data used to build our vocabulary.

Build index (byte-pair encoding; dictionary of tokens = mix of words and sub-words):
{a: 0, as: 1, ask: 2, be: 3, ca: 4, cd: 5, …}

Map tokens to indices:
{The → 319, moon → 12, **, → 391, Earth → 178, **'s → 198, on → 79, ly → 281, … → …}

A compromise: Byte Pair Encoding (BPE) is a popular encoding.
• Start with a small vocab of characters.
• Iteratively merge frequent pairs into new bytes in the vocab (such as "b", "e" → "be").
• "Smart" vocabulary built from characters which co-occur frequently.
• More robust to novel words.
Tokenization

Tokenization method | Tokens | Token count | Vocab size
Sentence | 'The moon, Earth's only natural satellite, has been a subject of fascination and wonder for thousands of years.' | 1 | # sentences in doc
Word | 'The', 'moon,', "Earth's", 'only', 'natural', 'satellite,', 'has', 'been', 'a', 'subject', 'of', 'fascination', 'and', 'wonder', 'for', 'thousands', 'of', 'years.' | 18 | 171K (English¹)
Sub-word | 'The', 'moon', ',', 'Earth', "'", 's', 'on', 'ly', 'n', 'atur', 'al', 's', 'ate', 'll', 'it', 'e', ',', 'has', 'been', 'a', 'subject', 'of', 'fascinat', 'ion', 'and', 'w', 'on', 'd', 'er', 'for', 'th', 'ous', 'and', 's', 'of', 'y', 'ears', '.' | 37 | (varies)
Character | 'T', 'h', 'e', ' ', 'm', 'o', 'o', 'n', ',', ' ', 'E', 'a', 'r', 't', 'h', "'", 's', ' ', 'o', 'n', 'l', 'y', ' ', 'n', 'a', 't', 'u', 'r', 'a', 'l', ' ', 's', 'a', 't', 'e', 'l', 'l', 'i', 't', 'e', ',', ' ', 'h', 'a', 's', ' ', 'b', 'e', 'e', 'n', ' ', 'a', ' ', 's', 'u', 'b', 'j', 'e', 'c', 't', ' ', 'o', 'f', ' ', 'f', 'a', 's', 'c', 'i', 'n', 'a', 't', 'i', 'o', 'n', ' ', 'a', 'n', 'd', ' ', 'w', 'o', 'n', 'd', 'e', 'r', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'o', 'u', 's', 'a', 'n', 'd', 's', ' ', 'o', 'f', ' ', 'y', 'e', 'a', 'r', 's', '.' | 110 | 52 + punctuation (English)
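A minimal sketch of these three granularities in code, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (a WordPiece sub-word tokenizer) purely for illustration:

    from transformers import AutoTokenizer

    text = ("The moon, Earth's only natural satellite, has been a subject of "
            "fascination and wonder for thousands of years.")

    # Word-level: naive whitespace split.
    word_tokens = text.split()

    # Character-level: every character (including spaces) becomes a token.
    char_tokens = list(text)

    # Sub-word level: a pre-trained sub-word tokenizer from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    subword_tokens = tokenizer.tokenize(text)

    # Word count is smallest, character count is largest; sub-word sits in between.
    print(len(word_tokens), len(subword_tokens), len(char_tokens))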
Word embeddings

"puppy" → Embedding function (pre-trained module, e.g. a word2vec model) → [0.2, 1.5, 0.6, …, 0.6]
word/token → word embedding/vector

When done well, similar words will be closer in these embedding/vector spaces.

Source: Word Embedding: Basics. Create a vector from a word | by Hariom Gautam | Medium
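A minimal sketch of the idea, assuming the gensim library and its downloadable glove-wiki-gigaword-50 vectors (any pre-trained word-embedding model would do):

    import gensim.downloader as api

    # Load a small pre-trained word-embedding model (50-dimensional GloVe vectors).
    model = api.load("glove-wiki-gigaword-50")

    vector = model["puppy"]      # an array of 50 floats, e.g. [0.2, 1.5, ...]
    print(vector.shape)          # (50,)

    # Similar words should be nearby in the vector space.
    print(model.most_similar("puppy", topn=3))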
Dense vector representations
Visualizing common words using word vectors.
Word Embedding: Basics. Create a vector from a word | by Hariom Gautam | Medium
Natural Language Processing (NLP)
Let’s review
• Large LMs are just LMs with transformer architectures, but bigger.
2. Primer on NLP
(CNN) A magnitude 6.7 earthquake rattled Papua New Guinea early Friday afternoon, according to the U.S. Geological Survey. The quake was centered about 200 miles north-northeast of Port Moresby and had a depth of 28 miles. No tsunami warning was issued… → <Article 1 summary>
… → <Article 2 summary>
Hugging Face:
The GitHub of Large Language Models
• Datasets
• Spaces for demos and code
Under the hood, these libraries can use PyTorch, TensorFlow, and JAX.
LLM Pipeline

from transformers import pipeline

summarizer = pipeline("summarization")
summarizer("A magnitude 6.7 earthquake rattled ...")

(CNN) A magnitude 6.7 earthquake rattled… → <Article 1 summary>

Under the hood: (optional) prompt construction → Tokenizer (encoding) → Model (LLM) → Tokenizer (decoding)

Input text: Summarize: "A magnitude 6.7 earthquake rattled…"
Encoded input: [23981, 391078, 19, 308, …]
Encoded output: [1827, 308, 25, …]
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("<model_name>")

summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,  # mask handles variable-length inputs
    num_beams=10,                          # models search for the best output
    min_length=5,                          # adjust output lengths to match the task
    max_length=40)
# generate returns the encoded output, e.g. [1827, 308, 25, …]
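For context, a self-contained sketch of the full encode → generate → decode flow; the checkpoint name is an assumption (any summarization model from the Hub would work):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "sshleifer/distilbart-cnn-12-6"  # illustrative summarization checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    article = "A magnitude 6.7 earthquake rattled Papua New Guinea early Friday afternoon..."

    # Tokenizer (encoding): text -> token IDs + attention mask.
    inputs = tokenizer(article, return_tensors="pt", truncation=True)

    # Model (LLM): beam search over the encoded input.
    summary_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=10,
        min_length=5,
        max_length=40,
    )

    # Tokenizer (decoding): token IDs -> text.
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))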
Datasets library
• 1-line APIs for loading and sharing datasets
• NLP, Audio, and Computer Vision tasks
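A minimal sketch of the 1-line loading API, assuming the datasets library; the cnn_dailymail dataset is just an illustrative choice:

    from datasets import load_dataset

    # Load a summarization dataset from the Hugging Face Hub in one line.
    dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")

    print(dataset)                       # features include: article, highlights, id
    print(dataset[0]["article"][:200])   # peek at the first article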
NLP task behind this app: Summarization
• Extractive: Select representative pieces of text.
• Abstractive: Generate new text.

Find a model for this task:
• Hugging Face Hub → 176,620 models.
• Filter by task → 960 models.
• Then…? Consider your needs.
Model family | Sizes | License | Creator | Year | Notes
GPT-Neo/X | 125 M - 20 B | MIT / Apache 2.0 | EleutherAI | 2021 / 2022 | based on GPT-2 architecture
FLAN | 80 M - 540 B | Apache 2.0 | Google | 2021 | methods to improve training for existing architectures
BART | 139 M - 406 M | Apache 2.0 | Meta | 2019 | derived from BERT, GPT, others
• Summarization
• Sentiment analysis
• Translation
• Zero-shot classification
• Few-shot learning
(We'll focus on these examples in this module.)

• Conversation / chat
• (Table) Question-answering
• Text / token classification
• Text generation
(Some "tasks" are very general and overlap with other tasks.)
Task: Sentiment analysis

Example app: Stock market analysis. I need to monitor the stock market, and I want to use Twitter commentary as an early indicator of trends.

"New for subscribers: Analysts continue to upgrade tech stocks on hopes the rebound is for real…" → Positive
"<company> stock price target cut to $54 vs. $55 at BofA Merrill Lynch" → Negative

sentiment_classifier(tweets)

Out: [{'label': 'positive', 'score': 0.997},
      {'label': 'negative', 'score': 0.996},
      …]

Blog on sentiment analysis: huggingface.co
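One possible way to construct such a sentiment_classifier with a transformers pipeline; the Twitter-tuned checkpoint is an assumption, not the deck's prescribed model:

    from transformers import pipeline

    # Any sentiment model from the Hub would do; this one is tuned on tweets.
    sentiment_classifier = pipeline(
        task="text-classification",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    )

    tweets = [
        "New for subscribers: Analysts continue to upgrade tech stocks...",
        "<company> stock price target cut to $54 vs. $55 at BofA Merrill Lynch",
    ]
    print(sentiment_classifier(tweets))
    # e.g. [{'label': 'positive', 'score': 0.997}, {'label': 'negative', 'score': 0.996}]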
Task: Translation

en_to_es_translator = pipeline(
    task="text2text-generation",           # task of variable length
    model="Helsinki-NLP/opus-mt-en-es")    # translates English to Spanish

# General models may support multiple languages and require prompts / instructions.
t5_translator("translate English to Romanian: Existing, open-source models...")
Task: Zero-shot classification

predicted_label = zero_shot_pipeline(
    sequences=article,
    candidate_labels=["politics", "Breaking news", "sports"])

Zero-shot classification overview: huggingface.co
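A sketch of how the zero_shot_pipeline above could be constructed; the bart-large-mnli checkpoint is an assumed (common) choice:

    from transformers import pipeline

    # NLI-based models are a standard backbone for zero-shot classification.
    zero_shot_pipeline = pipeline(
        task="zero-shot-classification",
        model="facebook/bart-large-mnli",
    )

    article = "The home team clinched the championship with a last-minute goal."
    print(zero_shot_pipeline(
        sequences=article,
        candidate_labels=["politics", "Breaking news", "sports"],
    ))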
Task: Few-shot learning

"Show" a model what you want:

pipeline(
    """For each tweet, describe its sentiment:
    ...

Blog about GPT-Neo: huggingface.co

Prompts: our entry to interacting with LLMs

A few-shot prompt combines:
• An instruction: "For each tweet, describe its sentiment:"
• A few examples, separated by a delimiter:
  ###
  [Tweet]: "This is the link to the article"
  [Sentiment]: Neutral
  ###
• The query to answer.

Example from blog post: huggingface.co
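Putting the pieces together, a hedged sketch of a full few-shot prompt sent through a text-generation pipeline; the GPT-Neo checkpoint and the extra example tweets are invented for illustration:

    from transformers import pipeline

    # Any instruction/completion model would do; GPT-Neo matches the blog above.
    generator = pipeline(task="text-generation", model="EleutherAI/gpt-neo-1.3B")

    # Instruction, then a few examples, then the query to answer.
    prompt = """For each tweet, describe its sentiment:

    [Tweet]: "I hate it when my phone battery dies."
    [Sentiment]: Negative
    ###
    [Tweet]: "My day has been great!"
    [Sentiment]: Positive
    ###
    [Tweet]: "This is the link to the article"
    [Sentiment]: Neutral
    ###
    [Tweet]: "This new music video was incredible"
    [Sentiment]:"""

    print(generator(prompt, max_new_tokens=5)[0]["generated_text"])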
Prompts get complicated

Prompt Engineering: structured output extraction example from LangChain.

pipeline("""
Answer the user query. The output should be formatted as JSON that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]} the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"setup": {"title": "Setup", "description": "question to set up a joke", "type": "string"}, "punchline": {"title": "Punchline", "description": "answer to resolve the joke", "type": "string"}}, "required": ["setup", "punchline"]}
```

Tell me a joke.""")

The pieces of this prompt:
• High-level instruction: "Answer the user query. The output should be formatted as JSON…"
• Explanation of how to understand the desired output format: the "As an example, for the schema…" paragraph.
• Output format: the output schema block.
• Input / question (main instruction): "Tell me a joke."
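For reference, a sketch of how a prompt like this can be generated programmatically, assuming a 2023-era LangChain with PydanticOutputParser; the Joke class is illustrative:

    from langchain.output_parsers import PydanticOutputParser
    from langchain.prompts import PromptTemplate
    from pydantic import BaseModel, Field

    # The desired structured output: a joke with a setup and a punchline.
    class Joke(BaseModel):
        setup: str = Field(description="question to set up a joke")
        punchline: str = Field(description="answer to resolve the joke")

    parser = PydanticOutputParser(pydantic_object=Joke)

    # LangChain turns the schema into format instructions much like the prompt above.
    prompt = PromptTemplate(
        template="Answer the user query.\n{format_instructions}\n{query}\n",
        input_variables=["query"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    print(prompt.format(query="Tell me a joke."))
    # After calling an LLM with this prompt, parser.parse(llm_output) returns a Joke object.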
Jailbreaking: bypass moderation rules.
Prompt leaking: extract sensitive information.
• Learn best practices for when to use vector stores and how to improve
search-retrieval performance
Turn images and audio into vectors too

Data objects → Vectors → Tasks
Images → [0.5, 1.4, -1.3, …] → object recognition, scene detection, product search
Text → [0.8, 1.4, -2.3, …] → translation, question answering, semantic search
Audio → [1.8, 0.4, -1.5, …] → speech to text, music transcription, machinery malfunction
Use cases of vector databases

• Similarity search: text, images, audio
  • De-duplication
  • Semantic match, rather than keyword match!
    • Example: the query "Are electric cars better for the environment?" matches "electric cars climate impact" and retrieves "Environmental impact of electric vehicles".
    • Example on enhancing product search
  • Very useful for knowledge-based Q/A
• Recommendation engines
  • Example: the query "How to cope with the pandemic" can surface "dealing with covid ptsd".
  • Example blog post: Spotify uses vector search to recommend podcast episodes
  • Source: Spotify

Distance vs. similarity metrics: for a distance metric, the higher the metric, the less similar; for a similarity metric, the higher the metric, the more similar.
Source: builtin.com
Source: Pinecone
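A minimal semantic-search sketch, assuming sentence-transformers and FAISS; the embedding checkpoint and the toy documents are assumptions:

    import faiss
    from sentence_transformers import SentenceTransformer

    # Embed documents and queries with the same model.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "Environmental impact of electric vehicles",
        "How to cope with the pandemic",
        "Best sci-fi books of the decade",
    ]
    doc_vectors = embedder.encode(docs, normalize_embeddings=True)

    # Inner product on normalized vectors = cosine similarity.
    index = faiss.IndexFlatIP(doc_vectors.shape[1])
    index.add(doc_vectors)

    # Semantic match, not keyword match: the query shares no keywords with the top hit.
    query = embedder.encode(["Are electric cars better for the environment?"],
                            normalize_embeddings=True)
    scores, ids = index.search(query, 2)
    print([docs[i] for i in ids[0]], scores[0])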
The ability to search for similar objects is usually combined with filtering. Filtering strategies:
• Post-query
  • # of results is highly unpredictable
• In-query
  • Branding as a scalar
• Pre-query
  • Not as performant as post- or in-query filtering

Vector database offerings: open-sourced vs. not open-sourced (pros and cons for each).
• Splitting 1 doc into smaller docs means 1 doc can produce N vectors of M tokens each (see the chunking sketch after the resource list below).
Existing resources:
• Text Splitters by LangChain
• Blog post on semantic search by Vespa - light mention of chunking
• Chunking Strategies by Pinecone
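As referenced above, a small chunking sketch assuming LangChain's RecursiveCharacterTextSplitter; the chunk sizes are illustrative:

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Split one long document into overlapping chunks.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,      # max characters per chunk
        chunk_overlap=50,    # overlap preserves context across chunk boundaries
    )

    long_document = "The moon, Earth's only natural satellite, ... " * 100
    chunks = splitter.split_text(long_document)

    print(len(chunks))  # 1 doc -> N chunks, each of which gets its own vector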
• Apply LangChain to leverage multiple LLM providers, such as OpenAI and Hugging Face.
• Create complex logic flows with agents in LangChain to pass prompts and use logical reasoning to complete tasks.
Tasks and workflows

• Task: a single prompt → response interaction with an LLM (task/application).
• Workflow: an application with more than a single interaction; a chain of tasks from "workflow initiated" to "task completed" (an end-to-end workflow).
Summarize and Sentiment

Example multi-LLM problem: get the sentiment of many articles on a topic.

Article 1: "…", Article 2: "…", Article 3: "…", … → Summary LLM → Summary 1 + Summary 2 + "…" → Sentiment LLM → Overall Sentiment

Goal: create a reusable workflow for multiple articles.

For this we'll focus on the first task (summarization) first. How do we make this process systematic?

Now we need the output from our new engineered prompts to be the input to the sentiment analysis LLM. For this we're going to chain these LLMs together.
# We will also need another prompt template like before, a new sentiment prompt
sentiment_prompt_template = """
Evaluate the sentiment of the following summary: {summary}
Sentiment: """
Workflow Chain

Summary Chain
• LLM used: summarization LLM
• Input: summary_prompt (formats Article_1 into prompt format)
• Output: article1_summary

Sentiment Chain
• LLM used: sentiment LLM
• Input: sentiment_prompt (formats article1_summary into prompt format)
• Output: summary sentiment
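A hedged sketch of this two-step chain, assuming a 2023-era LangChain; llm stands in for any LangChain-wrapped model and article_1 for an input article:

    from langchain.chains import LLMChain, SimpleSequentialChain
    from langchain.prompts import PromptTemplate

    # llm: any LangChain-wrapped model (OpenAI, Hugging Face Hub, ...), assumed to exist.
    summary_prompt = PromptTemplate(
        input_variables=["article"],
        template="Summarize the following article:\n{article}\nSummary:",
    )
    summary_chain = LLMChain(llm=llm, prompt=summary_prompt)

    sentiment_prompt = PromptTemplate(
        input_variables=["summary"],
        template="Evaluate the sentiment of the following summary: {summary}\nSentiment:",
    )
    sentiment_chain = LLMChain(llm=llm, prompt=sentiment_prompt)

    # Chain the two LLM calls: the summary output becomes the sentiment input.
    workflow = SimpleSequentialChain(chains=[summary_chain, sentiment_chain])

    article_1 = "A magnitude 6.7 earthquake rattled Papua New Guinea early Friday..."
    overall_sentiment = workflow.run(article_1)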
Agents are LLM-based systems that execute a reason-and-act loop: a set of tools that the LLM will select and execute to perform steps to achieve the task.

Building reasoning loops (simplified code from the LangChain Agent):

def take_next_step():
    """Take a single step in the thought-action-observation loop.

    intermediate_steps: steps the LLM has taken to date, along with observations.
    """
    # Call the LLM to see what to do.
    output = self.llm_chain.run(intermediate_steps=intermediate_steps)
    return self.output_parser.parse(output)

tools = load_tools(["Google Search", "Python Interpreter"])
agent = initialize_agent(tools, llm)
agent.run("In what year was Isaac Newton born? What is that year raised to the power of 0.3141?")

Source: csdn.net
Source: Twitter.com
Agent frameworks:
• LangChain
• HF transformers Agents
• HuggingGPT / Jarvis
• BabyAGI
• Be familiar with common tools for training and fine-tuning, such as those from Hugging
Face and DeepSpeed.
Example application: turn news articles into summary riddles.

News API → LLM (with "some" premade examples) → <Article 1 summary riddle>

Options for the LLM:
• Paid LLM-as-a-Service
• Open-source instruction-following LLM
• Build your own…
The LLM needs to reframe the output as a riddle, so we use few-shot examples. Considerations:
• Large version of base LLM
• Long input sequence

[Article 1]: "Residents were awoken to the surprise…"
[Summary Riddle 1]: "In houses they stay, the peop…"
###
[Article 2]: "Gas prices reached an all time …"
[Summary Riddle 2]: "Far you will drive, to find…"
###
…
###
[Article n]: {article}
[Summary Riddle n]:
Paid LLM-as-a-Service: send the prompt to the API and get the result back.

LLM_API(prompt(article), api_key="sk-@sjr…")
Build your own: create a full model from scratch, or fine-tune an existing model.
Fine-tuning example: Pythia 12B (layers: 36, dimensions: 5120, heads: 40, seq. len: 2048), pre-trained on The Pile (an 800GB dataset of diverse text for language modeling), fine-tuned on databricks-dolly-15k.
EVALUATION TIME!
But for a good LLM, what does the loss tell us? A good language model will have high accuracy and low perplexity.
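As a concrete illustration of perplexity (not from the deck): it is the exponential of the average negative log-likelihood the model assigns to the reference tokens.

    import math

    # Toy example: the probabilities a language model assigns to each reference token.
    token_probs = [0.25, 0.10, 0.50, 0.05]

    # Cross-entropy loss = average negative log-likelihood per token.
    loss = -sum(math.log(p) for p in token_probs) / len(token_probs)

    # Perplexity = exp(loss); lower perplexity means the model is less "surprised".
    perplexity = math.exp(loss)
    print(loss, perplexity)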
A language model outputs a probability distribution over tokens (cf. n-gram models such as bi-grams and tri-grams).
Reference: "Life is what happens when you're busy making other plans."

N-gram recall = (total matching N-grams) / (total N-grams in the reference)

References: Rajpurkar et al., 2016 and https://rajpurkar.github.io/SQuAD-explorer/
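A minimal sketch of computing such n-gram overlap metrics with the Hugging Face evaluate library; the prediction string is an invented example:

    import evaluate

    # ROUGE measures recall-oriented n-gram overlap between prediction and reference.
    rouge = evaluate.load("rouge")

    reference = ["Life is what happens when you're busy making other plans."]
    prediction = ["Life is what happens while you are busy making plans."]

    print(rouge.compute(predictions=prediction, references=reference))
    # Returns rouge1, rouge2, rougeL, rougeLsum scores between 0 and 1.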
Evaluation metrics at the cutting edge
ChatGPT and InstructGPT (its predecessor) used similar techniques
1. Target application
a. NLP tasks: Q&A, reading comprehension, and summarization
b. Queries chosen to match the API distribution
c. Metric: human preference ratings
2. Alignment
a. “Helpful” → Follow instructions, and infer user intent. Main metric: human
preference ratings
b. “Honest” → Metrics: human grading on “hallucinations” and TruthfulQA benchmark
dataset
c. “Harmless” → Metrics: human and automated grading for toxicity
(RealToxicityPrompts); automated grading for bias (Winogender, CrowS-Pairs)
i. Note: Human labelers were given very specific definitions of “harmful” (violent content, etc.)
• Examine datasets used to train LLMs and assess their inherent bias
Risks and limitations:
• Big data != good data
• Discrimination, exclusion, toxicity
• Information hazards
• Misinformation harms
• Malicious uses
• Human-computer interaction harms
• Automation of human jobs
• Environmental harms and costs

Sources: Bender et al 2021 and Kasneci et al 2023
Models can be toxic, discriminatory, exclusive
Reason: data is flawed
Source: Allen AI
Image source: giphy.com
Hallucination: intrinsic vs. extrinsic

Intrinsic: the output contradicts the source.
Source: "The first Ebola vaccine was approved by the FDA in 2019, five years after the initial outbreak in 2014."
Output: "The first Ebola vaccine was approved in 2021."

Extrinsic: cannot verify the output from the source, but it might not be wrong.
Source: "Alice won first prize in fencing last week."
Summary output: "Alice won first prize fencing for the first time last week and she was ecstatic."
Goals of MLOps
• Maintain stable performance
• Meet KPIs
• Update models and systems as needed
• Reduce risk of system failures

(Figure: Google Search popularity of "MLOps" over time.)

See "The Big Book of MLOps" for an overview
Traditional MLOps architecture

Different production tooling: big models, vector databases, etc.
• Prompt engineering
• Packaging models or pipelines for deployment
• Scaling out
• Managing cost/performance tradeoffs
• Human feedback, testing, and monitoring
• Deploying models vs. deploying code
• Service infrastructure: vector databases and complex models
Packaging models or pipelines for deployment with MLflow:

• Model API:
  mlflow.openai.log_model(model="gpt-3.5-turbo", task=openai.ChatCompletion, …)
• (New) fine-tuned model:
  mlflow.pytorch.log_model(pytorch_model=my_finetuned_model, …)
• LangChain chain (vector DB lookup, prompt template, Hugging Face pipeline):
  mlflow.langchain.log_model(lc_model=llm_chain, …)
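A short usage note (the artifact path and run ID are placeholders, not from the deck): any of these logged flavors can be loaded back through MLflow's generic pyfunc interface.

    import mlflow

    # Placeholder URI: fill in the run ID and artifact path used when logging.
    model_uri = "runs:/<run_id>/model"

    # The generic pyfunc interface gives every flavor the same stable predict() signature.
    loaded = mlflow.pyfunc.load_model(model_uri)
    predictions = loaded.predict(["A magnitude 6.7 earthquake rattled ..."])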
(Figure: deployment options, ranging from in-line code to cloud inference services.)
Metrics to optimize
• Cost of queries and training
• Time for development
• ROI of the LLM-powered product
• Accuracy/metrics of model
• Query latency
Deploy models
Source: The Big Book of MLOps
Service architecture

Vector databases:
• LLM pipeline run as a batch job
• Vector DB in local cache
• LLM-based embedding

Complex models behind APIs:
• Models have complex behavior and can be stochastic.
• How can you make these APIs stable and compatible? (e.g. LLM pipeline v1.0 → v1.1)