All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (LlamaIndex).pdf

Welcome to ServerlessToronto.org
Introduce Yourself:
- Where from? Why are you here?
- Looking for work or Offering work?
Help us serve you better: bit.ly/slsto
An Evening with Mark Ryan and Jerry Liu
• 6:00 - 6:10 Networking & Opening remarks
• 6:10 - 6:35 Mark Ryan: The LLM Landscape
• 6:35 - 7:15 Jerry Liu: Solving Core Challenges
in RAG Pipelines
• 7:15 - 7:45 Q&A
• 7:45 - 8:00 Manning Publications raffle

Why this Generative AI Talk?
1. Navigating the Tsunami: Understand the sweeping
changes the "GenAI tsunami" brings to industries and jobs.
2. Situational Awareness: Learn from AI leaders Mark Ryan
& Jerry Liu to gain a strategic view of the LLM and RAG
landscape.
3. Career Transformation: Learn to position yourself as the
architect of automation rather than its subject.
4. Practical Advice: Acquire actionable strategies to apply
Generative AI within your enterprise.
5. Interactive Learning: Engage in live Q&A to discuss and
clarify your AI dilemmas with experts.
Battle of Waterloo

What is Serverless Toronto about?
Serverless became New Agile & Mindset
#1 We started as Back-end FaaS Developers who enjoyed 'gluing together' other people's APIs and Managed Services
#2 We build bridges between Serverless Community (“Dev leg”), and Front-end, Voice-First & UX folks (“UX leg”)
#3 We're obsessed with creating business value (meaningful Products), focusing on Outcomes/Impact – NOT Outputs
#4 Achieve agility NOT by “sprinting” faster but working smarter (by using bigger building blocks & less Ops)

Serverless is a State of Mind…
Way too often, we, the IT folks, are obsessed with “pimping up
our cars” (infrastructure / code / pipelines) instead of “driving
the business” forward & taking it places ☺

... It is a way to focus on business value.
It can be applied to any Tech stack, even On-Prem
Jared Short:
1. If the platform has it, use it
2. If the market has it, buy it
3. If you can reconsider requirements, do it
4. If you have to build it, own it.
Ben Kehoe: Serverless is about how you make
decisions, not about your choices.

Upcoming ServerlessToronto.org Meetups
• Friday Lunch & Learn, April 19
• Monday evening, May 6
• Summer 2024

Knowledge Sponsor
1. Go to www.manning.com
2. Select *any* e-Book, Video course, or liveProject you want!
3. Add it to your shopping cart (no more than 1 item in the cart)
4. Raffle winners will send me the emails (used in Manning portal),
5. So the publisher can move it to your Dashboard – as if purchased.
Fill out the Survey to win: bit.ly/slsto

LLM Landscape
A Journey Through A Year of Evolution
Mark Ryan
Developer Knowledge Platform AI Lead, Google Cloud
ryanmark2014@gmail.com

Major Generative AI Milestones: Part 1
Jun 2017
Attention Is All You Need: Seminal paper from Google that introduced transformers
Oct 2018
BERT: Google transformer-based language model
Feb 2019
GPT-2: OpenAI LLM
May 2020
GPT-3: OpenAI LLM
Jan 2021
DALLE: OpenAI image model
May 2021
LaMDA: Google LLM
Aug 2021
Codex: OpenAI code model
Apr 2022
DALLE 2: OpenAI image model
May 2022
Imagen: Google image model
PaLM: Google LLM
Gato: DeepMind multimodal model
Aug 2022
Stable Diffusion: Image model

Major Generative AI Milestones: Part 2
Nov 2022
ChatGPT: Consumer chat from OpenAI initially featuring GPT 3.5
Feb 2023
Bard: Consumer chat from Google
Mar 2023
GPT-4: OpenAI flagship model
ChatGPT Plugins: Connect to third-party applications
Apr 2023
CodeWhisperer: AWS AI coding assistant
May 2023
Vertex AI Gen AI: including curated set of Google, third-party, and open models
PaLM 2: Google flagship model
July 2023
Llama 2: Meta open source LLM licensed for commercial use.
Code Interpreter: OpenAI integrated sandbox environment for data upload and analysis
Aug 2023
Duet AI: AI Assistant for Google Cloud, including chat in console, and general purpose (VSCode) and SQL (BigQuery) code completion/interpretation
Sept 2023
DALLE 3: OpenAI image model
Nov 2023
Q: AWS chatbot
Grok: X chatbot
Dec 2023
Gemini: Google flagship multimodal (text / image / video) models
Feb 2024
Gemini Pro 1.5: 1M context multimodal model
Gemma: Google open model
Sora: OpenAI text to video
Mar 2024
Claude 3: Anthropic flagship models
Devin: Cognition SWE AI

Ecosystem and
Vendor Landscape

The Emerging LLM Ecosystem
Vector databases
Examples: Pinecone, Chroma, Vertex AI Vector Search
Description: Store and find associations between embeddings, high-dimensional vector representations of data
Use Case: Grounding LLM responses in a set of documents (example of RAG)
Encapsulated coding environments
Examples: OpenAI Code Interpreter / Advanced Data Analysis
Description: Upload datasets & ask questions to get visualizations and code running in a limited Python instance
Use Case: Ad hoc data analysis
Plugins / extensions
Examples: ChatGPT plugins / GPTs, Vertex AI extensions
Description: Connect LLMs to third-party / external applications
Use Case: Access current data / query & modify data that is external to the LLM
LLM app development frameworks
Examples: LangChain, LlamaIndex, Autogen
Description: LLM-centric framework to manage workflow (data sources, agents, models, etc)
Use Case: Assembling LLM-based applications

Generative AI Landscape by Vendor
Columns (left to right): Vendor; Prod. Suite Assistance; Developer / Ops Assistant; Consumer Chat; Enterprise Gen AI; Dev / Hobbyist Gen AI; Open Foundation Models
Google: Gemini for Google Workspace; Duet AI for Google Cloud; Gemini; Vertex AI; Google AI for Developers; Gemma
Microsoft: CoPilot 365; Github Copilot; Bing Chat; Azure OpenAI
OpenAI: ChatGPT; ChatGPT; ChatGPT Enterprise; ChatGPT
AWS: Q, CodeWhisperer; Bedrock / Titan
Anthropic: Claude 3*; Claude 3*
Meta: Llama 2*
Mistral: Mixtral 8x7B*

Twitter: @MarkRyanMkm
LinkedIn: www.linkedin.com/in/mark-ryan-31826743
YouTube: @markryan2475

RAG in 2024
Jerry Liu, LlamaIndex co-founder/CEO

LlamaIndex: Context Augmentation for your LLM app

Paradigms for inserting knowledge
Retrieval Augmentation - Fix the model, put context into the prompt
LLM
Before college the two main
things I worked on, outside of
school, were writing and
programming. I didn't write
essays. I wrote what
beginning writers were
supposed to write then, and
probably still are: short
stories. My stories were awful.
They had hardly any plot, just
characters with strong
feelings, which I imagined
made them deep...
Input Prompt
Here is the context:
Before college the
two main things…
Given the context,
answer the following
question:
{query_str}
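The prompt layout on this slide can be sketched as a small template function. This is only an illustration: `PROMPT_TEMPLATE`, `build_prompt`, and the `{context_str}` placeholder are assumed names (only `{query_str}` appears on the slide), and no model is called.

```python
# Template text mirrors the slide; `context_str` is an assumed placeholder name.
PROMPT_TEMPLATE = (
    "Here is the context:\n"
    "{context_str}\n"
    "Given the context, answer the following question:\n"
    "{query_str}\n"
)


def build_prompt(context_chunks: list[str], query_str: str) -> str:
    """Join the retrieved chunks and fill the template; the model weights stay fixed."""
    context_str = "\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context_str=context_str, query_str=query_str)


prompt = build_prompt(
    ["Before college the two main things I worked on were writing and programming."],
    "What did the author work on before college?",
)
```

The resulting string is what would be sent to the LLM; swapping the knowledge source only changes `context_chunks`, not the model.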

Paradigms for inserting knowledge
Fine-tuning - baking knowledge into the weights of the network
LLM
Before college the two main things
I worked on, outside of school,
were writing and programming. I
didn't write essays. I wrote what
beginning writers were supposed to
write then, and probably still are:
short stories. My stories were
awful. They had hardly any plot,
just characters with strong feelings,
which I imagined made them
deep...
RLHF, Adam, SGD, etc.

Current RAG Stack for building a QA System
Diagram: Doc → Chunks → Vector Database → LLM
Stages: Data Ingestion / Parsing, Data Querying
5 Lines of Code in LlamaIndex!

Current RAG Stack (Data Ingestion/Parsing)
Diagram: Doc → Chunks → Vector Database
Process:
● Split up document(s) into even chunks.
● Each chunk is a piece of raw text.
● Generate embedding for each chunk (e.g.
OpenAI embeddings, sentence_transformer)
● Store each chunk into a vector database
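The four ingestion steps above can be sketched with the standard library alone. Everything here is a stand-in: `toy_embed` is a normalized word-count vector rather than a real embedding model, the "vector database" is a plain list, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_embed(text: str) -> dict[str, float]:
    """Stand-in embedding: L2-normalized word counts (a real system calls an embedding model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def ingest(doc: str, chunk_size: int = 8) -> list[dict]:
    """Return an in-memory 'vector store': one record per chunk of raw text."""
    return [
        {"id": i, "text": chunk, "embedding": toy_embed(chunk)}
        for i, chunk in enumerate(split_into_chunks(doc, chunk_size))
    ]


store = ingest(
    "Before college the two main things I worked on, outside of school, "
    "were writing and programming. I didn't write essays."
)
```

The 20-word sample splits into three chunks of 8, 8, and 4 words; chunk size is one of the parameters the later slides suggest tuning.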

Current RAG Stack (Querying)
Diagram: Vector Database → top-k Chunks → LLM
Stages: Retrieval, Synthesis
Process:
● Find top-k most similar chunks from vector database collection
● Plug into LLM response synthesis module
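A minimal sketch of the two querying stages, retrieval then synthesis. It assumes a toy word-count embedding and a stub `llm` callable; `retrieve`, `synthesize`, and `toy_embed` are hypothetical names, not a library API.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: L2-normalized word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, which equals cosine similarity."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def retrieve(query, chunks, top_k=2):
    """Retrieval: return the top-k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]


def synthesize(query, context_chunks, llm):
    """Synthesis: put the retrieved chunks into the prompt and call the LLM."""
    prompt = (
        "Here is the context:\n" + "\n".join(context_chunks)
        + "\nGiven the context, answer the following question:\n" + query
    )
    return llm(prompt)


chunks = [
    "The author wrote short stories before college.",
    "Tesla lists supply chain risk in its 10-K.",
    "The author also worked on programming before college.",
]
question = "What did the author write before college?"
top = retrieve(question, chunks, top_k=2)
answer = synthesize(question, top, llm=lambda p: f"(stub answer from {len(p)} chars)")
```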

Response Synthesis
Create and refine
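One way to read "create and refine": build an answer from the first chunk, then pass the running answer plus each further chunk back to the LLM. The sketch below assumes that reading; the prompt wording and the `create_and_refine` name are illustrative, and `stub_llm` replaces a real model.

```python
def create_and_refine(query, chunks, llm):
    """Create an answer from the first chunk, then refine it once per remaining chunk."""
    answer = ""
    for i, chunk in enumerate(chunks):
        if i == 0:
            prompt = f"Context:\n{chunk}\nAnswer the question: {query}"
        else:
            prompt = (
                f"Question: {query}\nExisting answer: {answer}\n"
                f"New context:\n{chunk}\n"
                "Refine the existing answer if the new context helps."
            )
        answer = llm(prompt)
    return answer


calls = []


def stub_llm(prompt):
    """Stand-in LLM that records prompts and returns a versioned answer."""
    calls.append(prompt)
    return f"answer-v{len(calls)}"


final = create_and_refine("What did the author do?", ["chunk A", "chunk B", "chunk C"], stub_llm)
```

The number of LLM calls equals the number of chunks, and the calls are sequential.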

Response Synthesis
Tree Summarize
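Tree summarization can be sketched as a bottom-up reduction: summarize groups of chunks, then summarize the summaries, until one answer is left. The `fan_in` parameter, the prompt text, and the function name are assumptions for illustration; `stub_llm` replaces a real model.

```python
def tree_summarize(query, chunks, llm, fan_in=2):
    """Recursively combine groups of `fan_in` texts until a single answer remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [
            llm(f"Question: {query}\nSummarize the text to answer it:\n" + "\n".join(group))
            for group in groups
        ]
        if len(level) == 1:
            return level[0]


calls = []


def stub_llm(prompt):
    """Stand-in LLM that records prompts and returns a numbered summary."""
    calls.append(prompt)
    return f"summary-{len(calls)}"


root = tree_summarize("What happened?", ["c1", "c2", "c3", "c4", "c5"], stub_llm, fan_in=2)
```

With five chunks and `fan_in=2` the levels shrink 5 → 3 → 2 → 1, so six LLM calls are made; calls within one level are independent of each other.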

Quickstart
https://colab.research.google.com/drive/1knQpGJLHj-LTTHqlZhgcjDH5F_nJIiY0?usp=sharing

Challenges with “Naive” RAG

RAG
Diagram: Data → Data Parsing + Ingestion → Index → Retrieval → LLM + Prompts → Response
Stages: Data Parsing & Ingestion, Data Querying

Naive RAG
Diagram: Data → Data Parsing + Ingestion → Index → Retrieval → LLM + Prompts → Response
Data Parsing + Ingestion: PyPDF, Sentence Splitting, Chunk Size 256
Retrieval: Dense Retrieval, Top-k = 5
LLM + Prompts: Simple QA Prompt

Easy to Prototype, Hard to Productionize
Naive RAG approaches tend to work well for simple questions over a simple,
small set of documents.
● “What are the main risk factors for Tesla?” (over Tesla 2021 10K)
● “What did the author do during his time at YC?” (Paul Graham essay)

Easy to Prototype, Hard to Productionize
But productionizing RAG over more questions and a larger set of data is hard!
Failure Modes:
● Response Quality: Bad Retrieval, Bad Response Generation
● Hard to Improve: Too many parameters to tune
● Systems: Latency, Cost, Security

Challenges with Naive RAG (Response Quality)
● Bad Retrieval
○ Low Precision: Not all chunks in retrieved set are relevant
■ Hallucination + Lost in the Middle Problems
○ Low Recall: Not all relevant chunks are retrieved.
■ Lacks enough context for LLM to synthesize an answer
○ Outdated information: The data is redundant or out of date.
● Bad Response Generation
○ Hallucination: Model makes up an answer that isn’t in the context.
○ Irrelevance: Model makes up an answer that doesn’t answer the question.
○ Toxicity/Bias: Model makes up an answer that’s harmful/offensive.

Difference with Traditional Software
Diagram: Data → Extract → Transform → Load → Response
Traditional software is defined by a set of programmatic rules.
Given an input, you can easily reason about the expected output.

AI-powered software is defined by a
black-box set of parameters.
It is really hard to reason about what the
function space looks like.
The model parameters are tuned, but the
surrounding parameters (prompt templates)
are not.
Diagram: Data → Data Parsing + Ingestion → Index → Retrieval → LLM + Prompts → Response

If one component of the system is a
black-box, all components of the system
become black boxes.
The more components, the more parameters
you have to tune.
Diagram: Data → Data Parsing + Ingestion → Index → Retrieval → LLM + Prompts → Response

If one component of the system is a
black-box, all components of the system
become black boxes.
Every parameter affects the performance of
the end system.
Diagram: Data → Data Parsing + Ingestion → Index → Retrieval → LLM + Prompts → Response (together labeled RAG)

There Are Too Many Parameters
Every parameter affects the performance of
the entire RAG pipeline.
Which parameters should a user tune?
There are too many options!
Diagram: Data → Data Parsing + Ingestion → Index → Retrieval → LLM + Prompts → Response
● Which PDF parser should I use?
● How do I chunk my documents?
● How do I process embedded tables and charts?
● Which embedding model should I use?
● What retrieval parameters should I use?
● Dense retrieval or sparse?
● Which LLM should I use?

Mapping Pain Points to Solutions

Solution
Categorize by pain point, and establish best practices

Solution
“Seven Failure Points When Engineering a Retrieval
Augmented Generation System”, Barnett et al.

Solution
“12 RAG Pain Points and Proposed Solutions”, by Wenqi Glantz

Pain Points
Response Quality Related
1. Context Missing in the Knowledge
Base
2. Context Missing in the Initial
Retrieval Pass
3. Context Missing After Reranking
4. Context Not Extracted
5. Output is in Wrong Format
6. Output has Incorrect Level of
Specificity
7. Output is Incomplete

Pain Points
Scalability
8. Can't Scale to Larger Data Volumes
11. Rate-Limit Errors
Security
12. LLM Security
Use Case Specific
9. Ability to QA Tabular Data
10. Ability to Parse PDFs

1. Context Missing in the Knowledge Base
● Clean your data: Pick a good document parser (more on this later!)
● Add in Metadata: Inject global context into each chunk
● Keep your data updated: Set up a recurring data ingestion pipeline. Upsert documents to prevent duplicates.
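A sketch of two of the practices above: upserting by document ID so re-ingestion does not create duplicates, and prepending document-level metadata to the stored text. The store is a plain dict and `upsert` is a hypothetical helper, not a vector database API; note that the hash covers the text only, so a metadata-only change is skipped.

```python
import hashlib


def upsert(store, doc_id, text, metadata):
    """Insert or update a document keyed by `doc_id`; skip it when the text is unchanged."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    existing = store.get(doc_id)
    if existing and existing["hash"] == digest:
        return "skipped"
    status = "updated" if existing else "inserted"
    # Prepend document-level metadata so the stored text carries global context.
    header = "; ".join(f"{key}: {value}" for key, value in sorted(metadata.items()))
    store[doc_id] = {"hash": digest, "text": f"[{header}]\n{text}", "metadata": metadata}
    return status


store = {}
meta = {"source": "report.pdf", "year": 2021}
first = upsert(store, "doc-1", "Original text.", meta)
second = upsert(store, "doc-1", "Original text.", meta)
third = upsert(store, "doc-1", "Revised text.", meta)
```

Running the same ingestion job on a schedule then leaves one record per document, refreshed only when its content changes.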

2. Context Missing in the Initial Retrieval Pass
Solution: Hyperparameter tuning for chunk
size and top-k
Solution: Reranking
Source: ColBERT
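Reranking can be sketched as a second scoring pass over the first-pass candidates. The `overlap_scorer` below is a deliberately simple stand-in for a learned reranker (it is not ColBERT), and all names are hypothetical.

```python
def overlap_scorer(query, text):
    """Stand-in for a learned reranking model: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(text.lower().split())) / len(query_words)


def rerank(query, candidates, scorer, top_n):
    """Re-score first-pass candidates with a (typically more expensive) scorer, keep the best."""
    ranked = sorted(candidates, key=lambda c: scorer(query, c), reverse=True)
    return ranked[:top_n]


first_pass = [
    "tesla risk factors include supply chain",
    "tesla opened a new factory",
    "main risk factors for tesla are competition and regulation",
]
best = rerank("main risk factors for tesla", first_pass, overlap_scorer, top_n=2)
```

A common pattern is to retrieve with a larger top-k in the first pass and let the reranker cut the list down before synthesis.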

3. Context Missing After Reranking
Solution: try out fancier retrieval methods
(small-to-big, auto-merging, auto-retrieval,
ensembling, …)
Solution: fine-tune your embedding models
to task-specific data
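As one example of these retrieval methods, small-to-big retrieval can be sketched as matching the query against small units (sentences) and returning the larger parent chunk for synthesis. The word-overlap scoring and all function names are assumptions for illustration; a real system would compare embeddings.

```python
import re


def tokenize(text):
    """Lowercased word set used by the toy matcher."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(parent_chunks):
    """Map each sentence (small unit) to the position of its parent chunk (big unit)."""
    index = []
    for parent_id, parent in enumerate(parent_chunks):
        for sentence in re.split(r"(?<=[.!?])\s+", parent.strip()):
            if sentence:
                index.append((sentence, parent_id))
    return index


def small_to_big_retrieve(query, parent_chunks, top_k=1):
    """Rank sentences by word overlap with the query, then return their parent chunks."""
    query_words = tokenize(query)
    ranked = sorted(
        build_index(parent_chunks),
        key=lambda item: len(query_words & tokenize(item[0])),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == top_k:
            break
    return [parent_chunks[i] for i in parent_ids]


parents = [
    "Uber revenue grew in 2021. Growth was driven by delivery.",
    "Lyft focused on rides. Lyft revenue recovered in 2021.",
]
result = small_to_big_retrieve("What drove growth at Uber?", parents, top_k=1)
```

Matching on small units keeps retrieval precise, while returning the parent gives the LLM enough surrounding context.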

4. Context is there, but not extracted by the LLM
The context is there,
but the LLM doesn’t
understand it.
“Lost in the middle”
Problems.
https://x.com/GregKamradt/status/1722386725635580292?s=20

4. Context is there, but not extracted by the LLM
Solution: Prompt Compression (LongLLMLingua)
Solution: LongContextReorder
Source: LongLLMLingua by Jiang et al.
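The idea behind reordering for long contexts can be sketched as follows: since models tend to use the start and end of the prompt best, place the highest-scoring chunks at both ends and the weakest in the middle. This is a standalone illustration under that assumption, not the library's implementation; the chunk names and scores are made up.

```python
def reorder_for_long_context(scored_chunks):
    """Place the highest-scoring chunks at both ends and the weakest in the middle."""
    ascending = sorted(scored_chunks, key=lambda pair: pair[1])
    reordered = []
    for i, pair in enumerate(ascending):
        if i % 2 == 0:
            reordered.insert(0, pair)  # push earlier (weaker) items toward the middle
        else:
            reordered.append(pair)
    return reordered


# Hypothetical (chunk, relevance score) pairs from a retriever.
scored = [("a", 0.1), ("b", 0.5), ("c", 0.3), ("d", 0.9), ("e", 0.7)]
order = [name for name, _ in reorder_for_long_context(scored)]
```

For the sample scores the order becomes d, b, a, c, e: the two strongest chunks sit first and last, the weakest in the center.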

5. Output is in Wrong Format
A lot of use cases require outputting the
answer in JSON format.
Solutions:
Better text prompting/output parsing
Use OpenAI function calling + JSON mode
Use token-level prompting (LMQL, Guidance)
Source: Guidance
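For the output-parsing route, a minimal sketch using only the standard `json` module: find the first JSON object in a free-text reply and check that the expected keys are present. `parse_json_answer` and the sample reply are hypothetical; function calling or JSON mode would make the extraction step unnecessary.

```python
import json


def parse_json_answer(llm_text, required_keys):
    """Extract the first JSON object from an LLM reply and check required keys."""
    decoder = json.JSONDecoder()
    start = llm_text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(llm_text, start)
        except json.JSONDecodeError:
            start = llm_text.find("{", start + 1)  # not valid JSON here, try the next brace
            continue
        missing = set(required_keys) - set(obj)
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        return obj
    raise ValueError("no JSON object found in reply")


reply = 'Sure! Here is the answer:\n{"answer": "Uber", "confidence": 0.8}\nHope that helps.'
parsed = parse_json_answer(reply, {"answer", "confidence"})
```

On a `ValueError` the caller can retry with a stricter prompt or fall back to a constrained-decoding approach.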

7. Incomplete Answer
What if you have a complex multi-part
question?
Naive RAG is primarily good for answering
simple questions about specific facts.

7. Incomplete Answer
Solution: Add Agentic Reasoning
Diagram: Query → RAG → Response, with “Agents?” marked around the pipeline
Simple (Lower Cost, Lower Latency) → Advanced (Higher Cost, Higher Latency)
Routing → One-Shot Query Planning → Tool Use → ReAct → Dynamic Planning + Execution

8. Scaling your Data Pipeline
Pain points:
● Processing thousands/millions of docs is slow
● How do we efficiently handle document updates?
Reference Production Ingestion Stack
● Parallelize document processing
● HuggingFace TEI
● RabbitMQ Message Queue
● AWS EKS clusters
https://github.com/run-llama/llamaindex_aws_ingestion
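The first item, parallelizing document processing, can be sketched on a single machine with a thread pool. This is not the referenced stack (which uses a message queue and a cluster); `process_document` is a placeholder for parse + chunk + embed, and threads are assumed to fit because that work is mostly I/O-bound calls to external services.

```python
from concurrent.futures import ThreadPoolExecutor


def process_document(doc):
    """Placeholder for parse + chunk + embed; here it only measures the document."""
    return {"chars": len(doc), "words": len(doc.split())}


def ingest_parallel(docs, max_workers=4):
    """Process documents concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_document, docs))


results = ingest_parallel(["first doc", "a second, longer doc", ""])
```

For CPU-heavy parsing, a process pool or separate worker machines would be the closer analogue of the production setup.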

10. Proper RAG over Complex Documents

Advanced Retrieval: Embedded Tables
How do we model complex docs
with embedded tables?
RAG with naive chunking +
retrieval → leads to hallucinations!
Figure: Embedded Table

Advanced Retrieval:
Embedded Tables
Instead: model data hierarchically.
Index tables/figures by their
summaries.
The only missing component:
how do I parse out the tables from
the data?

Most PDF Parsing is Inadequate
Extracts into a
messy format that is
impossible to pass
down into more
advanced
ingestion/retrieval
algorithms.

Introducing LlamaParse
A genAI-native parser
designed to let you build
RAG over complex
documents
https://github.com/run-llama/llama_parse

Introducing LlamaParse
Capabilities
✅ Extracts tables / charts
✅ Input natural language parsing
instructions
✅JSON mode
✅Image Extraction
✅Support for ~10+ document types
(.pdf, .pptx, .docx, .xml)

LlamaParse Results
The best parser at table extraction == the only parser for advanced RAG
Expanded: https://drive.google.com/file/d/1fyQAg7nOtChQzhF2Ai7HEeKYYqdeWsdt/view?usp=sharing

Steerability
Default (no instructions)

Steerability
With Instructions

What’s next for RAG: Agents?

From RAG to Agents
Diagram: Query → RAG → Response, with “Agents?” marked at several points around the pipeline
Agent Definition: Using LLMs for automated reasoning and tool selection
RAG is just one Tool: Agents can decide to use RAG with other tools

From Simple to Advanced Agents
Simple (Lower Cost, Lower Latency) → Advanced (Higher Cost, Higher Latency)
Routing → One-Shot Query Planning → Tool Use → ReAct → Dynamic Planning + Execution

Routing
Simplest form of agentic
reasoning.
Given user query and set of
choices, output subset of
choices to route query to.
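Routing as described here can be sketched as: show a selector the query and the choice descriptions, get back a subset of choice names, and dispatch. In practice the selector is an LLM call; `keyword_selector` below is a stand-in, and the choice names and handlers are hypothetical.

```python
def route(query, choices, selector):
    """Ask `selector` (an LLM in practice) which choices should handle the query."""
    descriptions = {name: choice["description"] for name, choice in choices.items()}
    selected = selector(query, descriptions)
    unknown = [name for name in selected if name not in choices]
    if unknown:
        raise ValueError(f"selector returned unknown choices: {unknown}")
    return {name: choices[name]["handler"](query) for name in selected}


def keyword_selector(query, descriptions):
    """Stand-in for the LLM: summarization requests go to 'summary', the rest to 'qa'."""
    return ["summary"] if "summar" in query.lower() else ["qa"]


choices = {
    "qa": {"description": "Answer specific questions", "handler": lambda q: f"QA result for: {q}"},
    "summary": {"description": "Summarize documents", "handler": lambda q: f"Summary for: {q}"},
}
routed = route("Summarize the essay", choices, keyword_selector)
```

Validating the selector's output matters because an LLM may return a name that is not among the choices.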

Routing
Use Case: Joint QA and
Summarization
Guide

Query Planning
Break down query into
parallelizable sub-queries.
Each sub-query can be
executed against any set of
RAG pipelines
Example query: Compare revenue growth of Uber and Lyft in 2021
● Sub-query: Describe revenue growth of Uber in 2021 → Uber 10-K → top-2: Uber 10-K chunk 4, Uber 10-K chunk 8
● Sub-query: Describe revenue growth of Lyft in 2021 → Lyft 10-K → top-2: Lyft 10-K chunk 4, Lyft 10-K chunk 8
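The Uber/Lyft example can be sketched as plan, parallel execution, then combination. The planner here is hard-coded (an LLM would generate the sub-queries), the pipelines are stubs that echo their input, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(query, sources):
    """Stand-in planner: one sub-query per source; a real planner would derive them from `query`."""
    return [(name, f"Describe revenue growth of {name} in 2021") for name in sources]


def run_plan(query, pipelines, synthesize):
    """Run each sub-query against its own RAG pipeline in parallel, then combine the answers."""
    sub_queries = plan(query, list(pipelines))
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda item: pipelines[item[0]](item[1]), sub_queries))
    return synthesize(query, answers)


# Stub pipelines standing in for RAG over each company's 10-K.
pipelines = {
    "Uber": lambda q: f"[Uber 10-K] {q}",
    "Lyft": lambda q: f"[Lyft 10-K] {q}",
}
result = run_plan(
    "Compare revenue growth of Uber and Lyft in 2021",
    pipelines,
    synthesize=lambda q, answers: " | ".join(answers),
)
```

Because the sub-queries do not depend on each other, they can run concurrently, which is what makes this one-shot plan cheaper than a step-by-step agent loop.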

Query Planning
Example: Compare
revenue of Uber and Lyft in
2021
Query Planning Guide

Tool Use
Use an LLM to call an API
Infer the parameters of that
API

Tool Use
In normal RAG you just pass
through the query.
But what if you used the
LLM to infer all the
parameters for the API
interface?
A key capability in many QA
use cases (auto-retrieval,
text-to-SQL, and more)
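A sketch of parameter inference for tool use: the LLM is assumed to return its chosen arguments as JSON, which are validated against the tool's signature before the call. `search_filings` is a hypothetical API and the `inferred` string is hand-written in place of real model output.

```python
import inspect
import json


def search_filings(company: str, year: int, top_k: int = 2) -> str:
    """Hypothetical API that the LLM is allowed to call."""
    return f"top-{top_k} chunks from the {company} {year} 10-K"


def call_tool(tool, llm_arguments_json):
    """Validate LLM-inferred arguments against the tool's signature, then call it."""
    arguments = json.loads(llm_arguments_json)
    bound = inspect.signature(tool).bind(**arguments)  # TypeError on missing/unknown parameters
    return tool(*bound.args, **bound.kwargs)


# Hand-written stand-in for what an LLM might infer from
# "Describe revenue growth of Uber in 2021".
inferred = '{"company": "Uber", "year": 2021}'
output = call_tool(search_filings, inferred)
```

The same pattern covers auto-retrieval (inferring metadata filters) and text-to-SQL (inferring the query string), with a different tool behind the call.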

This is cool but
● How can an agent tackle sequential multi-part problems?
○ Let’s make it loop
● How can an agent maintain state over time?
○ Let’s add basic memory

Data Agents - Core Components
Agent Reasoning Loop
● ReAct Agent (any LLM)
● OpenAI Agent (only OAI)
Tools
Query Engine Tools (RAG
pipeline)
LlamaHub Tools (30+ tools to
external services)

ReAct: Reasoning + Acting with LLMs
Source: https://react-lm.github.io/

Add a loop around
query
decomposition + tool
use
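The loop plus basic memory can be sketched as below. The step format (`{"action", "input"}` or `{"final"}`) is an assumption made for this example, not the ReAct paper's prompt format, and `scripted_llm` is a deterministic stand-in for a real model.

```python
def react_loop(question, llm, tools, max_steps=5):
    """Alternate LLM decisions and tool calls until the LLM returns a final answer."""
    memory = [("question", question)]  # basic memory: the running trace
    for _ in range(max_steps):
        step = llm(memory)  # {"action": ..., "input": ...} or {"final": ...}
        if "final" in step:
            return step["final"], memory
        observation = tools[step["action"]](step["input"])
        memory.append(("action", f'{step["action"]}({step["input"]})'))
        memory.append(("observation", observation))
    raise RuntimeError("no final answer within max_steps")


def scripted_llm(memory):
    """Stand-in for a real LLM: look up each company once, then answer."""
    observations = [text for kind, text in memory if kind == "observation"]
    if len(observations) == 0:
        return {"action": "lookup", "input": "Uber"}
    if len(observations) == 1:
        return {"action": "lookup", "input": "Lyft"}
    return {"final": " vs ".join(observations)}


tools = {"lookup": lambda company: f"{company} revenue notes"}
answer, trace = react_loop("Compare Uber and Lyft", scripted_llm, tools)
```

Each step sees the full trace so far, which is what lets the agent handle sequential, multi-part questions; `max_steps` bounds cost and latency.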

Superset of query
planning + routing
capabilities.
ReAct + RAG Guide

Can we make this even better?
● Stop being so short-sighted - plan ahead at each step
● Parallelize execution where we can

LLMCompiler
Kim et al. 2023
An agent compiler
for parallel
multi-function
planning +
execution.

LLMCompiler
Plan out steps
beforehand, and
replan as necessary
LLMCompiler Agent
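The "plan beforehand, execute in parallel" idea can be illustrated with a dependency graph: tasks whose inputs are ready run concurrently, wave by wave. This is a generic sketch using the standard `graphlib` module, not the LLMCompiler implementation; the task names and values are placeholders, and replanning is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def execute_plan(tasks, dependencies):
    """Run tasks whose dependencies are done in parallel, wave by wave."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = list(sorter.get_ready())
            # Finish the whole wave before recording results for the next one.
            outputs = list(pool.map(lambda name: tasks[name](results), ready))
            for name, output in zip(ready, outputs):
                results[name] = output
                sorter.done(name)
    return results


# Placeholder plan: two independent fetches, then one step that needs both.
tasks = {
    "fetch_a": lambda r: 10,
    "fetch_b": lambda r: 4,
    "combine": lambda r: r["fetch_a"] - r["fetch_b"],
}
dependencies = {"fetch_a": set(), "fetch_b": set(), "combine": {"fetch_a", "fetch_b"}}
results = execute_plan(tasks, dependencies)
```

In an agent setting the plan (the `dependencies` graph) would come from the LLM, and each task would be a tool call.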

Tree-based Planning
Tree of Thoughts
(Yao et al. 2023)
Reasoning via
Planning (Hao et al.
2023)
Language Agent
Tree Search (Zhou
et al. 2023)

Additional Requirements
● Observability: see the full trace of the agent
○ Observability Guide
● Control: Be able to guide the intermediate steps of an agent step-by-step
○ Lower-Level Agent API
● Customizability: Define your own agentic logic around any set of tools.
○ Custom Agent Guide
○ Custom Agent with Query Pipeline Guide

Additional Requirements
Possible through our
query pipeline syntax
Query Pipeline Guide

What’s next for RAG: Long Contexts?

Is RAG Dead?
https://x.com/Francis_YAO_/status/1759962812229800012?s=20
Gemini 1.5 Pro has a context window of
1M to 10M tokens.
What does this mean for RAG?

Our Position
1. Frameworks are valuable whether or not RAG lives or dies
2. Certain RAG concepts will go away, but others will remain and evolve

Long Context LLMs will Solve the Following
1. Developers will worry less about tuning chunking algorithms
2. Developers will need to spend less time tuning retrieval and
chain-of-thought over single documents
3. Summarization will be easier
4. Personalized memory will be better and easier to build

Some Challenges Remain
1. 10M tokens is not enough for large document corpuses (hundreds of
MB, GB)
2. Embedding models are lagging behind in context length
3. Cost and Latency
4. A KV Cache takes up a significant amount of GPU memory, and has
sequential dependencies
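The KV cache point can be checked with back-of-the-envelope arithmetic: keys and values are stored for every layer, KV head, and token. The model shape below is a hypothetical example, not a figure from the talk, and the estimate assumes batch size 1 and 16-bit values.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (factor 2) are cached per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape: 32 layers, 32 KV heads of dimension 128, fp16 values.
size = kv_cache_bytes(seq_len=1_000_000, n_layers=32, n_kv_heads=32, head_dim=128)
gib = size / 2**30
```

Under these assumptions the cache needs 0.5 MiB per token, so a 1M-token context comes to roughly 488 GiB, which is why long contexts alone do not remove the cost and latency concerns.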

New RAG Architectures
1. Small to Big Retrieval over Documents
2. Intelligent Routing for Latency/Cost Tradeoffs
3. Retrieval Augmented KV Caching

Small to Big Retrieval over Documents

Intelligent Routing for Latency/Cost Tradeoffs

Retrieval Augmented KV Caching

www.ServerlessToronto.org
Reducing the gap between IT and Business needs
