Generative AI - POC - Readout
POC – Readout
& Demo
1
Table of Contents
2
Table of Contents
3
Objective of the POC
Recap on why we set out to do the POC
4
Broad Categories of Generative AI Use Cases
Most of the initial use cases of Gen AI fall into these categories
5
POC was specifically about the Information Retrieval use case
We tested LLMs across various technology platforms, using product manuals as the source data
6
Comparison of Gen AI Deployment Approaches
Build v/s Buy deployment options are being considered
Buy Options tend to have lower overall TCO & faster Time to Market than Build Options
7
Key Takeaways from the Gen AI POC
Next Steps … Evaluate Buy Options including CoPilots before deciding on path forward
8
Generative AI Technology Evaluation as of Nov 2023
Solution selection depends on more than just the choice of a Large Language Model
Information Retrieval against Structured Data
• Many CoPilot options being released into Private Preview
• Hold a lot of promise for faster time to market with lower deployment effort
• License fee considerations vs. TCO for the Build option need vetting

Information Retrieval against Unstructured Data
• Microsoft 365 CoPilot showing a lot of promise for Q&A and summarization against SharePoint
• Q&A and summarization capabilities embedded in the Office 365 suite
• Minimum 300-seat license commitment ($30/user/month, or $9K/month)
9
Gen AI Tech Eval – Build – Information Retrieval on Unstructured Data
** All analysis and recommendations are based on a short 6-week POC using PDF product manual data in varied formats, with validation done on a sample of 10-15 business questions. A full-fledged implementation and business validation is recommended for a closely defined use case before setting up for enterprise-grade scale.
10
Gen AI Tech Eval – Build – Information Retrieval on Structured Data
• GPT 3.5/4 yielded reasonable accuracy (4/8) on a combination of performance + cost for complex query generation
• All models work well with simple queries
• Models other than GPT struggled with the Text-to-SQL step for complex queries
• Significant business context & prompt engineering required to improve accuracy
• Underlying data nuances contributed to inaccurate query generation
* 3/4 = one of the questions produces relevant output with a reproducibility factor of 2/3
** All analysis and recommendations are based on a short 4-week POC leveraging raw data directly from the data warehouse, with validation done on a sample of 8-10 business questions. A full-fledged implementation and business validation is recommended for a closely defined use case before setting up for enterprise-grade scale.
11
Table of Contents
12
Conceptual Architecture
● The Retrieval & Generation pipeline runs whenever a request is triggered from the UI.
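As a minimal sketch of this flow (not the POC's actual code), the snippet below retrieves the top-k manual chunks for a query, assembles an augmented prompt, and asks the LLM to answer only from that context. The `retrieve` and `generate` callables are hypothetical stand-ins for the retriever services (Cognitive Search, Gen AI App Builder, Kendra, ChromaDB) and the LLM endpoints (GPT, PaLM, Titan, LLaMA) compared in the POC.

```python
from typing import Callable, List

# Default fallback used in the POC when the manuals do not contain an answer.
DEFAULT_MESSAGE = "Sorry, Manual does not have the information you are looking for"

def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # retriever: (query, top_k) -> chunks
    generate: Callable[[str], str],             # LLM endpoint: prompt -> completion
    top_k: int = 3,
) -> str:
    """Retrieve top-k chunks, build an augmented prompt, and generate an answer."""
    chunks = retrieve(query, top_k)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n"
        f"If the answer is not in the context, reply exactly: {DEFAULT_MESSAGE}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)
```

Each cloud-specific pipeline evaluated in the POC differs mainly in which retriever and model sit behind these two callables.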
13
Prompt Template
14
Architecture - Azure
15
Architecture - GCP
16
Architecture - AWS
17
Pipeline Combinations for POC
18
UI Walkthrough
19
Retriever & Model performance evaluation approach
* Default Message : ‘Sorry, Manual does not have the information you are looking for’
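As a hedged sketch of how the "X out of N" scores reported in the evaluation summaries could be tallied (grading of the 10-15 business questions was presumably a manual review step), the snippet below runs each question through a pipeline, treats the default message as "not found", and counts correct answers via a pluggable `is_correct` judgment.

```python
from typing import Callable, Dict, List, Tuple

DEFAULT_MESSAGE = "Sorry, Manual does not have the information you are looking for"

def score_pipeline(
    qa_pairs: List[Tuple[str, str]],          # (business question, expected answer)
    answer: Callable[[str], str],             # the RAG pipeline under test
    is_correct: Callable[[str, str], bool],   # grading step (manual review in the POC)
) -> Dict[str, int]:
    results = {"correct": 0, "not_found": 0, "incorrect": 0}
    for question, expected in qa_pairs:
        response = answer(question)
        if response.strip() == DEFAULT_MESSAGE:
            results["not_found"] += 1
        elif is_correct(response, expected):
            results["correct"] += 1          # contributes to the "12/15"-style score
        else:
            results["incorrect"] += 1
    return results
```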
20
Cost Metrics
LLM API Cost (Closed, Out-of-the-Box Models)
Cloud: GCP | Model: PaLM (text-bison) | Pricing: $0.0005 per 1,000 characters ($0.002 per 1,000 tokens) | Cost for 1,000 queries: $3.00

Retriever API Cost (Using Cloud Retrieval Services)

Retriever Deployment Cost (Bring Your Own Vector DB)
Cloud: AWS | Retriever: Chroma DB | Machine type: r5.xlarge | Machine cost (on-demand): $0.26/hr | Cost for 1,000 queries: $6.40
Cloud: Azure | Retriever: Chroma DB | Machine type: Standard D4 v3 | Machine cost (on-demand): $0.376/hr | Cost for 1,000 queries: $9.024
Cloud: GCP | Retriever: Chroma DB | Machine type: e2-standard-4 | Machine cost (on-demand): $0.14/hr | Cost for 1,000 queries: $3.36

LLM Deployment Cost (Open-Source Models)
Cloud: AWS | Model: LLaMA | Machine type: g5.xlarge | Machine cost (on-demand): $1.10/hr | Cost for 1,000 queries: $26.56
Cloud: GCP | Model: LLaMA | Machine type: g2-standard-8 | Machine cost (on-demand): $0.87/hr | Cost for 1,000 queries: $20.88
* All the calculations are done considering 1.5k tokens or 6k characters for both input and output
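The per-1,000-query figures follow from simple arithmetic: API-priced models multiply the per-token price by the assumed ~1.5k tokens per query, while self-hosted retrievers and models spread roughly a day of on-demand VM cost over ~1,000 queries (consistent with the per-day LLaMA costing noted on the model-ranking slide). A small illustrative check, under those assumptions:

```python
# Reproduce the cost-per-1,000-queries figures under the slide's assumptions:
# ~1.5k tokens per query, and self-hosted components billed as roughly one day
# of on-demand VM time spread over ~1,000 queries.

TOKENS_PER_QUERY = 1_500
QUERIES_PER_DAY = 1_000

def api_cost_per_1000_queries(price_per_1k_tokens: float) -> float:
    return price_per_1k_tokens * (TOKENS_PER_QUERY / 1_000) * 1_000

def hosted_cost_per_1000_queries(vm_hourly_rate: float) -> float:
    # The AWS figures in the table are slightly higher, presumably from extra charges.
    return vm_hourly_rate * 24 / QUERIES_PER_DAY * 1_000

print(api_cost_per_1000_queries(0.002))    # ~$3.00  (PaLM text-bison equivalent)
print(hosted_cost_per_1000_queries(0.87))  # ~$20.88 (LLaMA on GCP g2-standard-8)
```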
21
Evaluation Summary - Azure
Pipeline stages: Data Extraction & Ingestion (one-time activity) → Retrieval Augmented Generation (RAG)
Recommended pipeline (S No. 2): Azure Cognitive Search for retrieval with GPT-4 as the LLM — retrieval performance 12/15, model performance 13/15, full-pipeline performance 12/15, cost per query $0.0877
● Azure Cognitive Search and ChromaDB (OpenAI Embedding) are preferred retrievers because of precision and consistent performance.
● Azure OpenAI GPT-4 outperforms other models because of its concise and accurate responses. However, GPT-3.5 is on par with GPT-4, and its cost-effectiveness makes it a viable and preferable choice.
● LLaMA produces poor responses due to its context-length limitation: only one chunk could be passed to the model (see the packing sketch below), so retriever performance for LLaMA is marked as 1.
● LLaMA has an additional overhead of deployment and maintenance.
● ChromaDB also has deployment overhead but is comparatively simpler to manage.
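One contributor to LLaMA's low retrieval score above is prompt packing: retrieved chunks can only be added until the model's context window is exhausted, and with LLaMA's comparatively small window only one chunk fit. A minimal sketch of that packing step, assuming a ~4k-token window and a hypothetical `count_tokens` tokenizer call (both are illustrative assumptions, not POC code):

```python
from typing import Callable, List

def pack_chunks(
    chunks: List[str],
    count_tokens: Callable[[str], int],   # tokenizer for the target model
    prompt_overhead: int = 500,           # instructions + question (assumed)
    context_limit: int = 4_096,           # e.g. a LLaMA-2-sized window (assumed)
) -> List[str]:
    """Keep adding retrieved chunks until the model's context window is full."""
    packed, used = [], prompt_overhead
    for chunk in chunks:                  # chunks arrive ranked by the retriever
        cost = count_tokens(chunk)
        if used + cost > context_limit:
            break
        packed.append(chunk)
        used += cost
    return packed
```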
22
Evaluation Summary - GCP
Pipeline stages: Data Extraction & Ingestion (one-time activity) → Retrieval Augmented Generation (RAG)
Table columns: S No. | Cloud | Data Extraction | Vector Database / Warehouse | Embedding | Retrieval System | LLM Model | Retrieval Performance | Model Performance | Full-Pipeline Performance | Cost per Query ($)
● Gen AI App Builder and ChromaDB (PaLM Embedding) are preferred retrievers because of precision and consistent performance.
● PaLM performs better because of the quality and tone of its responses. LLaMA sometimes produces garbage responses that additionally require post-processing.
● LLaMA has an additional overhead of deployment and maintenance.
● ChromaDB also has deployment overhead but is comparatively simpler to manage.
● Costs of running Chroma DB & UI services on GCP are lower than on Azure by a factor of 3-5
23
Evaluation Summary - AWS
Pipeline stages: Data Extraction & Ingestion (one-time activity) → Retrieval Augmented Generation (RAG); recommended pipeline highlighted
● AWS Textract appears to have parsing limitations for vertical text orientations & multi-column PDFs, which affected parsing quality & overall pipeline performance results
● The best-performing pipeline uses HuggingFace embeddings, Chroma as the retriever, and Titan as the LLM
● Kendra performs comparably to ChromaDB as a retriever; the pipeline with Kendra and the Titan LLM could be improved by increasing the top-k retrieved chunks
● The higher scores for Anthropic do not indicate significantly better performance; they stem from the underlying evaluation assumptions and the model's tendency to hallucinate
● LLaMA has an additional overhead of deployment and maintenance.
● ChromaDB also has deployment overhead but is comparatively simpler to manage.
24
Use Case Heat Map & LLM Model Ranking
Model Performance
* Model Performance rankings are subjective and can be improved with refinements.
** LLaMA in a customized deployment requires an extra level of quality control and post-processing to mitigate toxic, harmful, and biased content.
Model Cost
* Model pricing was calculated assuming the use case involves 1.5k tokens per API call.
** LLaMA is deployed in a VM and is accessed via endpoints. The cost of LLaMA is calculated per day and then extrapolated to per query.
• Overall, GPT 3.5 (11/15) & PaLM (11/15) were the best performers considering the combination of cost & performance
• GPT-4 had the best overall pipeline performance (12/15)
• Titan & Claude overall pipeline performance was skewed by AWS Textract limitations
25
26
Improving the Performance of RAG System
27
Roadmap for SBD
28
Table of Contents
29
Conceptual Architecture
30
Prompt Template
31
Architecture – Azure
2. SQL executor - executes SQL on the client's DB and stores the resulting dataset in cloud storage (Blob).
3. Data processor - generates insights in natural language from the structured data using the LLM and responds back to the UI (see the sketch below).
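A minimal sketch of this query-to-insight flow, with step 1 assumed to be the LLM-based Text-to-SQL generator discussed later in this section; `generate`, `run_sql`, and `store` are hypothetical stand-ins for the LLM endpoint, the client's database connection, and the cloud storage writer (Blob/S3/GCS):

```python
from typing import Callable, List, Tuple

Row = Tuple
Rows = List[Row]

def query_to_insight(
    question: str,
    schema_context: str,                 # data dictionary + business context
    generate: Callable[[str], str],      # LLM endpoint (e.g. GPT, PaLM, Claude)
    run_sql: Callable[[str], Rows],      # client's DB / warehouse connection
    store: Callable[[str, Rows], None],  # writes the result set to Blob/S3/GCS
) -> str:
    # 1. Text-to-SQL: generate a query from the question plus schema context.
    sql = generate(
        f"{schema_context}\n\nWrite a single SQL query that answers: {question}\nSQL:"
    )

    # 2. SQL executor: run the query on the client's DB and persist the dataset.
    rows = run_sql(sql)
    store("query_result.csv", rows)

    # 3. Data processor: turn the structured result into a natural-language insight.
    return generate(
        f"Question: {question}\nQuery result: {rows}\n"
        "Summarize this result as a concise business insight:"
    )
```

The same three steps apply on each cloud; only the storage target (Blob, S3, or GCS) and the model endpoint change.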
32
Architecture – AWS
2. SQL executor - executes SQL on the client's DB and stores the resulting dataset in cloud storage (S3).
3. Data processor - generates insights in natural language from the structured data using the LLM and responds back to the UI.
33
Architecture – GCP
2. SQL executor - executes SQL on the client's DB and stores the resulting dataset in cloud storage (GCS).
3. Data processor - generates insights in natural language from the structured data using the LLM and responds back to the UI.
34
Pipeline combinations for POC
Note
* Claude-v2 is used from the Anthropic model family because of its superiority in code generation capabilities
** text-bison-32k is used from the PaLM model family because of its higher token size and better code generation capabilities.
35
Benchmarking
* 3/4 = one of the questions produces relevant output with a reproducibility factor of 2/3
36
LLM Model Evaluation Summary
Recommended pipeline (Azure): Azure OpenAI GPT-4 / GPT-3.5 for both SQL generation and insight generation — accuracy 4/8, cost per query ~$0.54; performing relatively better than the other combinations.
37
Query to Insights – Evaluation Summary
• GPT 3.5 / 4 was the only combination that yielded reasonable performance (4/8) combined with cost for questions translating to complex queries
• Data quality (DQ) issues contributed to inaccurate query generation
• All models work well with simple queries (1 table)
• All models except GPT struggled with complex SQL query generation
• Significant business context & prompt engineering will be required to deliver improved accuracy for other models on the Text-to-SQL task
• An additional intelligence layer will be required for scale-out, to pre-filter the tables relevant to a question and supply them alongside metadata & business context in the input prompt (sketched below)
• Cost per query can be considerable because of the token count from including instructions, data dictionary, guidelines & business context
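As a sketch of what such an intelligence layer could look like (illustrative only, not what the POC built): score each table's data-dictionary entry against the question and keep only the top matches for the Text-to-SQL prompt. Keyword overlap is used here for simplicity; an embedding similarity would slot in the same way.

```python
from typing import Dict, List

def select_relevant_tables(
    question: str,
    data_dictionary: Dict[str, str],   # table name -> column names + descriptions
    top_n: int = 3,
) -> List[str]:
    """Rank tables by keyword overlap with the question and keep the top few,
    so the Text-to-SQL prompt carries only the relevant slice of the dictionary."""
    q_terms = set(question.lower().split())
    scored = [
        (len(q_terms & set(description.lower().split())), table)
        for table, description in data_dictionary.items()
    ]
    return [table for score, table in sorted(scored, reverse=True)[:top_n] if score > 0]
```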
38
Augmented Prompt for Text-to-SQL Step
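The POC's actual template is not reproduced here; the sketch below shows one plausible shape for the augmented prompt, assembling the components called out on the previous slide (instructions, data dictionary, guidelines, business context) around the business question. The wording and parameter names are illustrative assumptions.

```python
def build_text_to_sql_prompt(
    question: str,
    data_dictionary: str,    # table/column names with descriptions (e.g. from Collibra)
    business_context: str,   # domain definitions, KPI logic, standard filters
    guidelines: str,         # SQL dialect, join rules, date handling, etc.
) -> str:
    return (
        "You are a SQL generator. Write one syntactically correct query that answers "
        "the business question. Use only the tables and columns listed below.\n\n"
        f"Data dictionary:\n{data_dictionary}\n\n"
        f"Business context:\n{business_context}\n\n"
        f"Guidelines:\n{guidelines}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )
```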
39
40
Improving LLM Performance for Querying on Structured Data
41
Summary of LLM Learnings - Recommendations for SBD
❑ Large Language Models offer a multitude of capabilities for natural language and code generation. Contextualizing LLMs for a specific use case needs initial effort but offers possible long-term benefits
❑ It is important to define a focused use case to test how LLMs can be used to generate business impact
❑ Various services offered by different cloud providers are at different levels of maturity but are rapidly evolving
❑ The choice of the right service should be based on current requirements, foreseeable benefits and cost implications
❑ The Data Dictionary must be descriptive, with column names and descriptions. Collibra should be updated accordingly
❑ Data should be clean. It is not ideal to carry out data cleaning and processing through LLMs.
42
Known Limitations of all LLM based Services
❑ Quality of suggestions: Suggestions may depend on the volume of training data for a given language
❑ Security and privacy concerns: Since the models are trained on publicly available data, the models could inadvertently make
suggestions that contain security vulnerabilities or were meant to be private
❑ Don’t fully understand context: While the AI has been trained to understand context, it may not be as capable as a human
developer in fully understanding the high-level objectives of a complex project
❑ Dependent on comments and naming: The AI can provide more accurate suggestions when given detailed comments and
descriptive variable names
❑ Lack of creative problem solving: Unlike a human developer, the tool cannot produce innovative solutions or creatively solve
problems.
❑ Inefficient for large data and context: The models may not be optimized for navigating and understanding large codebases. They are most effective at suggesting code for small tasks.
Therefore, human developers need to learn to use the power of LLMs creatively, combining services and models to best suit their requirements
43
Roadmap for SBD
44
Appendix
45
Enterprise Grade Setup - Evaluating Other Solutions & Products
46
Appendix
47
Enterprise Grade Setup – Evaluating Other Options
48
Architecture for Various Options
49
Enterprise Grade Setup – Evaluating Other Solutions & Products
Features compared across: Customized Solution using LLMs | ChatGPT - Advanced Data Analysis | Power BI CoPilot | ThoughtSpot Sage

Data Size — Customized Solution: multiple data sources and tables; ChatGPT Advanced Data Analysis: a single document up to 512 MB; Power BI CoPilot: multiple data sources and tables prepared into relevant views; ThoughtSpot Sage: can handle curated data supported in the ThoughtSpot data model

Integration with Custom-Built Business Applications — Customized Solution: can be integrated seamlessly with other applications; ChatGPT Advanced Data Analysis: not possible; Power BI CoPilot: not possible through the interface; ThoughtSpot Sage: not possible

Types of Insights — Customized Solution: deep-dive analysis through conversations; ChatGPT Advanced Data Analysis: deep-dive analysis through conversations; Power BI CoPilot: narrative summarization of visuals and data; ThoughtSpot Sage: deep-dive analysis through conversations

Scalability — Customized Solution: multiple users, same data source, concurrent access to insights; ChatGPT Advanced Data Analysis: multiple users, independent data sources, concurrent access to insights; Power BI CoPilot: multiple users, same data source, independent access to insights; ThoughtSpot Sage: multiple users, same data source, concurrent access to insights

Licensing Costs — Customized Solution: costs depend on usage / number of tokens; ChatGPT Advanced Data Analysis: per user; Power BI CoPilot: per user, or power users can generate reports; ThoughtSpot Sage: per user
50
Custom Solution using LLMs
51
Snowflake Cortex
52
PowerBI CoPilot
53
ThoughtSpot Sage
54
Snowflake CoPilot Snapshot
55
Model Specific Fine-tuning: AWS Claude Model
56