EMPOWERING SALES WITH INTELLIGENT BI AGENTS: DWH, LLM, AND RAG INTEGRATION
ŠIMUN ŠUNJIĆ
LOVRO MATOŠEVIĆ
Challenges
User-related and technically related
Can simple retry help?
BI LLM Advanced Database Chat System
• LLM-enhanced SQL generation
• Retrieval-Augmented Generation (RAG) for context understanding
• Multi-agent architecture for autonomous management of various processing aspects and state retention
• Graph databases for global schema context and understanding
• User-friendly UI for visualization and reporting
Technical Deep Dive: Core, Integration, and Data layers
Schema Linking: Why?
Essential for handling complex, multi-table queries

Schema Linking: How?
• Embed the user query and the schema for similarity search
• Periodically update the schema
• Pick only the relevant portion of the schema
• Few-shot prompting with golden SQL queries and relevance scoring
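As a rough illustration of the "embed and pick a portion of the schema" step, the sketch below ranks one-line table descriptions against the user question with sentence-transformers; the model choice and the schema_docs contents are assumptions for the example, not the presenters' actual setup.

# Sketch: select the schema portion most relevant to the user question.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical one-line descriptions per table (would be derived from the DWH DDL).
schema_docs = {
    "f_sales": "fact table: sales orders, extended price, bill customer, order date",
    "d_customers": "dimension: customer sid, customer name, segment",
    "d_date": "dimension: calendar date, date sid, quarter, year",
    "d_products": "dimension: product sid, product name, category",
}
table_names = list(schema_docs)
table_vecs = model.encode(list(schema_docs.values()), normalize_embeddings=True)

def link_schema(question: str, top_k: int = 3) -> list[str]:
    """Return the top_k tables most similar to the question."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = table_vecs @ q_vec            # cosine similarity (embeddings are normalized)
    ranked = np.argsort(scores)[::-1][:top_k]
    return [table_names[i] for i in ranked]

print(link_schema("Which customers grew their average order value this quarter?"))
# e.g. ['f_sales', 'd_customers', 'd_date']

Only the DDL of the selected tables is then passed on to query generation, which keeps prompts small even on wide warehouse schemas.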
Schema Linking: Conclusion
Bridges the gap between natural language and database structure
Query Generation Process
Intent detection: properly detect users' intent via confidence scoring
Recommendation utterances: guide users toward better query formulation, using natural language parsers / an LLM
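A minimal sketch of confidence-scored intent detection, reusing the same embedding idea; the intent labels, exemplars, threshold, and fallback recommendation are invented for illustration (the slide equally allows an LLM or natural language parser in this role).

# Sketch: score the query against intent exemplars and route by confidence.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

INTENT_EXAMPLES = {
    "analytical":  "compare totals, trends and growth rates over time",
    "exploratory": "show me what data exists about a topic",
    "operational": "export, schedule or refresh a report",
}
intent_names = list(INTENT_EXAMPLES)
intent_vecs = model.encode(list(INTENT_EXAMPLES.values()), normalize_embeddings=True)

def detect_intent(question: str, threshold: float = 0.35):
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = intent_vecs @ q_vec
    best = int(np.argmax(scores))
    if scores[best] < threshold:
        # Low confidence: return a recommendation utterance instead of guessing.
        return None, "Could you rephrase? For example: 'Show sales by region for Q3.'"
    return intent_names[best], float(scores[best])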
Query Generation Process
System prompts: guide the LLM by injecting few-shot examples, relevant values from high-cardinality columns, and the relevant portion of the DDL context
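The prompt-assembly step might look roughly like this; the template wording and the helper's signature are illustrative only.

# Sketch: build the system prompt from few-shot examples, sampled
# high-cardinality values and the relevant slice of the DDL.
def build_system_prompt(ddl_slice: str,
                        few_shot: list[tuple[str, str]],
                        value_hints: dict[str, list[str]]) -> str:
    shots = "\n\n".join(f"-- Q: {q}\n{sql}" for q, sql in few_shot)
    hints = "\n".join(f"{col}: {', '.join(vals)}" for col, vals in value_hints.items())
    return (
        "You translate business questions into SQL for the warehouse described below.\n"
        "Use only the tables and columns shown. Return a single SELECT statement.\n\n"
        f"### Relevant DDL\n{ddl_slice}\n\n"
        f"### Sample values from high-cardinality columns\n{hints}\n\n"
        f"### Golden examples\n{shots}\n"
    )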
Query Generation Process
Conversation history: a rolling window of recent turns kept in cache, plus persisted database records
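A possible shape for that history, assuming the rolling window of roughly five recent turns mentioned in the speaker notes; the Turn fields are illustrative.

# Sketch: rolling window of recent turns plus the metadata they referenced.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Turn:
    question: str
    sql: str
    tables: list[str] = field(default_factory=list)

class ConversationHistory:
    def __init__(self, max_turns: int = 5):
        self.turns = deque(maxlen=max_turns)   # oldest turns fall out automatically

    def add(self, turn: Turn) -> None:
        self.turns.append(turn)

    def as_context(self) -> str:
        # Compact textual context injected into the next prompt.
        return "\n".join(f"Q: {t.question}\nSQL: {t.sql}" for t in self.turns)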
Query Generation Process
Chatty agents: make sure agents don't fall into endless recursion (self-correction is capped at a few iterations)
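A sketch of that guard, assuming the three-iteration cap mentioned in the speaker notes; generate and validate stand in for the real agent calls.

# Sketch: cap self-correction so chatty agents cannot loop forever.
def generate_with_retries(question: str, generate, validate, max_iters: int = 3):
    error = None
    for _ in range(max_iters):
        sql = generate(question, error)   # the previous error message guides the retry
        ok, error = validate(sql)
        if ok:
            return sql
    raise RuntimeError(f"Gave up after {max_iters} attempts: {error}")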
Query Generation Process
Context manager: ensure
agents share common
context storage
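The speaker notes mention Redis for session state alongside a vector store for schema metadata; a minimal sketch of the shared session-state side is below (key naming and TTL are assumptions).

# Sketch: one context store that every agent reads from and writes to.
import json
import redis

class ContextStore:
    def __init__(self, url: str = "redis://localhost:6379/0", ttl_s: int = 3600):
        self.r = redis.Redis.from_url(url)
        self.ttl_s = ttl_s

    def save(self, session_id: str, state: dict) -> None:
        self.r.setex(f"ctx:{session_id}", self.ttl_s, json.dumps(state))

    def load(self, session_id: str) -> dict:
        raw = self.r.get(f"ctx:{session_id}")
        return json.loads(raw) if raw else {}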
Query Generation Process
Query optimizer: built into the generation process, with SQL validation and fixing
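A sketch of the first two validation stages, using sqlglot for parsing (an illustrative choice, not necessarily the presenters' tooling); a failed check returns the error text that feeds the self-correction retry.

# Sketch: syntax check, then a known-table check against the cached schema.
import sqlglot
from sqlglot import exp
from sqlglot.errors import ParseError

KNOWN_TABLES = {"f_sales", "d_customers", "d_date"}   # would come from the schema cache

def validate_sql(sql: str):
    try:
        tree = sqlglot.parse_one(sql, read="postgres")
    except ParseError as e:
        return False, f"syntax error: {e}"
    unknown = {t.name for t in tree.find_all(exp.Table)} - KNOWN_TABLES
    if unknown:
        return False, f"unknown tables: {sorted(unknown)}"
    return True, None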
High-cardinality columns
• Columns with millions of unique values, such as product IDs
• Help convert vague user terms into specific database values
• Drive decomposition of the user query into sub-queries
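A sketch of the vague-term-to-value conversion described in the speaker notes (embed the column's values once, then match at query time); the product names and the model are made up for the example.

# Sketch: map a vague user term onto concrete values of a high-cardinality column.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
product_names = ["UltraBook 14 Pro", "UltraBook 14 Air", "MaxTab 11", "SoundPod Mini"]
product_vecs = model.encode(product_names, normalize_embeddings=True)   # done offline

def resolve_value(term: str, top_k: int = 3) -> list[str]:
    """'the small tablet' -> closest concrete product names for the WHERE clause."""
    t_vec = model.encode([term], normalize_embeddings=True)[0]
    order = np.argsort(product_vecs @ t_vec)[::-1][:top_k]
    return [product_names[i] for i in order]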
Query Generation Process
• The system maintains conversation context and understands business terminology, allowing users to ask follow-up questions naturally.
• For example, after seeing sales data, users can simply ask, without needing to specify all the details again:
  o Show this as a chart
  o Compare with previous year
Agents
• Generate skeleton SQL using graphs and similarity search
• Improve the WHERE clause with high-cardinality data
• A sub-query agent breaks the query down into its components
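The hand-off between these agent roles can be pictured as a linear pipeline; the stubs below only show the shape of that hand-off, not real generation logic, and all helper names are hypothetical.

# Sketch: skeleton SQL -> WHERE refinement -> sub-query decomposition.
def skeleton_sql(question: str, tables: list[str]) -> str:
    return f"SELECT * FROM {tables[0]}  -- skeleton for: {question}"

def refine_where(sql: str, value_hints: dict[str, list[str]]) -> str:
    col, vals = next(iter(value_hints.items()))
    quoted = ", ".join(f"'{v}'" for v in vals)
    return sql + f"\nWHERE {col} IN ({quoted})"

def split_subqueries(sql: str) -> list[str]:
    return [sql]   # a real agent would break out nested parts here

def answer(question: str, tables: list[str], value_hints: dict) -> list[str]:
    return split_subqueries(refine_where(skeleton_sql(question, tables), value_hints))

print(answer("top products by revenue", ["f_sales"], {"product_name": ["UltraBook 14 Pro"]}))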
Graphs
• "What is the most effective sales strategy employed by a contemporary of top sales leaders in the industry?"
• Reason about relationships to create a DAG (directed acyclic graph)
• Extract non-local entities connected through multi-hop paths
• Identify root node -> graph -> sub-graph = query -> sub-query
Traditional efficient search methods such as locality-sensitive hashing, which are designed for similarity search, are not well suited to extracting complex structural patterns such as paths or subgraphs. The extracted structural information must cover the critical evidence needed to answer the query without exceeding the reasoning capacity of LLMs. Expanding the context window increases computational complexity and can degrade RAG performance by introducing irrelevant information.
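A sketch of the graph step, assuming networkx and a toy foreign-key graph: find the tables behind the entities mentioned in the question, then extract the multi-hop sub-graph (and hence the join path) that connects them.

# Sketch: schema relationships as a directed graph; a query maps to the
# sub-graph connecting its entities, sub-queries to smaller sub-graphs.
import networkx as nx

g = nx.DiGraph()
g.add_edge("f_sales", "d_customers", fk="BILL_CUSTOMER_SID")
g.add_edge("f_sales", "d_date", fk="ORDER_DATE_SID")
g.add_edge("f_sales", "d_products", fk="PRODUCT_SID")

def connecting_subgraph(entities: list[str]) -> nx.DiGraph:
    """Smallest set of tables/joins linking every mentioned entity (multi-hop)."""
    undirected = g.to_undirected()
    nodes = set(entities)
    root = entities[0]                      # treat the first entity as the root node
    for target in entities[1:]:
        nodes.update(nx.shortest_path(undirected, root, target))
    return g.subgraph(nodes).copy()

sub = connecting_subgraph(["d_customers", "d_date"])
print(sub.edges(data=True))   # the join path is routed through f_sales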
Fine-Tuning Llama 3.1
Enhancing Query Generation Accuracy
Fine-tuning adapts Llama 3.1 to our client's specific needs
Improves understanding of complex queries
Fine-Tuning Llama 3.1
Customization to Business Context
Incorporates specific terminology and data schemas
Aligns the model with industry-specific language
Fine-Tuning Llama 3.1
Handling Domain-Specific Terms
Examples:
"Customer churn" in telecom
"Inventory turnover" in retail
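One common way to do this kind of adaptation is parameter-efficient fine-tuning; the sketch below sets up LoRA adapters with Hugging Face peft. The model ID, target modules, and hyperparameters are illustrative, and the actual training loop over the (question, SQL) pairs (for example with trl's SFTTrainer) is omitted.

# Sketch: wrap Llama 3.1 with LoRA adapters before text-to-SQL fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-3.1-8B-Instruct"   # gated repo; access must be granted
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only a small fraction of the weights are trained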
Crafting a Custom Synthetic Dataset
Utilizing LLMs for Dataset Generation
Employed models like ChatGPT and Anthropic's Claude
Generated tailored question-query pairs for the client's DWH
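The pair-generation step could be sketched as below, assuming the OpenAI Python client; the prompt, model name, and JSON handling are simplified, and every generated pair is still validated against the warehouse before it enters the training set.

# Sketch: generate (question, SQL) pairs from the warehouse DDL with a hosted LLM.
import json
from openai import OpenAI

client = OpenAI()   # expects OPENAI_API_KEY in the environment

def generate_pairs(ddl: str, n: int = 5, complexity: str = "complex") -> list[dict]:
    prompt = (
        f"Given this data warehouse DDL:\n{ddl}\n\n"
        f"Write {n} {complexity} analytical questions a sales analyst might ask, "
        "each with a correct SQL answer. "
        'Respond with JSON: [{"question": "...", "sql": "..."}, ...]'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)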
Crafting a Custom Synthetic Dataset
Building a Robust Training Set
Created and validated 100 extremely complex queries
Added 200 less complex queries for comprehensive coverage
Crafting a Custom Synthetic Dataset
Example of an extremely complex query from the training set
WITH CustomerOrderValues AS (
SELECT
"f_sales"."BILL_CUSTOMER_SID",
DATE_TRUNC('quarter', "d_date"."DATE") AS "quarter",
AVG("f_sales"."EXTENDED_PRICE") AS "avg_order_value",
COUNT(DISTINCT "f_sales"."SALES_DOCUMENT_SID") AS "order_count"
FROM "f_sales"
JOIN "d_date" ON "f_sales"."ORDER_DATE_SID" = "d_date".date_sid
WHERE "d_date"."DATE" >= DATE_TRUNC('quarter', CURRENT_DATE) - INTERVAL '3 months'
AND "d_date"."DATE" < DATE_TRUNC('quarter', CURRENT_DATE) + INTERVAL '3 months'
GROUP BY "f_sales"."BILL_CUSTOMER_SID", DATE_TRUNC('quarter', "d_date"."DATE")
),
CustomerGrowth AS (
SELECT
"c"."CUSTOMER_SID",
"c"."CUSTOMER_NAME",
"cov_current"."avg_order_value" AS "current_avg_order_value",
"cov_previous"."avg_order_value" AS "previous_avg_order_value",
("cov_current"."avg_order_value" - "cov_previous"."avg_order_value") / "cov_previous"."avg_order_value" AS "growth_rate"
FROM "d_customers" "c"
JOIN CustomerOrderValues "cov_current" ON "c"."CUSTOMER_SID" = "cov_current"."BILL_CUSTOMER_SID"
JOIN CustomerOrderValues "cov_previous" ON "c"."CUSTOMER_SID" = "cov_previous"."BILL_CUSTOMER_SID"
WHERE "cov_current"."quarter" = DATE_TRUNC('quarter', CURRENT_DATE)
AND "cov_previous"."quarter" = DATE_TRUNC('quarter', CURRENT_DATE) - INTERVAL '3 months'
AND "cov_current"."order_count" >= 5
AND "cov_previous"."order_count" >= 5
),
CompanyAverage AS (
SELECT SUM("f_sales"."EXTENDED_PRICE") / COUNT("f_sales"."SALES_DOCUMENT_SID") AS "company_avg_order_value"
FROM "f_sales"
JOIN "d_date" ON "f_sales"."ORDER_DATE_SID" = "d_date".date_sid
WHERE "d_date"."DATE" >= DATE_TRUNC('quarter', CURRENT_DATE)
AND "d_date"."DATE" < DATE_TRUNC('quarter', CURRENT_DATE) + INTERVAL '3 months'
)
SELECT
"cg"."CUSTOMER_NAME",
ROUND("cg"."current_avg_order_value", 2) AS "current_avg_order_value",
ROUND("cg"."previous_avg_order_value", 2) AS "previous_avg_order_value",
ROUND("cg"."growth_rate" * 100, 2) AS "growth_percentage",
ROUND(("cg"."current_avg_order_value" - "ca"."company_avg_order_value") / "ca"."company_avg_order_value" * 100, 2) AS "percent_diff_vs_company_avg" -- alias name assumed; truncated on the original slide
FROM CustomerGrowth "cg"
CROSS JOIN CompanyAverage "ca"
WHERE "cg"."growth_rate" > 0
ORDER BY "cg"."growth_rate" DESC
LIMIT 10;
"Which customers have
shown the highest increase
in average order value from
last quarter to this quarter,
and how does their current
performance compare to
the overall company
average?"
Crafting a Custom Synthetic Dataset
Expanding Through Paraphrasing
Addressed natural language ambiguities
Prompted LLMs to paraphrase the questions, enriching the dataset's linguistic diversity
Crafting a Custom Synthetic Dataset
Impact on Model Performance
Improved accuracy and relevance in query results
Enhanced ability to handle varied expressions and complex queries
System in Action
Expanding Capabilities and Tackling Challenges
• Current Challenges:
  • Ongoing model tuning for diverse datasets
  • Context management for complex queries
  • SQL accuracy and injection prevention
• Future Directions:
  • Reduced latency
  • Higher precision during query generation
  • Better intent detection
Thank you!

Editor's Notes

  • #1: Good afternoon everyone. My name is Lovro, and today Šimun and I are going to present our BI Chat System, which utilizes the latest advances in LLMs and Retrieval-Augmented Generation to empower sales agents and make their lives easier. We'll explore the inner workings of the system, giving you an idea of how some of the common problems faced when working with text-to-SQL systems can be solved. But first, let me briefly present our agenda. (30 seconds)
  • #2: Here’s a quick agenda overview: We'll start with an introduction to the motivations behind this approach, then dive into the challenges that traditional BI systems face, followed by our proposed solution, focusing on the architecture components and how they interact. After that, we’ll see a demonstration of the system, and finally, we’ll briefly discuss future directions to improve scalability and adaptability. This framework will give us a well-rounded view of how these technologies transform data querying and decision-making for sales. 25 – 30 sec
  • #3: Let's start with the user challenges on the left. For many, working with data can feel like an uphill battle. There's the steep learning curve of SQL, the complexity of writing queries, and the constant pressure to deliver results. In the middle column, we see challenges that overlap between user and technical issues. These include natural language ambiguities, multi-table joins, and query inaccuracies. These can create delays and disconnects between what users want and what the system provides. Finally, on the right, we have the technical challenges. Managing multiple data sources, navigating complex schemas, and maintaining data integrity require significant effort and precision. (Up to here: 30 seconds.) This is why we're addressing these challenges holistically: to make data exploration smoother for users while ensuring robust systems behind the scenes, so teams can focus on insights and decisions instead of struggling with roadblocks.
  • #4: Let's address the question: Can simple retry help? On the left, we see examples of common issues in SQL generation, grouped into categories like Schema Contradiction, Value Misrepresentation, and Join Redundancy. These types of errors highlight how models can make mistakes, sometimes due to database structure, other times from misinterpreting user intent or logic. For example, schema contradictions happen when the query conflicts with the database's structure, while join redundancies create unnecessary complexity in the query. Now, on the right, we illustrate the role of a retry mechanism. After a user query is validated, if it fails the validation checks, the system doesn't stop. Instead, it goes through an error handler that applies auto-correction or allows user feedback to refine the query. Then, it retries. So, does simple retry help? In our implementation, it did reduce a significant number of errors. When the model can examine where it went wrong, this self-reflection often leads to better results on the second attempt. While it's not a universal fix, it's a step forward in improving query accuracy. (45 seconds)
  • #5: This brings us to our BI LLM Advanced Database Chat System. At its core, this system is designed to make data querying as natural as possible, almost like chatting with a knowledgeable assistant. Let's break down its main components: LLM-enhanced SQL generation translates natural language requests into SQL commands accurately. Retrieval-Augmented Generation (RAG) ensures the system understands the context of each query and the schema. Multi-agent architecture enables different parts of the system to work together, from state retention to processing various query types. Graph databases provide a global schema context, making it easier to map complex relationships in the data. Finally, a user-friendly UI makes it simple for users to visualize data and interact with the system. (45-55 seconds)
  • #6: And now we'll take a closer, more technical look at the inner workings of the system. I will briefly present an overview of the system components, and then Šimun will take over to provide some more specific technical details. (10 seconds)
  • #7: This slide highlights the Core Processing Layer, the central system that powers how queries are processed, interpreted, and executed. Let's walk through each of the components to understand how they work together.
    First, we have the Intent Analyzer, which determines the purpose behind the user's query. This component uses a combination of regular expressions (regex) and LLMs to categorize the query type. For instance, it identifies whether the query is analytical, exploratory, or operational. It also makes decisions like selecting which graph or visualization to use if the user hasn't specified it. This ensures that the system responds appropriately to the user's exact needs.
    Next is the Context Processor, which adds depth to the system's understanding. It analyzes the query's broader context, considering previous interactions, the user's history, and any domain-specific nuances. This ensures that the response isn't just accurate but also relevant and consistent with the user's ongoing workflow.
    Moving to the SQL Generator, this is where the magic of converting intent into action happens. Here, we use a fine-tuned Llama model, specifically trained to generate SQL queries tailored to the user's database schema. This allows the system to handle complex queries efficiently and with high precision.
    Once the SQL query is generated, the Validation Agent takes over. It rigorously checks the query for errors, such as schema mismatches, logical inconsistencies, or potential inefficiencies. This step ensures the query is ready to execute without issues.
    The Optimization Agent then refines the validated query to ensure it runs as efficiently as possible. For example, it may reduce redundant joins or optimize query paths to improve performance, particularly for large datasets.
    Finally, the Coordinator Agent acts as the conductor, orchestrating the entire process. It ensures smooth communication between these components and delivers the final results back to the user, whether that's raw data, an interactive visualization, or an automated report.
    Together, these components form a robust and integrated system that bridges user input with database operations, ensuring precision and efficiency. (1 minute 15 seconds)
  • #8: This slide focuses on the Integration Layer, which ensures our system connects seamlessly to both internal and external components. Let's walk through each integration point.
    Starting with LLM Integration, this component is designed to connect to the most popular large language models, such as OpenAI, Anthropic, Ollama, or even custom fine-tuned models like our Llama implementation. This flexibility allows the system to adapt to different use cases and user preferences while processing natural language queries and interacting with other core components like the SQL generator and intent analyzer.
    Next, we have Database Integration, which links the system directly to various data sources, including relational databases, graph databases, and data warehouses. It supports multiple SQL dialects and handles complex schemas, ensuring compatibility with diverse database environments.
    Monitoring Integration tracks the system's performance in real time, covering metrics like query execution time, resource utilization, and system health. This ensures that the system remains stable and any potential issues are detected early.
    Then there's Cache Integration, which significantly boosts performance by storing results from frequently executed queries. This reduces the load on databases and ensures faster response times, which is particularly useful for repetitive queries or high-demand environments.
    Lastly, External Services Integration connects the system to third-party APIs and platforms, such as CRM or ERP systems. This allows users to pull data from other tools or push insights into existing workflows, making the system a versatile part of broader business operations.
    With these integrations, the system is not only adaptable to various environments but also capable of leveraging the latest advancements in AI and data handling technologies, providing a flexible and reliable solution for users. (1:30-1:45 minutes)
  • #9: The graph structure naturally handles complex joins and nested queries by representing them as connected subgraphs. This approach requires sophisticated knowledge graph management and careful information retrieval to stay within LLM context windows while maintaining query accuracy.
  • #10: When users ask questions about their data, they rarely use exact table or column names. Important for: Understanding foreign key connections, managing join conditions automatically, handling nested queries and aggregations.
  • #11: We embed both the user's query and the database schema into the same vector space to capture contextual relationships. Schema management is handled through periodic updates to ensure accuracy and freshness.  Table relationships are cached to optimize performance. We employ few-shot learning with carefully curated golden SQL queries as examples, which helps the system learn common patterns in schema linking.  The relevance scoring mechanism ensures that only the most pertinent tables are selected for query generation. This is particularly important for large databases where selecting too many tables could lead to performance issues.
  • #12: The impact extends beyond just query generation - it enables better data discovery, improves query accuracy, and reduces the learning curve for database interactions.
  • #14: Intent detection commonly uses confidence scoring with transformer-based models: we first detect query intent through semantic parsing, then validate it against schema constraints. Our goal is to generate multiple candidate interpretations of user intent and then score them based on schema compatibility and semantic similarity. Recommendation utterances help guide users to better query formulation, and confidence thresholds drive intent routing.
  • #15: Carefully crafted system prompts significantly improve SQL generation accuracy. The key is providing contextually relevant examples rather than generic ones. 
  • #16: Keeping a rolling window of recent queries while strategically caching database records helps maintain conversation coherence without compromising system responsiveness. The optimal approach is keeping the last 5 interactions with metadata about tables and columns referenced, allowing the system to maintain context while managing memory efficiently.
  • #17: Here we tried different agentic frameworks; the conclusion is to set a maximum recursion depth and a timeout, limiting recursive self-correction attempts to 3 iterations, which provides the best balance between accuracy and performance.
  • #18: A centralized context manager using vector stores for schema metadata and Redis for session state provides the best performance. This approach reduced query generation latency by 40% compared to distributed context management.
  • #19: Multi-stage validation process improves query accuracy. The process includes syntax validation, schema validation, and execution plan optimization, with each failed validation triggering a self-correction loop using the specific error context.
  • #20: High-cardinality columns containing millions of unique values (like product IDs) present a unique challenge in Text-to-SQL systems. BI Chat handles this by creating specialized embeddings for high-cardinality columns and storing them in vector indices, allowing for efficient similarity search when users use natural language terms. For converting vague terms to specific database values, the system employs a two-step process: first creating embeddings of the column values during preprocessing, then using these embeddings to find the closest matches during query time. This approach significantly improves accuracy when dealing with product names, IDs, or other high-cardinality fields. The query decomposition strategy breaks down complex user queries into sub-queries based on the cardinality of involved columns. When high-cardinality columns are detected, the system generates intermediate queries to first resolve specific values before incorporating them into the final SQL query, improving both accuracy and performance.
  • #21: The agent executor orchestrates the workflow between different specialized tools - each handling specific aspects like SQL generation, visualization, and system operations. RAG (Retrieval Augmented Generation) plays a crucial role by providing relevant context from previous interactions and business terminology stored in the agent memory. The feedback loop (shown as step 7 in the diagram) is particularly important - it helps improve accuracy through self-correction mechanisms. When users request visualizations or comparisons, the system doesn't need to regenerate the entire context but rather builds upon the existing query context maintained in memory. The visualization tool integration (step 3) allows seamless transitions between different representation formats without losing the semantic context of the original query. This is achieved through a stateful conversation manager that maintains not just the query history but also the business context and user preferences. The system's ability to handle natural follow-ups like "Show this as a chart" or "Compare with previous year" demonstrates sophisticated context management - it understands that "this" refers to the previous query results and "previous year" requires temporal context manipulation of the existing query rather than generating a new one from scratch.
  • #22: Graph-based approaches for SQL generation achieve superior results by modeling schema relationships as interconnected nodes. Combining graph traversal with vector similarity search improves accuracy by up to 15% compared to traditional methods. The graph structure naturally represents table relationships and foreign keys, while similarity search helps identify relevant schema components. Sampling representative values from high-cardinality columns and incorporating them into the prompt improves query accuracy by 23%. Agents breaking down complex queries into manageable components. Agents for sub-query generation can improve nested query accuracy by up to 31%. Each agent focuses on a specific aspect (joins, aggregations, or filtering), while maintaining context through a shared memory system.
  • #23: Graph-to-sequence models significantly outperform traditional sequence-to-sequence approaches by encoding global structural information into node embeddings. First, the system identifies the root node (typically the main entity or action). Then, it constructs a Directed Acyclic Graph (DAG) by analyzing relationships between entities. This graph is then decomposed into meaningful subgraphs that map directly to SQL sub-queries. Multi-hop reasoning through graphs helps capture non-local dependencies that simple sequence models might miss. For example, when analyzing "sales strategies of top performers", the system creates paths connecting sales data, performance metrics, and strategy attributes through multiple hops. The graph structure naturally handles complex joins and nested queries by representing them as connected subgraphs. This approach requires sophisticated knowledge graph management and careful information retrieval to stay within LLM context windows while maintaining query accuracy.
  • #24: Firstly, enhancing query generation accuracy. By fine-tuning Llama 3.1 specifically for our client's needs, we've adapted the model to better understand complex queries unique to their business context. This customization means the model isn't just interpreting generic queries but is fine-tuned to the intricacies of the client's data and requirements. Next, customization to business context. We've tailored the model to incorporate specific business terminology and data schemas. This involves training the model on the client's unique datasets and vocabulary, ensuring it comprehends and accurately processes queries that reflect their actual operations and data structures. Handling domain-specific terms is crucial. For instance, in the telecom industry, terms like "customer churn" are common, while in retail, phrases like "inventory turnover" are frequently used. By integrating these domain-specific terms into the model's training, we've enabled it to interpret and generate precise queries that align with industry-specific language. The impact of this fine-tuning is significant. We've achieved higher accuracy and relevance in query results, which not only improves efficiency but also empowers users across different industries to extract meaningful insights from their data without getting bogged down by technical complexities. In summary, fine-tuning Llama 3.1 has allowed us to create a more accurate, context-aware, and efficient tool for text-to-SQL query generation, tailored specifically to our client's needs.
  • #25: And now for the fine-tuning part. We fine-tuned a Llama 3.1 model specifically for the task of text-to-SQL query generation. Afterwards, we will briefly discuss the synthetic text-to-SQL dataset we created. So why is fine-tuning important? Firstly, enhancing query generation accuracy. By fine-tuning Llama 3.1 specifically for our client's needs, we've adapted the model to better understand complex queries unique to their business context.
  • #26: Next, customization to business context. We've tailored the model to incorporate specific business terminology and data schemas. This involves training the model on client's database schema, ensuring it comprehends and accurately processes queries that reflect their actual operations and data structures.
  • #27: Handling domain-specific terms is crucial. For instance, in the telecom industry, terms like "customer churn" are common. One of our clients is in the retail industry, so for that specific case, phrases like "inventory turnover" were important for the model to learn. By integrating these domain-specific terms into the model's training, we've enabled it to interpret and generate precise queries that align with industry-specific language. (1 minute 45 seconds)
  • #28: Building upon the fine-tuning process, I'd like to delve into how we crafted a custom synthetic dataset to train our model more effectively.
  • #29: We utilized advanced language models like ChatGPT and Anthropic to generate a comprehensive set of question-query pairs tailored to our client's data warehouse. By leveraging these LLMs, we were able to simulate realistic and relevant questions that the model would encounter in actual use.
  • #30: In building a robust training set, we created and validated 100 extremely complex queries. These queries were designed to cover the most challenging aspects of the client's data schema and anticipated user inquiries. Additionally, we developed another 200 less complex queries to ensure the model could handle a wide spectrum of query difficulties, from simple look-ups to some more complex data manipulations.
  • #31: Here we can see an example of a complex query. It’s a query that would be quite painful to write by hand. The user question was to show the highest increase in average order value from last quarter to the current one, and to compare it to the company average.
  • #33: To address natural language ambiguities, we prompted the LLMs to paraphrase the questions in our dataset. This expansion through paraphrasing enriched the dataset's diversity, enabling the model to understand and interpret different phrasings and linguistic variations of similar questions. It ensures that whether a user asks, "Show me the sales figures for last quarter," or "What were our earnings in the previous quarter?" the model can generate the correct SQL query.
  • #34: The impact on model performance has been substantial. By training on this expansive and varied dataset, the model now demonstrates improved accuracy and relevance in query results. It can handle varied expressions and complex queries more effectively, reducing errors and increasing user confidence in the system. In conclusion, the combination of fine-tuning Llama 3.1 and crafting a custom synthetic dataset has significantly enhanced our text-to-SQL capabilities. It allows for higher precision, adaptability to different industries, and a more intuitive user experience. We're confident that this approach will provide substantial value to our client by streamlining data retrieval and analysis. (1:30-2:00 minutes)
  • #37:
    Graph Databases for Schema Linking:
    Explanation: Graph databases, unlike traditional relational databases, store data in a graph structure, making it easier to represent and traverse relationships. Leveraging graph databases for schema linking helps Text-to-SQL systems understand complex relationships, improving their ability to generate accurate SQL in queries involving multiple tables.
    Impact: This approach enables better performance in complex queries, where understanding relationships between various data points is crucial, such as in recommendations or customer relationship analysis.
    Multi-Agent Systems for Task Specialization:
    Explanation: Multi-Agent Systems (MAS) use specialized agents to handle different parts of a task. Applying this concept to Text-to-SQL enables specific agents to focus on aspects like natural language processing, query optimization, or context tracking, allowing each agent to excel in its area.
    Impact: Task specialization leads to improved accuracy and efficiency in query handling. For instance, one agent could focus solely on SQL generation while another handles context continuity, ensuring smoother, more accurate results.
    Adaptability to Dynamic Schema Updates:
    Explanation: Adapting to schema changes dynamically allows Text-to-SQL systems to remain functional and accurate, even as underlying data structures evolve. This capability is essential for organizations where databases frequently change due to new data sources or evolving business needs.
    Impact: With this adaptability, models can quickly respond to new fields or table adjustments without extensive retraining, making them more resilient and reliable in real-world applications.
    Ongoing Model Tuning for Diverse Datasets:
    Explanation: Different datasets present unique challenges, and tuning models to handle these variations is critical. This continuous refinement process ensures the system remains accurate across various data sources, whether they're financial, healthcare, or logistics data.
    Impact: Tuning for diverse datasets supports broader applicability, so that companies from any sector can benefit from Text-to-SQL without sacrificing accuracy or relevance.
    Context Management for Complex Queries:
    Explanation: Maintaining context over complex or multi-step queries is a current challenge. Without effective context management, follow-up queries may lose essential information from previous interactions, leading to fragmented results.
    Impact: Improving context management allows for a more seamless user experience, where each query builds naturally on the previous one, especially useful in dynamic fields like customer service or real-time analytics.
    SQL Accuracy and Injection Prevention:
    Explanation: High SQL accuracy is essential, but it must be paired with strong safeguards against injection attacks. Ensuring SQL accuracy without vulnerabilities is crucial for securing data and maintaining trust in the system's outputs.
    Impact: Better accuracy combined with injection prevention not only improves data security but also enhances user trust, making the system reliable for both business intelligence and operational tasks.