15 June 2018
Adding Velocity to BigBench
Todor Ivanov
(todor@dbis.cs.uni-frankfurt.de),
Patrick Bedué, Roberto V. Zicari
Frankfurt Big Data Lab,
Goethe University Frankfurt,
Germany
Ahmad Ghazal
Futurewei Technologies Inc.
Santa Clara, CA, USA
Content
1. Background BigBench
2. Motivation
3. Streaming Extension
4. Proof of Concept
5. Conclusions & Next Steps
BigBench [Ghazal et al. 2013] (presented @SIGMOD 2013)
● End-to-end, technology agnostic, application-level Big Data benchmark.
○ On top of TPC-DS (decision support on retail business)
○ Adding semi-structured and unstructured data.
○ Focus on: Parallel DBMS and MR engines (Hadoop, etc.).
○ Workload: 30 queries
■ Based on big data retail analytics research
■ 11 queries from TPC-DS
● Adopted by TPC as TPCx-BB (http://www.tpc.org/tpcx-bb/). Implementation in HiveQL
and Spark MLlib.
BigBench V2 [Ghazal et al. 2017] (presented @ ICDE 2017)
● BigBench V2 - a major rework of BigBench
○ Independent of TPC-DS, with support for late binding.
● New simplified data model and late binding requirements.
○ Custom made scale factor-based data generator for all components.
● Workload:
○ All 11 TPC-DS queries are replaced with new queries in BigBench V2.
○ New queries with similar business questions - focus on analytics on the
semi-structured web-logs.
● Data model (figure legend):
○ 1-to-many relationships
○ Semi-structured: key-value web-logs
○ Unstructured: product reviews
Motivation
● Growing number of industry scenarios requiring streaming, and of new streaming engines.
● New functionalities combining analytical with streaming features:
○ Spark Structured Streaming
○ Apache Calcite, adopted by Flink SQL, Samza SQL, Drill, etc.
○ Kafka Streaming SQL - KSQL
● Need for standardized end-to-end application benchmarks covering all Big Data characteristics,
including velocity:
○ micro-benchmarks: StreamBench, HiBench, SparkBench
○ application benchmarks: Linear Road, AIM Benchmark, Yahoo Streaming Benchmark,
RIoTBench
→ none of the above benchmarks integrates an end-to-end real-world scenario
implementing a Big Data architecture integrating storage, batch and stream processing
components
Our Requirements
● Create a configurable data stream to simulate multiple scenarios:
○ real-time monitoring and dashboards (refresh interval under 3 seconds)
○ streaming hours of historical data for batch processing
● Create a deterministic data stream to:
○ accurately compare systems under test
○ validate and verify the workload results
● Isolate the stream engine execution as much as possible to avoid any external
influence/bottlenecks, for example by the stream generation.
● Preserve the current BigBench specification, architecture, workload execution and metric.
Streaming Methodology (I)
● Web-logs are key-value pairs representing user clicks (JSON file) [example record shown on the slide].
● Web-sales [example record shown on the slide].
● Web-logs and web-sales are generated in a
session-window manner.
● Sort the entries according to the event timestamp
and create data windows depending on the simulated
scenario.
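The sort-then-window step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field name wl_timestamp (epoch seconds) and the sample records are hypothetical, and the real data generator is not shown in these slides.

```python
import json
from collections import defaultdict

# Hypothetical click records; "wl_timestamp" (epoch seconds) is an assumed field name.
RAW_LINES = [
    '{"wl_customer_id": 7, "wl_item_id": 42, "wl_timestamp": 130}',
    '{"wl_customer_id": 3, "wl_item_id": 17, "wl_timestamp": 5}',
    '{"wl_customer_id": 7, "wl_item_id": 42, "wl_timestamp": 61}',
    '{"wl_customer_id": 9, "wl_item_id": 17, "wl_timestamp": 119}',
]


def build_fixed_windows(lines, window_size):
    """Parse JSON records, sort them by event time, and group them into
    consecutive non-overlapping windows of `window_size` seconds."""
    records = sorted((json.loads(line) for line in lines),
                     key=lambda r: r["wl_timestamp"])
    windows = defaultdict(list)
    for record in records:
        # Integer division maps each timestamp to the start of its window.
        window_start = (record["wl_timestamp"] // window_size) * window_size
        windows[window_start].append(record)
    return dict(windows)


windows = build_fixed_windows(RAW_LINES, window_size=120)
for start, batch in windows.items():
    print(start, [r["wl_timestamp"] for r in batch])
```

Because the records are sorted before grouping, the same input always yields the same windows, which is what a deterministic stream needs.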
Streaming Methodology (II)
● Support for two window types:
● Configurable window parameters:
○ window size (x)
○ window slide (y) (e.g., hourly windows, starting every 30 minutes)
○ total runtime
● Illustrated window types: fixed window, and sliding (hopping) window with x = 2*y.
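How the three parameters interact can be shown with a small Python sketch. The function names and the half-open interval convention are assumptions made for illustration, not part of the BigBench specification.

```python
def sliding_windows(size, slide, total_runtime):
    """Return (start, end) pairs, end exclusive, for windows of length `size`
    that start every `slide` time units and fit within `total_runtime`."""
    if size <= 0 or slide <= 0:
        raise ValueError("size and slide must be positive")
    windows = []
    start = 0
    while start + size <= total_runtime:
        windows.append((start, start + size))
        start += slide
    return windows


def windows_for_event(timestamp, windows):
    """All windows whose half-open interval [start, end) contains the event."""
    return [w for w in windows if w[0] <= timestamp < w[1]]


# Hourly windows starting every 30 minutes over a 3-hour run (units: minutes).
hopping = sliding_windows(size=60, slide=30, total_runtime=180)
# A fixed window is the special case slide == size.
fixed = sliding_windows(size=60, slide=60, total_runtime=180)

print(hopping)
print(windows_for_event(45, hopping))
```

With x = 2*y every event away from the stream edges falls into exactly two windows, while in the fixed case it falls into exactly one.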
Design Overview
● Adding 3 new components:
○ Stream Generator
○ Fast-access Layer
○ Stream Processing
● Support for 2 stream execution modes:
○ Active Mode - simulates real-time data streaming (windows in the range of seconds)
○ Passive Mode - simulates data ingestion and transformation with micro-batch
processing (windows in the range of hours)
Active and Passive Streaming Modes
● Active mode: parallel execution of the data stream generation and the actual stream
processing.
● Passive mode: sequential execution of data stream generation and the actual stream
processing.
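The difference between the two modes can be sketched with a producer and a consumer sharing a buffer. This is a toy model: the in-process queue stands in for the fast-access layer, and the doubling step is a placeholder for a real query; neither reflects the actual Spark or Kafka implementation.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def generate(records, buffer):
    """Stream generator: push records into the fast-access layer."""
    for record in records:
        buffer.put(record)
    buffer.put(SENTINEL)


def process(buffer, results):
    """Stream processing: consume records until the sentinel arrives."""
    while True:
        record = buffer.get()
        if record is SENTINEL:
            break
        results.append(record * 2)  # placeholder for a real query


def run_passive(records):
    """Passive mode: generation finishes before processing starts."""
    buffer, results = queue.Queue(), []
    generate(records, buffer)
    process(buffer, results)
    return results


def run_active(records):
    """Active mode: generation and processing run in parallel."""
    buffer, results = queue.Queue(), []
    producer = threading.Thread(target=generate, args=(records, buffer))
    consumer = threading.Thread(target=process, args=(buffer, results))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return results


print(run_passive([1, 2, 3]), run_active([1, 2, 3]))
```

Both modes produce the same results for the same input; they differ only in whether the generator and the processor overlap in time.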
Workloads
● The streaming workload consists of five queries executed periodically on a stream
of data (web-logs and web-sales), covering simple aggregation and pattern
detection operations:
○ QS1: Find the 10 most browsed products in the last 120 seconds.
○ QS2: Find the 5 most browsed products that are not purchased across all users (or specific
user) in the last 120 seconds.
○ QS3: Find the top ten pages visited by all users (or specific user) in the last 120 seconds.
○ QS4: Show the number of unique visitors in the last 120 seconds.
○ QS5: Show the sold products (of a certain type or category) in the last 120 seconds.
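To make the "last 120 seconds" semantics concrete, here is a plain-Python sketch of QS1 and QS4 over a list of click events. The "ts" field and the sample events are hypothetical; the column names wl_item_id and wl_customer_id follow the queries in the backup slides.

```python
from collections import Counter

WINDOW_SECONDS = 120


def last_window(events, now, window=WINDOW_SECONDS):
    """Events whose timestamp lies in the half-open interval (now - window, now]."""
    return [e for e in events if now - window < e["ts"] <= now]


def most_browsed(events, now, k=10):
    """QS1-style: top-k item ids by number of clicks in the last window."""
    counts = Counter(e["wl_item_id"] for e in last_window(events, now)
                     if e.get("wl_item_id") is not None)
    return counts.most_common(k)


def unique_visitors(events, now):
    """QS4-style: number of distinct customer ids in the last window."""
    return len({e["wl_customer_id"] for e in last_window(events, now)
                if e.get("wl_customer_id") is not None})


# Hypothetical events; "ts" is an assumed event-time field in seconds.
events = [
    {"ts": 10, "wl_customer_id": 1, "wl_item_id": 5},
    {"ts": 105, "wl_customer_id": 2, "wl_item_id": 5},
    {"ts": 150, "wl_customer_id": 2, "wl_item_id": 8},
    {"ts": 200, "wl_customer_id": 3, "wl_item_id": 5},
    {"ts": 210, "wl_customer_id": 3, "wl_item_id": None},
]

print(most_browsed(events, now=220))
print(unique_visitors(events, now=220))
```

The event at ts = 10 is outside the window ending at 220 and is ignored, and records with a missing item id are skipped, mirroring the IS NOT NULL filters of the HiveQL versions.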
Metrics & Result Validation
● Execution time is the time between start and end of the query execution against the
streaming data.
● End-to-end streaming execution time (Latency) - starting from the Stream Generator and
stopping at the point where the data result is produced.
● Result validation is tied to the scale factor, as in the current BigBench validation (SF1):
1. Persistently store the results of every query execution over a streaming window.
2. Compare the results against the golden result once the benchmark run is finished.
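A minimal sketch of the measure-then-validate flow follows. It assumes results are keyed by a window id and keeps them in a dictionary for brevity, where a real run would write them to persistent storage. The query, windows and golden values are invented for illustration.

```python
import time


def timed(query, window):
    """Run `query` on one window and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = query(window)
    return result, time.perf_counter() - start


def validate(stored, golden):
    """Return the window ids whose stored result differs from the golden result."""
    return [wid for wid, expected in golden.items() if stored.get(wid) != expected]


def count_clicks(window):
    """Stand-in query: number of records in the window."""
    return len(window)


# Hypothetical windows and golden results for a tiny deterministic stream.
stream_windows = {0: ["a", "b", "c"], 1: ["d"]}
golden_results = {0: 3, 1: 1}

stored_results, execution_times = {}, {}
for window_id, window in stream_windows.items():
    stored_results[window_id], execution_times[window_id] = timed(count_clicks, window)

# Validation happens only after the run has finished.
mismatches = validate(stored_results, golden_results)
print("mismatching windows:", mismatches)
```

Deferring the comparison to the end keeps validation work out of the measured execution time.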
Proof of Concept Implementation
Passive Mode Components:
● Stream Generator in Spark
● Persistent Storage Layer in HDFS
● Fast-access Layer as In-memory Buffer
● Stream Processing in Spark Streaming
Active Mode Components:
● Stream Generator in Spark
● Persistent Storage Layer in HDFS
● Fast-access Layer in Kafka
● Stream Processing in Spark Streaming
Conclusion
● We present a stream processing extension of the BigBench benchmark.
● Our approach proposes configurable active and passive streaming modes in order to
cover the different streaming requirements (ranging from seconds to hours).
● It supports fixed and sliding window streaming to better address the common data
streaming use cases.
Next Steps
● New implementation on Spark Structured Streaming replacing Spark Streaming.
● Adding other engines such as Flink and Samza.
● Extending the coverage of the stream SQL operators (new workloads) including
clustering, pattern detection and machine learning.
● Support for:
○ sliding windows in active mode
○ out-of-order record processing within and outside of a window
○ parallel query execution
● Validation experiments on a large-scale cluster with different active and passive mode
architectures.
Acknowledgments. This work has been partially funded by the European Commission H2020
project DataBench - Evidence Based Big Data Benchmarking to Improve Business Performance,
under project No. 780966. This work expresses the opinions of the authors and not necessarily
those of the European Commission. The European Commission is not liable for any use that may
be made of the information contained in this work. The authors thank all the participants in the
project for discussions and common work.
www.databench.eu
Thank you for your attention!
References
[Ghazal et al. 2013] Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess,
Alain Crolotte, and Hans-Arno Jacobsen. 2013. BigBench: Towards An Industry Standard
Benchmark for Big Data Analytics. In SIGMOD 2013. 1197–1208.
[Ghazal et al. 2017] Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan
Voong, Mohammed Al-Kateb, Waleed Ghazal, and Roberto V. Zicari. 2017. BigBench V2:
The New and Improved BigBench. In ICDE 2017, San Diego, CA, USA, April 19-22.
Backup Slides
QS1
(HiveQL Q5 in BigBench V2)
Find the 10 most browsed products in the last 120 seconds.
SELECT wl_item_id, COUNT(wl_item_id) as cnt
FROM web_logs
WHERE wl_item_id IS NOT NULL
GROUP BY wl_item_id
ORDER BY cnt DESC LIMIT 10;
QS2
(HiveQL Q6 in BigBench V2)
Find the 5 most browsed products that are not purchased across all users (or specific user) in
the last 120 seconds.
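-- Browsed products with their click counts
SELECT wl_item_id AS br_id, COUNT(wl_item_id) AS br_count
FROM web_logs
WHERE wl_item_id IS NOT NULL
GROUP BY wl_item_id;
-- Spark: register the result of the query above as a temporary view
view_browsed.createOrReplaceTempView("browsed");
-- Purchased products (ws_* columns come from web_sales, as in QS5)
SELECT ws_product_id AS pu_id
FROM web_sales
WHERE ws_product_id IS NOT NULL
GROUP BY ws_product_id;
-- Spark: register the result of the query above as a temporary view
view_purchased.createOrReplaceTempView("purchased");
-- Anti-join: browsed products without a purchase, most browsed first
SELECT br_id, br_count
FROM browsed LEFT JOIN purchased ON browsed.br_id = purchased.pu_id
WHERE purchased.pu_id IS NULL
ORDER BY br_count DESC LIMIT 5;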
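The anti-join logic of QS2 can be tried out on toy data with SQLite from Python's standard library. This is a sketch under assumptions: the two temporary views are written as common table expressions, purchases are read from web_sales (as QS5 does), the result is ordered by browse count to obtain the "most browsed" items, and the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE web_logs (wl_item_id INTEGER);
    CREATE TABLE web_sales (ws_product_id INTEGER);
    -- Item 1 is browsed three times, item 2 twice, item 3 once.
    INSERT INTO web_logs VALUES (1), (1), (1), (2), (2), (3), (NULL);
    -- Only item 1 is purchased.
    INSERT INTO web_sales VALUES (1), (NULL);
""")

QS2 = """
    WITH browsed AS (
        SELECT wl_item_id AS br_id, COUNT(wl_item_id) AS br_count
        FROM web_logs
        WHERE wl_item_id IS NOT NULL
        GROUP BY wl_item_id
    ),
    purchased AS (
        SELECT DISTINCT ws_product_id AS pu_id
        FROM web_sales
        WHERE ws_product_id IS NOT NULL
    )
    SELECT br_id, br_count
    FROM browsed LEFT JOIN purchased ON browsed.br_id = purchased.pu_id
    WHERE purchased.pu_id IS NULL
    ORDER BY br_count DESC
    LIMIT 5
"""

rows = conn.execute(QS2).fetchall()
print(rows)
```

Item 1 is the most browsed product overall, but it is dropped because it has a matching purchase; only items 2 and 3 remain.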
QS3
(HiveQL Q16 in BigBench V2)
Find the top ten pages visited by all users (or specific user) in the last
120 seconds.
SELECT wl_webpage_name, COUNT(wl_webpage_name) AS cnt
FROM web_logs
WHERE wl_webpage_name IS NOT NULL
GROUP BY wl_webpage_name
ORDER BY cnt DESC LIMIT 10;
QS4
(HiveQL Q22 in BigBench V2)
Show the number of unique visitors in the last 120 seconds.
-- A single aggregate row is returned, so no ORDER BY or LIMIT is needed
SELECT COUNT(DISTINCT wl_customer_id) AS uniqueVisitors
FROM web_logs
WHERE wl_customer_id IS NOT NULL;
QS5
HiveQL
Show the sold products (of a certain type or category) in the last 120
seconds.
SELECT ws_product_id, COUNT(ws_product_id)
FROM web_sales
WHERE ws_product_id IS NOT NULL
GROUP BY ws_product_id
ORDER BY COUNT(ws_product_id) DESC LIMIT 10;
23
Ad

More Related Content

Similar to Adding Velocity to BigBench (20)

WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the art
Stavros Kontopoulos
 
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018 Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
DataBench
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
Márton Kodok
 
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
Databricks
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
Yaroslav Tkachenko
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
Márton Kodok
 
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Big Data Spain
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
DataWorks Summit
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
Itai Yaffe
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
Ido Green
 
DevTalks Keynote Powering interactive data analysis with Google BigQuery
DevTalks Keynote Powering interactive data analysis with Google BigQueryDevTalks Keynote Powering interactive data analysis with Google BigQuery
DevTalks Keynote Powering interactive data analysis with Google BigQuery
Márton Kodok
 
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
DataBench
 
Building the DataBench Workflow and Architecture
Building the DataBench Workflow and ArchitectureBuilding the DataBench Workflow and Architecture
Building the DataBench Workflow and Architecture
t_ivanov
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle
 
Understanding Business APIs through statistics
Understanding Business APIs through statisticsUnderstanding Business APIs through statistics
Understanding Business APIs through statistics
WSO2
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
Karthik Murugesan
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
markgrover
 
Reducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case StudyReducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case Study
Venkata Pingali
 
An introduction to the WSO2 Analytics Platform
An introduction to the WSO2 Analytics Platform   An introduction to the WSO2 Analytics Platform
An introduction to the WSO2 Analytics Platform
Sriskandarajah Suhothayan
 
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the art
Stavros Kontopoulos
 
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018 Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
DataBench
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
Márton Kodok
 
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
Databricks
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
Yaroslav Tkachenko
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
Márton Kodok
 
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Big Data Spain
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
DataWorks Summit
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
Itai Yaffe
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
Ido Green
 
DevTalks Keynote Powering interactive data analysis with Google BigQuery
DevTalks Keynote Powering interactive data analysis with Google BigQueryDevTalks Keynote Powering interactive data analysis with Google BigQuery
DevTalks Keynote Powering interactive data analysis with Google BigQuery
Márton Kodok
 
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
DataBench
 
Building the DataBench Workflow and Architecture
Building the DataBench Workflow and ArchitectureBuilding the DataBench Workflow and Architecture
Building the DataBench Workflow and Architecture
t_ivanov
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle
 
Understanding Business APIs through statistics
Understanding Business APIs through statisticsUnderstanding Business APIs through statistics
Understanding Business APIs through statistics
WSO2
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
Karthik Murugesan
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
markgrover
 
Reducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case StudyReducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case Study
Venkata Pingali
 
An introduction to the WSO2 Analytics Platform
An introduction to the WSO2 Analytics Platform   An introduction to the WSO2 Analytics Platform
An introduction to the WSO2 Analytics Platform
Sriskandarajah Suhothayan
 

More from t_ivanov (7)

CoreBigBench: Benchmarking Big Data Core Operations
CoreBigBench: Benchmarking Big Data Core OperationsCoreBigBench: Benchmarking Big Data Core Operations
CoreBigBench: Benchmarking Big Data Core Operations
t_ivanov
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
t_ivanov
 
ABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack BenchmarkABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack Benchmark
t_ivanov
 
Lessons Learned on Benchmarking Big Data Platforms
Lessons Learned on Benchmarking  Big Data PlatformsLessons Learned on Benchmarking  Big Data Platforms
Lessons Learned on Benchmarking Big Data Platforms
t_ivanov
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
t_ivanov
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
t_ivanov
 
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
t_ivanov
 
CoreBigBench: Benchmarking Big Data Core Operations
CoreBigBench: Benchmarking Big Data Core OperationsCoreBigBench: Benchmarking Big Data Core Operations
CoreBigBench: Benchmarking Big Data Core Operations
t_ivanov
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
t_ivanov
 
ABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack BenchmarkABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack Benchmark
t_ivanov
 
Lessons Learned on Benchmarking Big Data Platforms
Lessons Learned on Benchmarking  Big Data PlatformsLessons Learned on Benchmarking  Big Data Platforms
Lessons Learned on Benchmarking Big Data Platforms
t_ivanov
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
t_ivanov
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
t_ivanov
 
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
t_ivanov
 
Ad

Recently uploaded (20)

Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
Andre Hora
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Download YouTube By Click 2025 Free Full Activated

Adding Velocity to BigBench

  • 1. 15 June 2018. Adding Velocity to BigBench.
    Todor Ivanov (todor@dbis.cs.uni-frankfurt.de), Patrick Bedué, Roberto V. Zicari: Frankfurt Big Data Lab, Goethe University Frankfurt, Germany
    Ahmad Ghazal: Futurewei Technologies Inc., Santa Clara, CA, USA
  • 2. Content
    1. Background BigBench
    2. Motivation
    3. Streaming Extension
    4. Proof of Concept
    5. Conclusions & Next Steps
  • 3. BigBench [Ghazal et al. 2013] (presented @ SIGMOD 2013)
    ● End-to-end, technology-agnostic, application-level Big Data benchmark.
      ○ On top of TPC-DS (decision support on retail business)
      ○ Adding semi-structured and unstructured data.
      ○ Focus on: Parallel DBMS and MR engines (Hadoop, etc.).
      ○ Workload: 30 queries
        ■ Based on big data retail analytics research
        ■ 11 queries from TPC-DS
    ● Adopted by TPC as TPCx-BB (http://www.tpc.org/tpcx-bb/). Implementation in HiveQL and Spark MLlib.
  • 4. BigBench V2 [Ghazal et al. 2017] (presented @ ICDE 2017)
    ● BigBench V2 - a major rework of BigBench
      ○ Separate from TPC-DS and takes care of late binding.
    ● New simplified data model and late binding requirements.
      ○ Custom-made, scale factor-based data generator for all components.
    ● Workload:
      ○ All 11 TPC-DS queries are replaced with new queries in BigBench V2.
      ○ New queries with similar business questions - focus on analytics on the semi-structured web-logs.
    ● Data model (figure labels): 1-to-many relationships; semi-structured: key-value WebLog; unstructured: Product Reviews
  • 5. Motivation
    ● Growing number of industry scenarios requiring streaming and new streaming engines.
    ● New functionalities combining analytical with streaming features:
      ○ Spark Structured Streaming
      ○ Calcite adopted by Flink SQL, Samza SQL, Drill, etc.
      ○ Kafka Streaming SQL - KSQL
    ● Need for standardized end-to-end application benchmarks covering all Big Data characteristics including velocity:
      ○ micro-benchmarks: StreamBench, HiBench, SparkBench
      ○ application benchmarks: Linear Road, AIM Benchmark, Yahoo Streaming Benchmark, RIoTBench
    → None of the above benchmarks integrates an end-to-end real-world scenario implementing a Big Data architecture that combines storage, batch and stream processing components.
  • 6. Our Requirements
    ● Create a configurable data stream to simulate multiple scenarios:
      ○ real-time monitoring and dashboards (refresh rate of less than 3 seconds)
      ○ streaming hours of history data for batch processing
    ● Create a deterministic data stream to:
      ○ accurately compare systems under test
      ○ validate and verify the workload results
    ● Isolate the stream engine execution as much as possible to avoid any external influence or bottlenecks, for example from the stream generation.
    ● Preserve the current BigBench specification, architecture, workload execution and metric.
  • 7. Streaming Methodology (I)
    ● Web-logs are key-value pairs representing user clicks (JSON file); the slide shows an example record.
    ● Web-sales: the slide shows an example record.
    ● Web-logs and web-sales are generated in a session-window manner.
    ● Sort the entries according to the event timestamp and create data windows depending on the simulated scenario.
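The sorting step on this slide can be sketched in a few lines of Python. This is only an illustration, not the benchmark's generator: the record contents are made up, and the key `wl_timestamp` is an assumed name that follows the `wl_` prefix used in the queries later in the deck.

```python
import json

# Hypothetical web-log lines. Records are key-value pairs, so not every
# record carries every key. wl_timestamp is an assumed key name.
RAW_LINES = [
    '{"wl_customer_id": 7, "wl_item_id": 42, "wl_timestamp": "2018-06-15 10:00:05"}',
    '{"wl_customer_id": 3, "wl_webpage_name": "page01", "wl_timestamp": "2018-06-15 10:00:01"}',
    '{"wl_customer_id": 7, "wl_item_id": 13, "wl_timestamp": "2018-06-15 10:00:03"}',
]


def load_sorted(lines):
    """Parse JSON key-value records and order them by event timestamp."""
    records = [json.loads(line) for line in lines]
    # Fixed-width "YYYY-MM-DD HH:MM:SS" strings sort chronologically as text.
    return sorted(records, key=lambda r: r["wl_timestamp"])


events = load_sorted(RAW_LINES)
```

Once the entries are ordered by event time, they can be cut into windows for whichever scenario is being simulated.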
  • 8. Streaming Methodology (II)
    ● Support for two window types:
      ○ Fixed Window
      ○ Sliding (Hopping) Window (illustrated with x = 2*y)
    ● Configurable window parameters:
      ○ window size (x)
      ○ window slide (y) (e.g., hourly windows, starting every 30 minutes)
      ○ total runtime
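A minimal sketch of how the three window parameters could interact, assuming times are expressed in seconds from the start of the run and that only windows fitting completely inside the total runtime are produced. The function names are hypothetical and not part of BigBench.

```python
def windows(total_runtime, size, slide):
    """Return [start, end) window bounds; slide == size yields fixed windows."""
    if size <= 0 or slide <= 0:
        raise ValueError("size and slide must be positive")
    bounds = []
    start = 0
    while start + size <= total_runtime:
        bounds.append((start, start + size))
        start += slide
    return bounds


def assign(event_times, bounds):
    """Map each window to the event times that fall inside it."""
    return {w: [t for t in event_times if w[0] <= t < w[1]] for w in bounds}


# Two-hour run with hourly windows: fixed (y = x) versus hopping (x = 2*y).
fixed = windows(total_runtime=7200, size=3600, slide=3600)
hopping = windows(total_runtime=7200, size=3600, slide=1800)
per_window = assign([100, 2000, 4000], hopping)
```

With a hopping window an event can belong to more than one window, which is why the event at 2000 seconds appears in both of the first two hopping windows.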
  • 9. Design Overview
    ● Adding 3 new components:
      ○ Stream Generator
      ○ Fast-access Layer
      ○ Stream Processing
    ● Support for 2 stream execution modes:
      ○ Active Mode - simulate real-time data streaming (in second ranges)
      ○ Passive Mode - simulate data ingestion and transformation on micro-batch processing (in hour ranges)
  • 10. Active and Passive Streaming Modes
    ● Active mode: parallel execution of the data stream generation and the actual stream processing.
    ● Passive mode: sequential execution of data stream generation and the actual stream processing.
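The difference between the two modes can be illustrated with plain Python threads. This is a toy sketch under stated assumptions: `generate` is a hypothetical deterministic generator, and an in-process `queue.Queue` stands in for the fast-access layer. It does not reflect how the proof of concept is built.

```python
import queue
import threading


def generate(n):
    """Hypothetical deterministic stream generator: n numbered events."""
    for i in range(n):
        yield {"event_id": i}


def run_passive(n):
    """Passive mode: generation finishes before processing starts."""
    buffered = list(generate(n))              # step 1: generate everything
    return [e["event_id"] for e in buffered]  # step 2: process the buffer


def run_active(n):
    """Active mode: generation and processing run in parallel."""
    q = queue.Queue()   # stand-in for the fast-access layer
    done = object()     # sentinel marking the end of the stream
    processed = []

    def producer():
        for e in generate(n):
            q.put(e)
        q.put(done)

    def consumer():
        while True:
            e = q.get()
            if e is done:
                break
            processed.append(e["event_id"])

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Because the generator is deterministic and the queue is first-in first-out with one producer and one consumer, both modes see the events in the same order, which is what makes results comparable across systems under test.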
  • 11. Workloads
    ● The streaming workload consists of five queries executed periodically on a stream of data (web-logs and web-sales), covering simple aggregation and pattern detection operations:
      ○ QS1: Find the 10 most browsed products in the last 120 seconds.
      ○ QS2: Find the 5 most browsed products that are not purchased across all users (or specific user) in the last 120 seconds.
      ○ QS3: Find the top ten pages visited by all users (or specific user) in the last 120 seconds.
      ○ QS4: Show the number of unique visitors in the last 120 seconds.
      ○ QS5: Show the sold products (of a certain type or category) in the last 120 seconds.
  • 12. Metrics & Result Validation
    ● Execution time is the time between start and end of the query execution against the streaming data.
    ● End-to-end streaming execution time (Latency) - starting from the Stream Generator and stopping at the point where the data result is produced.
    ● Result validation based on scale factor similar to current BigBench validation (SF1):
      1. Store persistently the results of every query execution over a streaming window.
      2. Compare the results against the golden result once the benchmark run is finished.
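The two validation steps and the per-query execution time can be sketched as follows. Everything here is an assumption made for illustration: an in-memory dictionary replaces persistent storage, `unique_visitors` is a stand-in for QS4, and the window contents and golden values are invented.

```python
import time


def timed(query, window):
    """Run a query on one window; return (result, execution time in seconds)."""
    start = time.perf_counter()
    result = query(window)
    return result, time.perf_counter() - start


def validate(stored, golden):
    """Return the ids of windows whose stored result differs from the golden one."""
    return sorted(w for w in golden if stored.get(w) != golden[w])


def unique_visitors(window):
    """Stand-in for QS4: number of distinct customers in the window."""
    return len({e["wl_customer_id"] for e in window})


# Hypothetical windows keyed by window id.
windows_data = {
    0: [{"wl_customer_id": 1}, {"wl_customer_id": 1}, {"wl_customer_id": 2}],
    1: [{"wl_customer_id": 3}],
}

stored, timings = {}, {}
for wid, events in windows_data.items():
    # Step 1: keep the result of every query execution over a window.
    stored[wid], timings[wid] = timed(unique_visitors, events)

# Step 2: compare against the golden result after the run has finished.
golden = {0: 2, 1: 1}
mismatches = validate(stored, golden)
```

Deferring the comparison until the run is over keeps validation out of the measured execution time.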
  • 13. Proof of Concept Implementation
    Passive Mode Components:
    ● Stream Generator in Spark
    ● Persistent Storage Layer in HDFS
    ● Fast-access Layer as In-memory Buffer
    ● Stream Processing in Spark Streaming
    Active Mode Components:
    ● Stream Generator in Spark
    ● Persistent Storage Layer in HDFS
    ● Fast-access Layer in Kafka
    ● Stream Processing in Spark Streaming
  • 14. Conclusion
    ● We present a stream processing extension of the BigBench benchmark.
    ● Our approach proposes configurable active and passive streaming modes in order to cover the different streaming requirements (ranging from seconds to hours).
    ● It supports fixed and sliding window streaming to better address the common data streaming use cases.
  • 15. Next Steps
    ● New implementation on Spark Structured Streaming replacing Spark Streaming.
    ● Adding other engines such as Flink and Samza.
    ● Extending the coverage of the stream SQL operators (new workloads) including clustering, pattern detection and machine learning.
    ● Support for:
      ○ sliding windows in active mode
      ○ out-of-order record processing within and outside of a window
      ○ parallel query execution
    ● Validation experiments on a large-scale cluster with different active and passive mode architectures.
  • 16. Acknowledgments
    This work has been partially funded by the European Commission H2020 project DataBench - Evidence Based Big Data Benchmarking to Improve Business Performance, under project No. 780966. This work expresses the opinions of the authors and not necessarily those of the European Commission. The European Commission is not liable for any use that may be made of the information contained in this work. The authors thank all the participants in the project for discussions and common work.
    www.databench.eu
    Thank you for your attention!
  • 17. References
    [Ghazal et al. 2013] Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, and Hans-Arno Jacobsen. 2013. BigBench: Towards An Industry Standard Benchmark for Big Data Analytics. In SIGMOD 2013. 1197–1208.
    [Ghazal et al. 2017] Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan Voong, Mohammed Al-Kateb, Waleed Ghazal, and Roberto V. Zicari. 2017. BigBench V2: The New and Improved BigBench. In ICDE 2017, San Diego, CA, USA, April 19-22.
  • 19. QS1 (HiveQL Q5 in BigBench V2)
    Find the 10 most browsed products in the last 120 seconds.
      SELECT wl_item_id, COUNT(wl_item_id) AS cnt
      FROM web_logs
      WHERE wl_item_id IS NOT NULL
      GROUP BY wl_item_id
      ORDER BY cnt DESC
      LIMIT 10;
  • 20. QS2 (HiveQL Q6 in BigBench V2)
    Find the 5 most browsed products that are not purchased across all users (or specific user) in the last 120 seconds.
      SELECT wl_item_id AS br_id, COUNT(wl_item_id) AS br_count
      FROM web_logs
      WHERE wl_item_id IS NOT NULL
      GROUP BY wl_item_id;
      view_browsed.createOrReplaceTempView("browsed");

      SELECT ws_product_id AS pu_id
      FROM web_logs
      WHERE ws_product_id IS NOT NULL
      GROUP BY ws_product_id;
      view_purchased.createOrReplaceTempView("purchased");

      SELECT br_id, COUNT(br_id)
      FROM browsed LEFT JOIN purchased ON browsed.br_id = purchased.pu_id
      WHERE purchased.pu_id IS NULL
      GROUP BY browsed.br_id
      LIMIT 5;
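The browsed-but-not-purchased pattern of QS2 can be tried out locally with Python's built-in `sqlite3` module. This sketch departs from the slide in three ways, all of which are assumptions rather than the authors' implementation: SQLite temporary views replace the Spark `createOrReplaceTempView` calls, purchases are read from `web_sales` (the table QS5 uses for `ws_product_id`), and the final query orders by the browse count so that the five rows returned are the most browsed ones. The sample rows are invented and represent the contents of a single window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE web_logs (wl_item_id INTEGER);
CREATE TABLE web_sales (ws_product_id INTEGER);

-- Hypothetical contents of one 120-second window.
INSERT INTO web_logs VALUES (1), (1), (1), (2), (2), (3), (NULL);
INSERT INTO web_sales VALUES (1), (NULL);

-- Browse count per product.
CREATE TEMP VIEW browsed AS
  SELECT wl_item_id AS br_id, COUNT(wl_item_id) AS br_count
  FROM web_logs
  WHERE wl_item_id IS NOT NULL
  GROUP BY wl_item_id;

-- Distinct purchased products (assumed to come from web_sales).
CREATE TEMP VIEW purchased AS
  SELECT ws_product_id AS pu_id
  FROM web_sales
  WHERE ws_product_id IS NOT NULL
  GROUP BY ws_product_id;
""")

# Anti-join: keep browsed products that have no matching purchase.
rows = conn.execute("""
SELECT br_id, br_count
FROM browsed LEFT JOIN purchased ON browsed.br_id = purchased.pu_id
WHERE purchased.pu_id IS NULL
ORDER BY br_count DESC
LIMIT 5
""").fetchall()
```

Product 1 is the most browsed item in the sample data but is dropped because it also appears in the sales table, leaving products 2 and 3 in descending order of browse count.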
  • 21. QS3 (HiveQL Q16 in BigBench V2)
    Find the top ten pages visited by all users (or specific user) in the last 120 seconds.
      SELECT wl_webpage_name, COUNT(wl_webpage_name) AS cnt
      FROM web_logs
      WHERE wl_webpage_name IS NOT NULL
      GROUP BY wl_webpage_name
      ORDER BY cnt DESC
      LIMIT 10;
  • 22. QS4 (HiveQL Q22 in BigBench V2)
    Show the number of unique visitors in the last 120 seconds.
      SELECT COUNT(DISTINCT wl_customer_id) AS uniqueVisitors
      FROM web_logs
      WHERE wl_customer_id IS NOT NULL
      ORDER BY uniqueVisitors DESC
      LIMIT 10;
  • 23. QS5 (HiveQL)
    Show the sold products (of a certain type or category) in the last 120 seconds.
      SELECT ws_product_id, COUNT(ws_product_id)
      FROM web_sales
      WHERE ws_product_id IS NOT NULL
      GROUP BY ws_product_id
      ORDER BY COUNT(ws_product_id) DESC
      LIMIT 10;