PayPal Merchant ecosystem using Spark, Hive, Druid, HBase & Elasticsearch
Who we are

Deepika Khera
• Big Data technologist for over a decade
• Focused on building scalable platforms with the Hadoop ecosystem: MapReduce, HBase, Spark, Elasticsearch, Druid
• Senior Engineering Manager, Merchant Analytics at PayPal
• Contributed to Druid for the Spark Streaming integration

Kasi Natarajan
• 15+ years of industry experience
• Spark Engineer, PayPal Merchant Analytics
• Building solutions using Apache Spark, Scala, Hive, HBase, Druid and Spark ML
• Passionate about providing analytics at scale from Big Data platforms
Agenda
PayPal Data & Scale
Merchant Use Case Review
Data Pipeline
Learnings - Spark & HBase
Tools & Utilities
Behavioral Driven Development
Data Quality Tool using Spark
BI with Druid & Tableau
PayPal Data & Scale
PayPal is more than a button
• Loyalty
• Faster Conversion
• Reduction in Cart Abandonment
• Credit
• Customer Acquisition
• APV Lift
• Invoicing
• Offers
CBT | Mobile | In-Store | Online
PayPal Datasets
Social Media | Demographics | Marketing Activity | Email | Application Logs | Invoice | Credit | Reversals | Disputes | CBT | Risk | Consumer | Merchants | Partners | Location | Payment Products | Transaction | Spending
The power of our platform

PayPal operates one of the largest PRIVATE CLOUDS in the world*, storing petabytes of data*.
• 42 markets
• 237M active customer accounts**
• 7.6 billion payments in 2017**
• 19M merchants
• ~600 payments/second at peak*

Dedicated to a customer-focused, strong-performance, highly scalable, continuously available PLATFORM.

PayPal has one of the top five Kafka deployments in the world, handling over 200 billion messages per day.

PayPal operates one of the largest Hadoop deployments in the world: a 1,600-node Hadoop cluster with 230TB of memory and 78PB of storage, running 50,000 jobs per day.
Merchant Use Case Review
Use Case Overview

Insights and marketing solutions, served through PAYPAL ANALYTICS.com:
• Help merchants engage their customers with personalized shopping experiences
• Offers & campaigns
• Shopper insights
• Revenue & transaction trends
• Cross-border insights
• Customer shopping segments
• Product performance
• Checkout funnel
• Behavior analysis
• Measuring effectiveness

Merchant Data Platform
1. Fast processing platform crunching multi-terabytes of data
2. Scalable, highly available, low-latency serving platform
Technologies
Processing | Serving | Movement (technology logos shown on the slide)
Merchant Analytics
Merchant Data Platform: pre-aggregated cubes and a denormalized schema backing the analytics
Data Pipeline
Data Pipeline Architecture
Data Sources: PayPal replication, web servers
Data Pipeline: data ingestion into the Data Lake, data processing, data serving (SQL)
Visualization: custom UI
Learnings – Spark & HBase
Design Considerations for Spark

Spark Best Practices Checklist (see the configuration sketch after this list)

Data Serialization
 Use the Kryo serializer with SparkConf; it is faster and more compact
 Tune the Kryo serializer buffer to hold large objects
Garbage Collection
 Tuned the concurrent abortable preclean time from 10 sec to 30 sec to push out stop-the-world GC
Memory Management
 Avoided executors with too much memory
 Used the MEMORY_AND_DISK storage level for caching large datasets
Parallelism
 Optimize the number of cores & partitions*
 Repartition data before persisting to HDFS for better performance in downstream jobs
Action-Transformation
 Minimize shuffles on join() by broadcasting the smaller collection
 Optimize wider transformations as much as possible*
Caching & Persisting
 Clean up cached/persisted collections when they are no longer needed

*Specific examples later
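As a concrete illustration of the serialization items in the checklist, here is a minimal Scala sketch; the buffer sizes and the registered class are assumptions, not the settings used in this deck.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical domain class, used only to show Kryo class registration.
case class MyEvent(id: String, ts: Long)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "1m")        // initial per-object buffer
  .set("spark.kryoserializer.buffer.max", "512m")  // raised so large objects fit
  .registerKryoClasses(Array(classOf[MyEvent]))

val spark = SparkSession.builder().config(conf).appName("merchant-pipeline").getOrCreate()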
Learnings
Spark job failures with fetch exceptions and long shuffle read times

Observations
• Executors spend a long time on shuffle reads, then time out and terminate, resulting in job failure
• Resource constraints on executor nodes cause delays on those nodes

Resolution
To address the memory constraints, tuned
1. the config from 200 executors * 4 cores to 400 executors * 2 cores
2. the executor memory allocation (reduced)
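A hedged sketch of that resizing; the executor memory figure is a placeholder, since the actual reduced value is not given on the slide.

import org.apache.spark.sql.SparkSession

// From 200 executors x 4 cores to 400 executors x 2 cores, with less memory per executor.
val spark = SparkSession.builder()
  .appName("shuffle-heavy-job")
  .config("spark.executor.instances", "400")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "8g")   // placeholder value
  .getOrCreate()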
Learnings
Parallelism for long-running jobs

Observations
• A series of left joins on large datasets causes shuffle exceptions

Resolution
1. Split into small jobs and run them in parallel
2. Faster reprocessing and fail-fast jobs

The time-series data source (7-day, 30-day, 60-day, 90-day and 180-day windows) is joined with the other data sources in five separate jobs (Job1 through Job5), each writing its own partitions, and the results are unioned in Hive. A sketch of this split follows.
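A minimal Scala sketch of that split, assuming a hypothetical events table and a stand-in aggregation (the real queries are not shown in the deck). Launching the per-window actions from Futures lets Spark schedule them concurrently within one application; the per-window outputs are then unioned in Hive.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val windows = Seq(7, 30, 60, 90, 180)            // look-back windows from the slide

// Each window runs as its own fail-fast job; separate Futures trigger separate actions.
val jobs = windows.map { days =>
  Future {
    spark.table("events")                        // hypothetical time-series source
      .where(s"event_date >= date_sub(current_date(), $days)")
      .groupBy("merchant_id").count()            // stand-in for the real joins/aggregations
      .withColumn("window_days", lit(days))
      .write.mode("overwrite")
      .saveAsTable(s"merchant_agg_${days}d")     // per-window output, unioned later in Hive
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)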
Learnings
Tuning between the Spark driver and executors

Observations
• The Spark driver was left with too many executor heartbeat requests to process, even after the job was complete
• YARN kills the Spark job after waiting on the driver to finish processing the heartbeats

Resolution
• The setting spark.executor.heartbeatInterval was set too low; increasing it to 50s fixed the issue
• Allocate more memory to the driver to handle overheads beyond the typical driver processes

(Diagram: executors send heartbeats to the driver; the YARN ResourceManager waits on the driver.)
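A sketch of the interval change; the driver memory figures in the comment are assumptions, only the 50s interval comes from the slide.

import org.apache.spark.sql.SparkSession

// Raise the executor heartbeat interval to 50s as described above.
// spark.executor.heartbeatInterval should stay well below spark.network.timeout.
// Extra driver memory/overhead is normally supplied at submit time, e.g.
//   spark-submit --driver-memory 8g --conf spark.driver.memoryOverhead=2g
val spark = SparkSession.builder()
  .config("spark.executor.heartbeatInterval", "50s")
  .getOrCreate()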
Learnings
Optimize joins for efficient use of cluster resources (memory, CPU, etc.)

Observation
• With the default of 200 shuffle partitions, the join stage was running with too many tasks, causing performance overhead

Resolution
• Reduce the spark.sql.shuffle.partitions setting to a lower threshold

(Diagram: read Table 1 and Table 2, start the join, process.)
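For example, the shuffle-partition count can be lowered at session level before the join runs; the value and table names below are placeholders, not the deck's actual threshold.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Default is 200; the slide's fix is to drop it to a value sized for the data,
// e.g. on the order of the total executor cores. 64 here is only a placeholder.
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Hypothetical tables and join key, for illustration only.
val joined = spark.table("table1").join(spark.table("table2"), Seq("key"))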
Learnings
Optimize wide transformations

Observation
• Results of the sub-joins were being sent back to the driver, causing poor performance

Resolution
• Convert expensive left joins into a combination of a lightweight join and except/union

Example 1: Left outer join (1 billion rows left-joined with 7 billion rows), filtered into "T2 is NOT NULL" and "T2 is NULL" branches. Rewritten as an inner join of T1 and T2 for the "T2 is NOT NULL" branch and an except for the "T2 is NULL" branch.

Example 2: Left outer join with OR operators (25 million rows joined with 25 million rows) on T1.C1 = T2.C1 OR T1.C2 = T2.C2. Rewritten as the union of two single-condition left joins, one on T1.C1 = T2.C1 and one on T1.C2 = T2.C2.

A DataFrame sketch of both rewrites follows.
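A minimal Scala/DataFrame sketch of the two rewrites; the table and column names are hypothetical stand-ins for the slide's T1, T2, C1, C2.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val t1 = spark.table("t1")   // hypothetical source tables
val t2 = spark.table("t2")

// (1) Left join followed by "T2 is [not] null" filters, rewritten as inner join + except.
val matched   = t1.join(t2, Seq("c1"))                      // the "T2 is NOT NULL" branch
val unmatched = t1.select("c1").except(t2.select("c1"))     // the "T2 is NULL" branch

// (2) Left join with an OR condition, rewritten as a union of two single-condition left joins.
val byC1 = t1.join(t2, t1("c1") === t2("c1"), "left_outer")
val byC2 = t1.join(t2, t1("c2") === t2("c2"), "left_outer")
val rewritten = byC1.union(byC2)   // deduplicate afterwards if the original OR semantics require it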
Learnings
Optimize throughput for the HBase Spark connection

Observations
• Batch puts and gets were slow due to overloaded HBase connections
• Since our HBase rows were wide, HBase operations for partitions containing larger groups were slow

Resolution
• Implemented a sliding window for HBase operations to reduce HBase connection overload

Example (pseudo code from the slide): repartition the RDD, then for each RDD partition perform the HBase batch operation per sliding window.

val rePartitionedRDD: RDD[Event] = filledRDD.repartition(2000)
// ... grouping steps elided on the slide ...
groupedEventRDD.mapPartitions { p =>
  p.sliding(2000, 2000).foreach { window =>
    // create HBase connection
    // batch HBase read or write for the window
    // close HBase connection
  }
  ...
}
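Filling in the HBase calls that the slide leaves as pseudo code, a hedged sketch using the standard HBase client API could look like this. The Event shape, table name and column family are assumptions, and foreachPartition is used here because the pass is write-only.

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Assumed shape of the event records; the real schema is not shown in the deck.
case class Event(rowKey: String, payload: String)

groupedEventRDD.foreachPartition { partition =>
  partition.sliding(2000, 2000).foreach { window =>
    // One short-lived connection per sliding window keeps the connection load bounded,
    // as described on the slide.
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("merchant_events"))   // hypothetical table
    try {
      val puts = window.map { e =>
        new Put(Bytes.toBytes(e.rowKey))
          .addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(e.payload))
      }
      table.put(puts.asJava)   // batched write for the window
    } finally {
      table.close()
      connection.close()
    }
  }
}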
Tools & Utilities
Behavioral Driven Development

• While unit tests are more about the implementation, BDD emphasizes the behavior of the code
• "Specifications" are written in pseudo-English
• Enables testing at the external touch-points of your application

Feature: Identify the activity related to an event
Scenario: Should perform an iteration on events, join to the activity table and identify the activity name
  Given I have a set of events
    |cookie_id:String |page_id:String|last_actvty:String|
    |263FHFBCBBCBV    |login_provide |review_next_page  |
    |HFFDJFLUFBFNJL   |home_page     |provide_credent   |
  And I have an Activity table
    |last_activity_id:String|activity_id:String|activity_name:String|
    |review_next_page       | 1494300886856    |Reviewing Next Page |
    |provide_credent        | 2323232323232    |Provide Credentials |
  When I implement Event Activity joins
  Then the final result is
    |cookie_id:String |activity_id:String|activity_name:String|
    |263FHFBCBBCBV    | 1494300886856    |Reviewing Next Page |
    |HFFDJFLUFBFNJL   | 2323232323232    |Provide Credentials |
Pseudo code for the Cucumber step definitions (Scala):

import cucumber.api.scala.{EN, ScalaDsl}
import cucumber.api.DataTable
import org.scalatest.Matchers

Given("""^I have a set of events$""") { (data: DataTable) =>
  eventdataDF = dataTableToDataFrame(data)
}

Given("""^I have an Activity table$""") { (data: DataTable) =>
  activityDataDF = dataTableToDataFrame(data)
}

When("""^I implement Event Activity joins$""") { () =>
  eventActivityDF = Activity.findAct(eventdataDF, activityDataDF)
}

Then("""^the final result is$""") { (expectedData: DataTable) =>
  val expectedDF = dataTableToDataFrame(expectedData)
  val resultDF   = eventActivityDF
  resultDF.except(expectedDF).count shouldBe 0
}
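The step definitions rely on a dataTableToDataFrame helper that the deck does not show. A minimal sketch, assuming the first table row holds "name:Type" headers and treating every column as a string:

import scala.collection.JavaConverters._
import cucumber.api.DataTable
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: converts a Cucumber DataTable into a Spark DataFrame.
def dataTableToDataFrame(data: DataTable)(implicit spark: SparkSession): DataFrame = {
  val cells  = data.raw().asScala.map(_.asScala.map(_.trim).toSeq).toSeq
  val header = cells.head.map(_.split(":")(0))   // drop the ":Type" suffix
  val schema = StructType(header.map(StructField(_, StringType, nullable = true)))
  val rows   = cells.tail.map(Row.fromSeq)
  spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
}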
Data Quality Tool

• A config-driven, automated tool written in Spark for quality control
• Used extensively during functional testing of the application; once live, used as the quality check for our data pipeline
• Can compare tables (schema agnostic and at scale) for data validation, helping engineers troubleshoot effectively

Quality Tool Flow
1. Define the source query, target query and test operation in a config file (config in SQL format), covering the source tables and the output table
2. A Spark job takes the config, runs the test cases, and produces reports and alerts

Example config:
| Operation | Source Query | Target Query            | Key Column |
| Count     | Select c1 …  | Select c1,c2,c3 ….      | C1         |
| Values    | Select c1….… | Select c1,c2,c3 from t1 | C1         |

Quality operations: Count, Aggregation, Values, DuplicateRows, MissingInLookup
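A minimal sketch of how two of these operations could be evaluated in Spark; the case class and function names are hypothetical, not the tool's actual API.

import org.apache.spark.sql.SparkSession

// Hypothetical shape of one configured test case.
case class QualityCheck(operation: String, sourceQuery: String, targetQuery: String, keyColumn: String)

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// "Count" operation: the two queries must return the same number of rows.
def runCountCheck(check: QualityCheck): Boolean =
  spark.sql(check.sourceQuery).count() == spark.sql(check.targetQuery).count()

// "Values" operation: rows returned by the source query but absent from the target query.
// Both queries are expected to project the same columns, keyed by check.keyColumn.
def runValuesCheck(check: QualityCheck): Long =
  spark.sql(check.sourceQuery).except(spark.sql(check.targetQuery)).count()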
Druid Integration with BI: visualization at scale

• Druid is an open-source time-series data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence queries on event data*
• Traditional databases did not scale and perform with Tableau dashboards (for many use cases)
• Enable Tableau dashboards with Druid as the serving platform
• A live connection from Tableau to Druid avoids being limited by storage at any layer

Architecture: our datasets are batch-ingested from Hadoop (HDFS) into the Druid cluster, where historicals serve data segments backed by deep storage (HDFS); the Druid broker exposes Druid SQL to the visualization layer (Tableau, SQL clients, custom apps).

* from http://Druid.io
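For illustration, Druid SQL can also be queried from a plain SQL client over Druid's Avatica JDBC endpoint. A hedged Scala sketch follows; the broker host, port, datasource and columns are assumptions, and the Avatica JDBC driver must be on the classpath.

import java.sql.DriverManager

val url  = "jdbc:avatica:remote:url=http://druid-broker:8082/druid/v2/sql/avatica/"
val conn = DriverManager.getConnection(url)
val stmt = conn.createStatement()
val rs = stmt.executeQuery(
  """SELECT merchant_id, SUM(txn_amount) AS total_amount
    |FROM merchant_metrics
    |WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
    |GROUP BY merchant_id""".stripMargin)
while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getDouble(2)}")
rs.close(); stmt.close(); conn.close()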
Conclusion

 Spark applications on YARN (Hortonworks distribution)
 Spark jobs were easy to write and had excellent performance (though a little hard to troubleshoot)
 Spark-HBase optimization improved performance
 Pushed pre-aggregated datasets to Elasticsearch
 Pushed lowest-granularity denormalized datasets to Druid
 Behavior Driven Development is a great add-on for product-backed applications
QUESTIONS?

Editor's Notes

  • #6: Today PayPal is much more than a button on a website. We have an extensive portfolio of products & services. Enabling CBT, easy mobile & web access, credit options for customers, marketing solutions for merchants, and more helps merchants grow their business and enables safe digital commerce for customers.
  • #7: All of these also translate into a rich set of data that PayPal uses to inform strategic and operational decisions.
  • #16: Concurrent Mark Sweep: if it does not finish garbage collection in time, it falls back to a stop-the-world GC. Tuned CMSMaxAbortablePrecleanTime from 10 seconds to 30 seconds.
  • #17: https://community.hortonworks.com/questions/44950/spark-memory-issue.html. org.apache.spark.shuffle.MetadataFetchFailedException while running this job with 4 cores and 200 executors. There could be multiple reasons for the delay, such as skew in the data; for us it turned out that the datanode the executor was running on was busy, and this often happened on nodes with limited capacity. More tasks per executor theoretically puts more pressure on the executor, and under memory constraints the chance of an executor failure increases. MetadataFetchFailed usually happens due to executor failure or executor termination.
  • #18: https://engineering.paypalcorp.com/confluence/display/EDS/Muse+Visitor+Count+Job+Split+Design
  • #21: Combinations
  • #25: The tool was completely customizable for each project. It was built to be schema agnostic and scalable to run on large datasets. Reports were generated on match/mismatch counts by key columns such as Product and Geography, as needed.