From Insights to Value - Building a Modern Logical Data Lake to Drive User Adoption and Business Value

Vineet Tyagi / Impetus

We Make Big Data Work
We have been supporting several Fortune 500 customers on their Big Data journey for the past
10 years.
Across the board we have seen:
• Fast changing analytic and reporting requirements
• Lack of end user self service capabilities
• Need for better collaboration and agility in working with trusted data
• Data sits in silos, making it difficult to get closer to customers; data from traditional
sources also remains important.

© 2017 Impetus Technologies – Confidential
“Enterprises today are realizing about
15% of potential ROI on BI investments”

“Fragmented purpose driven Hadoop data
lakes are creating integration challenges”

The new normal for enterprise IT
EDW + BDW (Lake/s) == Unified Enterprise Data

Making insights and data in the lake readily discoverable, accessible and usable
“Visual data-discovery, an important enabler of end user self-service”
Challenge 1 : Providing a complete seamless view of business data

Challenge 2 : Simplified & Self-Serve enablement of BI
• Provision Cluster
• Discover and Blend New Sources
• Data Access and Exploration
• Ingest and Transform Data
• Security and Governance
• BI, Analytics and Models

Challenge 3 : Support Use-Case driven Data Access mechanisms
• SQL: Specific Query & Reporting
• OLAP: Cross Dimensional Fast Slice, Dice and Drill Down
• Data Virtualization: Data from MPP, Relational and Hadoop
• Search: Finding the “Needle in a Haystack”
• Self Service Data Discovery: “Don’t Know What You Don’t Know”

Challenge 4 : Leverage EDW and BDW coexistence
Optimizing the placement of enterprise workloads and the data on which they operate
The multi-platform environment is the warehouse
Frees capacity on high-end analytic and data warehouse systems
• Immediate ROI on Hadoop
Get a platform better suited to advanced analytics

Challenge 5 : Collaboration & Reuse
Collaboration & Reuse of data and analytical assets on a logical data lake
Data Democratization
• Lowering adoption barriers for your stakeholders
• Getting the data they want should be fast and easy
Analytical Democratization
• Publishing and Discovery of analytical assets
• Ability to reuse in simple and consistent way

Let’s Build the Logical Lake

Logical Data Lake : Modern Analytical Data Fabric
Landing and ingestion: Structured, Unstructured, External Social, Machine, Geospatial, Time Series, Streaming
Enterprise Data Lake
Real-Time applications
Data Federation/Virtualization
Exploration & discovery
Data Wrangling
Traditional data repositories: RDBMS, MPP
Accelerators
Enterprise Meta Data Management
Provisioning, Workflow, Monitoring and Governance

Providing a complete seamless view of business data
• Simple, consistent view of meta information
• Automated sourcing and seeding
• Social and usage based enrichment
• Not ONLY a data catalogue
• Analytical asset catalogue
• Leverage and supplement existing business ontologies

Leverage EDW and BDW coexistence
Optimizing the placement of enterprise workloads and the data on which they operate
“Right Positioning” of workloads based on price / performance
• Most bang for the buck
Build a platform better suited to advanced analytics with Big Data technologies
• Retaining what works “in situ”

Technical Perspective and Choices

Architectural Patterns
Architectural Patterns: Streaming
Pattern 1: Streaming Ingestion
Pattern 2: Near real time event Processing with external context.
Pattern 3: Near Real Time Partitioned event processing with external context.
Pattern 4: Complex topology for Aggregations or Machine Learning.
Architectural Patterns: Batch + Streaming
Pattern 5: The Lambda Architecture: Hadoop and Storm
Pattern 6: Merging Batch and Streaming: Kappa, a post-Lambda architecture.
Pattern 7: Unified Batch and Stream Processing: Flink or Spark

Pattern 1: Streaming Ingestion
Use Case Scenarios
1. Efficiently collecting, aggregating, and moving large amounts of streaming
data into Hadoop cluster
2. Emphasis on low-latency persisting of events to
• HDFS
• Apache HBase
• Apache Solr

Pattern 2: Near real time event Processing
Use Case Scenarios:
Alerting, flagging, transforming and filtering of events as they arrive.
Take immediate decisions to transform the data
Take some sort of external action.
The decision often depends on external profile or metadata.
The user code can interact with local memory or a distributed cache
The user code can interact with an external storage system such as HBase
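
As a rough illustration of this pattern, the sketch below filters, enriches and flags events one at a time using profile data held in local memory. The `PROFILE_CACHE` contents, the field names and the alert rule are all assumptions made for the example; in practice the lookup might go to a distributed cache or a store such as HBase.

```python
from typing import Dict, List, Optional

# Hypothetical external profile data, held in local memory for this sketch.
PROFILE_CACHE: Dict[str, dict] = {
    "u1": {"tier": "gold", "limit": 1000},
    "u2": {"tier": "basic", "limit": 100},
}

alerts: List[dict] = []


def process_event(event: dict) -> Optional[dict]:
    """Filter, enrich and flag one event as it arrives."""
    profile = PROFILE_CACHE.get(event["user"])
    if profile is None:
        return None  # filter: drop events with no known profile
    enriched = {**event, "tier": profile["tier"]}  # transform: add external context
    if event["amount"] > profile["limit"]:
        alerts.append(enriched)  # external action: record an alert
    return enriched


events = [
    {"user": "u1", "amount": 500},
    {"user": "u2", "amount": 250},
    {"user": "u3", "amount": 10},
]
results = [process_event(e) for e in events]
```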

Pattern 3: Near Real Time with partitioned external context
Use Case Scenarios:
When the external context information required for event processing doesn’t fit in
local memory
When calling an external system such as HBase does not meet the SLA requirements
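
One way to picture the partitioned variant: events and context are split by the same key, so each worker holds only its own shard of the context and can enrich events from local memory. This is a minimal sketch; `NUM_PARTITIONS`, the CRC32-based routing and the sample data are assumptions, not a description of any particular framework.

```python
import zlib
from typing import Dict, List

NUM_PARTITIONS = 2  # hypothetical number of workers


def partition_for(key: str) -> int:
    """Stable hash so a given key always routes to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# The full context is assumed too large for one node; each worker keeps one shard.
full_context = {"u1": "gold", "u2": "basic", "u3": "silver", "u4": "gold"}
shards: List[Dict[str, str]] = [{} for _ in range(NUM_PARTITIONS)]
for key, value in full_context.items():
    shards[partition_for(key)][key] = value


def process(event: dict) -> dict:
    """Route the event to its partition and enrich it from that local shard."""
    p = partition_for(event["user"])
    return {**event, "partition": p, "tier": shards[p][event["user"]]}


enriched = [process({"user": u, "amount": 1}) for u in full_context]
```

Because events and context share the partitioning function, no lookup ever has to leave the worker that owns the key.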

Pattern 4: Complex topology for Aggregations or ML
Use Case Scenarios:
• Real-time results from a complex and flexible set of operations
• Complex operations like counts, averages, sessionization
• Results often depend upon windowed computations or require more active data
• Focus shifts from ultra-low latency to functionality and accuracy.
• Machine-learning model building that operates on batches of data.
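
The windowed operations mentioned above can be sketched with two small functions: a tumbling-window count and average, and a gap-based sessionizer. Both run over an in-memory list for clarity; the function names, the millisecond timestamps and the gap threshold are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tumbling_window_stats(
    events: List[Tuple[int, float]], window_ms: int
) -> Dict[int, Tuple[int, float]]:
    """Group (timestamp_ms, value) events into fixed windows; return (count, average)."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_ms) * window_ms
        buckets[window_start].append(value)
    return {
        start: (len(vals), sum(vals) / len(vals))
        for start, vals in sorted(buckets.items())
    }


def sessionize(timestamps: List[int], gap_ms: int) -> List[List[int]]:
    """Split event times into sessions separated by gaps longer than gap_ms."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_ms:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions


stats = tumbling_window_stats([(100, 2.0), (900, 4.0), (1200, 6.0)], window_ms=1000)
sessions = sessionize([0, 500, 5000, 5200], gap_ms=1000)
```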

Pattern 5: Lambda Architecture
Speed Layer (Storm, Flume etc.)
1. Compensates for recently updated data
2. Does fast incremental computations on the newly arrived data
3. The batch layer eventually overrides the speed layer's results.
4. Provides random reads and random writes
Serving Layer (HBase, Solr etc.)
1. Random access to batch views
2. Bulk updates from the Batch Layer
3. No random writes
Batch Layer (Hadoop, Solr cluster)
1. Stores the master data set.
2. Computes batch views
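
A toy sketch of how the three layers fit together, assuming a simple event-count view: the batch layer recomputes the view from the master data set, the speed layer counts events that arrived since the last batch run, and queries merge the two. All names and the sample events are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Batch layer: the master data set is an immutable record of all events.
master_data: List[str] = ["a", "b", "a"]


def compute_batch_view(events: List[str]) -> Dict[str, int]:
    """Batch layer: recompute the view from the full master data set."""
    return dict(Counter(events))


batch_view = compute_batch_view(master_data)  # bulk-loaded into the serving layer

# Speed layer: incremental counts for events that arrived after the last batch run.
realtime_view: Counter = Counter()
for event in ["a", "c"]:
    realtime_view[event] += 1


def query(key: str) -> int:
    """Serve a complete answer by merging the batch and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


before = {k: query(k) for k in "abc"}

# The next batch run absorbs the new events, so the speed layer's results
# are superseded and can be discarded.
master_data.extend(["a", "c"])
batch_view = compute_batch_view(master_data)
realtime_view.clear()
after = {k: query(k) for k in "abc"}
```

The answers before and after the batch run are the same; only the layer that supplies them changes.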

Pattern 6: Kappa Architecture
Problems with Lambda Architecture:
• You implement your transformation logic twice, once in the batch system and once in
the stream processing system. The two need to stay in sync to give the right result.
• You stitch together the results from the batch views and the real-time views to produce a
complete answer.
The Kappa architecture switches to a canonical data store that is an append-only, immutable log, instead
of a relational (SQL) database or a key-value store like Cassandra. The serving layer uses the data streamed
through the computational system and stored in auxiliary stores for serving.
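
The idea can be sketched with an in-memory append-only log and a single piece of processing logic that is replayed over it to build a serving view. The `AppendOnlyLog` class and `build_view` function are hypothetical illustrations, not the API of any log system.

```python
from typing import Callable, Dict, List


class AppendOnlyLog:
    """Canonical store: records can be appended and read, never updated in place."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def append(self, record: dict) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset: int = 0) -> List[dict]:
        return list(self._records[offset:])


def build_view(log: AppendOnlyLog, logic: Callable[[dict], int]) -> Dict[str, int]:
    """Replay the log through one piece of stream logic into a serving store."""
    view: Dict[str, int] = {}
    for record in log.read_from(0):
        view[record["user"]] = view.get(record["user"], 0) + logic(record)
    return view


log = AppendOnlyLog()
for user, amount in [("u1", 5), ("u2", 7), ("u1", 3)]:
    log.append({"user": user, "amount": amount})

# Version 1 of the logic sums amounts; version 2 counts events. Changing the
# logic means replaying the same log, not maintaining a second batch code path.
totals_view = build_view(log, lambda r: r["amount"])
counts_view = build_view(log, lambda r: 1)
```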

Pattern 7: Unified Batch and Streaming Framework
No.  Technology  Strengths
1    Flink       + Distributed stream and batch data processing
                 + Distributed computations over data streams
                 + High throughput and low latency
                 + Exactly-once guarantee
                 + Batch processing applications as special cases of stream processing
2    Spark       + Fast
                 + Low latency
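
The notion of batch processing as a special case of stream processing can be illustrated without either framework. In this plain-Python sketch, one function is applied unchanged to a bounded list and to an unbounded generator; `running_total` and `sensor_stream` are hypothetical names and the code does not use the Flink or Spark APIs.

```python
import itertools
from typing import Iterable, Iterator


def running_total(values: Iterable[int]) -> Iterator[int]:
    """One piece of logic that works on bounded and unbounded inputs alike."""
    total = 0
    for v in values:
        total += v
        yield total


def sensor_stream() -> Iterator[int]:
    """Hypothetical unbounded source emitting 1, 2, 3, ..."""
    n = 0
    while True:
        n += 1
        yield n


# Batch: a bounded input is processed to completion.
batch_result = list(running_total([1, 2, 3, 4]))

# Stream: the same logic runs over an unbounded input; we take the first 4 results.
stream_result = list(itertools.islice(running_total(sensor_stream()), 4))
```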

Time Range based technology choices
• 50 ms: Flume Interceptors, Custom
• > 500 ms: Impala, Spark Streaming, Storm Trident
• > 30,000 ms: Impala, Spark, Tez
• > 90,000 ms: Map Reduce, Impala, Spark, Tez

Logical Data Lake : Modern Analytical Data Fabric
Landing and Ingestion: Structured, Unstructured, External Social, Machine, Geospatial, Time Series, Streaming
Enterprise Data Lake
Real-Time Applications
Data Federation/Virtualization
Exploration & Discovery
Data Wrangling
Traditional Data Repositories: RDBMS, MPP
Accelerators
Enterprise Meta Data Management
Provisioning, Workflow, Monitoring and Governance
Data Blending
Metadata & Discovery
Workload Migration

Thank you.
Questions?
