© Hortonworks Inc. 2013
Modern Data Architecture
… and the Data Lake
John Haddad
Senior Director Product Marketing - Informatica
Jim Walker
Director Product Marketing - Hortonworks
Your Presenters
• John Haddad
– Senior Director Product Marketing, Informatica
– Over 25 years' experience developing and
marketing enterprise applications
– Enjoys art, science, and the great outdoors
• Jim Walker
– Director Product Marketing, Hortonworks
– Over 20 years in data management as a
developer and a marketer
– Amateur Photographer
Today’s Topics
• Introduction
• Drivers for the Modern Data Architecture (MDA)
• Apache Hadoop in the MDA
• Informatica’s role in the MDA
• Q&A
Enterprise Data Architecture
[Diagram: traditional enterprise data architecture]
•  Applications: Business Analytics, Custom Applications, Packaged Applications
•  Data systems: Repositories (RDBMS, EDW, MPP)
•  Data sources: Traditional Sources (RDBMS, OLTP, OLAP); OLTP, POS systems
Traditional Approach – Under Pressure
[Diagram: enterprise data architecture under pressure from new data sources]
•  Applications: Business Analytics, Custom Applications, Packaged Applications
•  Data systems: Repositories (RDBMS, EDW, MPP)
•  Data sources: Traditional Sources (RDBMS, OLTP, OLAP); OLTP, POS systems; New Sources (sentiment, clickstream, geo, sensor, …)
Source: IDC
•  2.8 ZB in 2012
•  85% from new data types
•  15x machine data by 2020
•  40 ZB by 2020
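As a rough check on the IDC figures quoted above, the growth rate implied by going from 2.8 ZB in 2012 to 40 ZB in 2020 can be worked out directly. This is an illustrative calculation only: the variable names are ours, and the deck does not show how IDC derived its estimates.

```python
# Implied growth between the two IDC data points quoted on the slide.
start_zb, start_year = 2.8, 2012
end_zb, end_year = 40.0, 2020

years = end_year - start_year
growth_factor = end_zb / start_zb            # overall multiple over the period
cagr = growth_factor ** (1 / years) - 1      # compound annual growth rate

print(f"{growth_factor:.1f}x overall, about {cagr:.0%} per year")
```

Under these two data points the volume grows roughly 14-fold over eight years, which corresponds to a little under 40% growth per year.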
Modern Data Architecture Enabled
[Diagram: modern data architecture]
•  Applications: Business Analytics, Custom Applications, Packaged Applications
•  Data systems: Repositories (RDBMS, EDW, MPP)
•  Data sources: Traditional Sources (RDBMS, OLTP, OLAP); OLTP, POS systems; New Sources (sentiment, clickstream, geo, sensor, …)
•  Operational tools: manage & monitor
•  Dev & data tools: build & test
Hadoop Powers Modern Data Architecture
Apache Hadoop is an open source project
governed by the Apache Software Foundation
(ASF) that allows you to gain insight from massive
amounts of structured and unstructured data
quickly and without significant investment.
[Diagram: a Hadoop cluster made up of many nodes, each providing compute & storage]
Hadoop clusters provide
scale-out storage and
distributed data processing
on commodity hardware
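The scale-out idea can be illustrated without a cluster. The sketch below is plain Python, not the Hadoop API: it deals hypothetical log lines into chunks (standing in for blocks stored on different nodes), runs a map step on each chunk independently, and then reduces the partial results. The sample data and all function names are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical server-log lines; on a real cluster these would be file blocks
# spread across many nodes.
LOG_LINES = [
    "GET /home 200", "GET /cart 500", "GET /home 200",
    "POST /login 200", "GET /cart 500", "GET /home 404",
]

def split_into_chunks(lines, n_chunks):
    """Stand-in for block placement: deal lines round-robin into chunks."""
    return [lines[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk):
    """Map step: each 'node' counts status codes in its own chunk only."""
    return Counter(line.split()[-1] for line in chunk)

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

chunks = split_into_chunks(LOG_LINES, n_chunks=3)
partials = [map_chunk(c) for c in chunks]    # independent, so they could run in parallel
status_counts = reduce(reduce_counts, partials, Counter())
print(dict(status_counts))
```

Because each map call touches only its own chunk, adding nodes adds both storage and processing capacity, which is the property the slide attributes to Hadoop clusters.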
Drivers for Hadoop Adoption
•  Driving Efficiency: Modern Data Architecture
– Hadoop has a central role in next-generation data architectures while integrating with existing data systems
•  Driving Opportunity: Business Applications
– Use Hadoop to extract insights that enable new customer value and competitive edge
•  Big Data Sets, from existing to emerging: Traditional, Server log, Clickstream, Sentiment/Social, Machine/Sensor, Geo-locations
Opportunity in types of data
1.  Sentiment
Understand how your customers feel about your brand and
products – right now
2.  Clickstream
Capture and analyze website visitors’ data trails and
optimize your website
3.  Sensor/Machine
Discover patterns in data streaming automatically from
remote sensors and machines
4.  Geographic
Analyze location-based data to manage operations where
they occur
5.  Server Logs
Research logs to diagnose process failures and prevent
security breaches
6.  Unstructured (text, video, pictures, etc.)
Understand patterns in files across millions of web pages,
emails, and documents
Efficiency in Modern Data Architecture
•  Drive efficiency via modern data architecture
•  Store data once and access it in many ways
•  Often referred to as a data lake or data repository
•  Infrastructure platform driven
•  IT-oriented, TCO based
[Diagram: modern data architecture]
•  Applications: Business Analytics, Custom Applications, Packaged Applications
•  Data systems: Repositories (RDBMS, EDW, MPP)
•  Data sources: Traditional Sources (RDBMS, OLTP, OLAP); OLTP, POS systems; New Sources (sentiment, clickstream, geo, sensor, …)
Engineered for Interoperability
[Diagram: layered architecture with the labels Applications; Data Systems (Traditional Repos); Dev & Data Tools; Operational Tools; Data Integration; Data Sources]
•  Vendor names visible on the diagram: Viewpoint, Microsoft Applications
•  Data sources: Traditional Sources (RDBMS, OLTP, OLAP); New Sources (sentiment, clickstream, geo, sensor, …)
Requirements for Hadoop Adoption
3 Requirements for Hadoop’s Role in the Modern Data Architecture
•  Integrated: interoperable with existing data center investments
•  Key Services: platform, operational and data services essential for the enterprise
•  Skills: leverage your existing skills in development, operations, analytics
Today’s Topics
• Introduction
• Drivers for the Modern Data Architecture (MDA)
• Apache Hadoop’s role in the MDA
• Informatica’s role in the MDA
• Q&A
Hortonworks & Informatica: Reference Architecture
[Diagram: Informatica Visual Development Environment working with a Hadoop cluster]
•  Source data: JMS queues, servers & mainframe, files, databases, sensor data, social
•  Load (HDFS API): batch, replicate, stream, archive
•  Hadoop: HDFS, YARN, MapReduce, Ambari
•  Data refinement on Hive (HiveQL and UDFs): profile, parse, ETL, cleanse, match
•  Interface: Hive, JDBC, HDFS API
•  Load to enterprise repositories: EDW, MDM, data virtualization, batch, CEP
Data Lake Processes
•  Sources: Mobile Apps; Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails
•  Components: Data Integration Hub, Data Virtualization, MDM, Enterprise Data Warehouse, Visualization & Analytics
1. Load or archive batch data
2. Replicate changed data & schemas
3. Stream real-time data
4. Mask sensitive data
5. Access customer “golden record”
6. Refine & curate data
7. Move results to EDW
8. Explore & validate data
9. Govern & enrich with metadata
10. Correlate real-time events with historical patterns & trends
11. Subscribe to datasets
Telco Call Detail Record (CDR)
Use-Case
Use-Case: CDR Processing
•  Each job picks up a number of files containing text CDRs (Call Detail Records)
•  First task is to merge partial call records
– Some records may be partial, e.g. multiple records for a single call due to a dropped line or switching cell towers
– Partial records need to be merged, and their call times summed into the duration of the merged record
– Partial records for a single call may reside in multiple files or be included in different jobs
– Incomplete partial records need to be reprocessed by consecutive jobs
•  Second task is to sort all processed CDRs by calling number
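The two tasks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Informatica mapping shown in the deck: the `Cdr` fields, the `final` flag marking the closing segment of a call, and the `process_job` function are all hypothetical, since the deck does not give the CDR file layout.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Cdr:
    call_key: tuple      # hypothetical: the fields that identify one call
    caller: str          # calling number
    duration: int        # seconds covered by this record
    partial: bool        # True if the record is one segment of a call
    final: bool = True   # for partial records: True on the closing segment

def process_job(new_records, carried_over=()):
    """Merge partial CDRs, then sort completed calls by calling number.

    Returns (completed, still_incomplete); the second list feeds the next job.
    """
    completed = []
    segments = defaultdict(list)
    for rec in list(carried_over) + list(new_records):
        if rec.partial:
            segments[rec.call_key].append(rec)
        else:
            completed.append(rec)

    still_incomplete = []
    for key, parts in segments.items():
        if any(p.final for p in parts):
            # Closing segment seen: merge and sum the partial durations.
            total = sum(p.duration for p in parts)
            completed.append(Cdr(key, parts[0].caller, total, partial=False))
        else:
            # No closing segment yet: hand the parts to a later job.
            still_incomplete.extend(parts)

    completed.sort(key=lambda r: r.caller)
    return completed, still_incomplete

job1 = [
    Cdr(("A", 1, 100), "555-0200", 60, partial=False),
    Cdr(("B", 2, 200), "555-0100", 30, partial=True, final=False),
    Cdr(("C", 3, 300), "555-0300", 10, partial=True, final=False),
    Cdr(("B", 2, 200), "555-0100", 45, partial=True, final=True),
]
done1, pending = process_job(job1)

job2 = [Cdr(("C", 3, 300), "555-0300", 20, partial=True, final=True)]
done2, pending2 = process_job(job2, carried_over=pending)
```

The sketch treats a call as complete once its closing segment has arrived; it does not detect a missing middle segment, which a production job would likely need to handle.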
Input CDR File Example
•  These 3 numbers uniquely identify a call
•  Partial calls start with 1 and end with 0
•  Some partial records are incomplete
•  Processed completed records are sorted by caller
Output CDR Files
•  Completed Calls
– Partial records are merged into a single completed record
– Duration times are added to the merged records
•  Partial Calls
– Partial records will be reprocessed
Logical Design
•  Separate partial records from completed records (partial records only; completed records only)
•  Separate incomplete and complete partial records
•  Select incomplete partial records
•  Aggregate all completed and partial-completed records
Viewing Data at Design Time
More Data at Design Time
Constructing Logical Expressions
More Logic
Check Outcomes
Choose Where to Process
Hadoop Execution Plan
Monitor Processing
Results in HDFS
CDR Pipeline
•  Scenario: filter records by Date, City and Province; aggregate and summarize records by a composite Key
•  Pipeline steps shown: Read Files; Filter by Collection Date; City Code Lookup; Filter by Province ID; Sort records by Key; Summarize by Key Group; Write report
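The scenario above can be sketched as an in-memory pipeline. This is an illustration only: the record fields, the `CITY_LOOKUP` table, the choice of composite key, and the `run_pipeline` function are assumptions, as the deck shows the real pipeline only as a screenshot.

```python
from collections import defaultdict
from datetime import date

# Hypothetical reference data: city code -> (city name, province id).
CITY_LOOKUP = {"C01": ("Springfield", "P1"), "C02": ("Shelbyville", "P2")}

def run_pipeline(records, collection_date, province_id):
    """Filter by date and province, then summarize by a composite key."""
    totals = defaultdict(lambda: {"calls": 0, "seconds": 0})
    for rec in records:                                    # read files
        if rec["collection_date"] != collection_date:      # filter by collection date
            continue
        city = CITY_LOOKUP.get(rec["city_code"])           # city code lookup
        if city is None or city[1] != province_id:         # filter by province ID
            continue
        key = (rec["collection_date"], city[0], rec["caller"])  # composite key
        totals[key]["calls"] += 1
        totals[key]["seconds"] += rec["duration"]
    # Sort by key and emit one report row per key group.
    return [(key, agg["calls"], agg["seconds"]) for key, agg in sorted(totals.items())]

records = [
    {"collection_date": date(2013, 6, 1), "city_code": "C01", "caller": "555-0100", "duration": 60},
    {"collection_date": date(2013, 6, 1), "city_code": "C01", "caller": "555-0100", "duration": 30},
    {"collection_date": date(2013, 6, 1), "city_code": "C02", "caller": "555-0200", "duration": 15},
    {"collection_date": date(2013, 6, 2), "city_code": "C01", "caller": "555-0100", "duration": 99},
    {"collection_date": date(2013, 6, 1), "city_code": "C99", "caller": "555-0300", "duration": 5},
]
report = run_pipeline(records, date(2013, 6, 1), "P1")
```

Records with an unknown city code are dropped here; whether the real pipeline rejects or keeps them is not stated in the deck.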
Design Environment
Adding Transactional Source
•  Scenario: report website use (Facebook, Twitter, etc.) by Age and by Postal Code
•  Pipeline steps: read WAP log records; get MSISDN and URL fields; look up Age, Postal Code by MSISDN; count URL frequency; calculate percentages
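A small Python sketch of this scenario follows. The log line format, the `SUBSCRIBERS` table (standing in for the transactional source), and the `site_usage` function are hypothetical; the deck does not show the actual WAP log layout or lookup tables.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical subscriber data, as it might come from a transactional source.
SUBSCRIBERS = {
    "15550100": {"age": 24, "postal_code": "94105"},
    "15550101": {"age": 24, "postal_code": "94105"},
    "15550102": {"age": 51, "postal_code": "10001"},
}

# Hypothetical WAP log lines in the form "<msisdn> <url>".
WAP_LOG = [
    "15550100 http://facebook.com/home",
    "15550100 http://twitter.com/feed",
    "15550101 http://facebook.com/photos",
    "15550102 http://twitter.com/feed",
    "15559999 http://facebook.com/home",   # unknown subscriber, skipped below
]

def site_usage(log_lines, subscribers):
    """Return {(age, postal_code): {site: percent of that group's requests}}."""
    counts = defaultdict(Counter)
    for line in log_lines:
        msisdn, url = line.split(maxsplit=1)       # get MSISDN and URL fields
        profile = subscribers.get(msisdn)          # look up age, postal code
        if profile is None:
            continue
        group = (profile["age"], profile["postal_code"])
        counts[group][urlparse(url).netloc] += 1   # count URL frequency per site

    report = {}
    for group, sites in counts.items():            # calculate percentages
        total = sum(sites.values())
        report[group] = {site: 100 * n / total for site, n in sites.items()}
    return report

usage = site_usage(WAP_LOG, SUBSCRIBERS)
```

Grouping by site host rather than full URL is a design choice made here so that different pages of the same site are counted together.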
Connecting to Relational Source
Result
•  Easily combine big data sources with transactional data
•  Example: report website use (Facebook, Twitter, etc.) by Age and by Region
•  Look-up of Age, Region by MSISDN
[Diagram labels: CRM, EDW, Log Files, HDFS]
Requirements for Hadoop Adoption
3 Requirements for Hadoop’s Role in the Modern Data Architecture
•  Integrated: interoperable with existing data center investments
•  Key Services: platform, operational and data services essential for the enterprise
•  Skills: leverage your existing skills in development, operations, analytics
Next Steps:
Learn more about Informatica and Hadoop
http://www.informatica.com/us/vision/harnessing-big-data/hadoop/
Get started on Hadoop with Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
Follow us:
@hortonworks, @informatica

Modern Data Architecture for a Data Lake with Informatica and Hortonworks Data Platform
