Advanced Analytics in Banking
Juan M. Huerta
Global Decision Management
VP Advanced Analytics
Citibank
I will talk about
‱ Big Data Adoption process at Citi
‱ Realizing the Technical Value of Big Data
‱ Global Solutions
140 countries
200 million accounts
Citi: A Customer Centered Organization
As a customer-centered bank, the goal of our Big Data strategy is to shift the focus from independent vertical silos to Common Horizontal Solutions focused around Citi’s 200-million customer accounts
Big Data Adoption Stakeholders
‱ Lines of Business
‱ Strategy & Decision Management Organizations: cross LOB & Geo, global
‱ Data Innovation Office: Governance & Regulatory
‱ CitiData – Big Data & Analytics Engineering
Big Data Adoption Roadmap
Adoption will not occur at once. The level of capability maturity across the organization will vary significantly.
In theory we think in terms of Staged Competencies of a Big Data Maturity Model.
In practice, a hybrid process, which fits the level of maturity of participants, is needed.
Diagram labels: Common Data; Common Analytic Platform; Common Tools & Techniques; Common Solutions; Common Focus; Strategy
Big Data Adoption Hybrid Participation Model
‱ Novice: Proof of Concept
‱ Expert: R&D Environment
‱ Shadowed
End-to-end Analytic Process for a POC Project
This is one component of the hybrid model
Process tracks: R&D process; Engineering / Production process

Approval gates
‱ R&D Project Approval: preliminary assessment for business value, data safekeeping and alignment to business practices
‱ Product Approval: formal approval process through Business Steering Committee based on understanding expected use of production data

Stages
Ideas and Hypotheses
‱ Pipeline of ideas to use data for competitive advantage
Information Asset Inventory Navigator (“IAIN”)
‱ Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis
Data Transformation & Provisioning
‱ Transformation rules executed to normalize and conform production data
‱ Conformed data set made available in production environment
Production Model Development
‱ Develop scalable, productizable analytics
Model Deployment
‱ Exploit insights and analyses across the enterprise to maximize value
‱ Models measured for quality / usage
Analytics Knowledge Management
‱ Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis
Data Set Preparation & Provisioning
‱ Basic preparation of data set (e.g., consolidation, conformation)
‱ Permission-based provisioning of data set into a Big Data Analytics environment
Analytics Execution
‱ Advanced analytic tools mine business insight from large volumes of data
Analytics Peer Review
‱ Data scientist peers review model findings and results
Data Acquisition
‱ Where necessary, acquire new data sets to support R&D project
Advanced Global Solutions
‱ A global solution is a tested algorithm or analytic model that carries out a particular business analysis and is leveraged at a global scale
‱ A big data global solution enables the interplay of complex algorithms and large datasets
‱ When a global solution is built upon big data approaches, a delivery roadmap should be considered
‱ In the exploratory process, a Global Solution is developed in the Innovation R&D environment and validated through a POC process
‱ Alignment with Innovation, UAT, PRD environments
Technical Value of Big Data:
Benchmarks and Analysis
The Boom Driving Big Data is Technological
Heebyung Koh, Christopher L. Magee. A functional approach for studying technological progress: Extension to energy technology. Technological Forecasting and Social Change, Volume 75, Issue 6, July 2008, Pages 735–758
The Quadrant Of Analytic Opportunity
Run Time is affected by Data Size and Algorithmic Complexity
Figure axes: Data Size and Algorithmic Complexity
Data sets and sizes:
‱ Mtg+Cards+Banking Accounts, Transaction features: 10^10
‱ Accounts, Transactions: 10^9
‱ Branches, Transactions: 10^9
‱ Accounts, Summary Stats.: 10^8
‱ Employees, Summary Stats.: 10^7
‱ GL-GOCS, GL-Entries: 10^6
‱ Branches, Summary Stats.: 10^5
Regions: Database Interaction; Traditional Statistical; Machine Learning; Big Data/Pattern Mining
Algorithms: Sequence Mining; Predictive filtering; Latent Dirichlet Allocation; HMM Baum-Welch O(ns nf nt); CART O(nf ns log ns); Iterative SVD-CF; K-means; Logistic Regression; PCA; PageRank; Self-Org. Maps; Neural Nets; Collaborative Filtering (CF); Vector based Approaches; HMM; Conditional Random Fields; Support Vector Machines
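To make the interplay of data size and algorithmic complexity concrete, the sketch below estimates operation counts for the CART bound O(nf ns log ns) shown on this slide, at data sizes from 10^5 to 10^10. The feature count of 50 and the throughput of 10^9 operations per second are illustrative assumptions, not figures from the talk, and constant factors are ignored.

```python
import math

# Hypothetical assumptions, not from the talk: 50 features and a machine
# that sustains 1e9 simple operations per second.
N_FEATURES = 50
OPS_PER_SECOND = 1e9


def cart_operations(n_samples: int, n_features: int = N_FEATURES) -> float:
    """Operation count for the O(nf * ns * log ns) bound, constants ignored."""
    return n_features * n_samples * math.log2(n_samples)


def estimated_seconds(n_samples: int) -> float:
    """Rough single-machine run time under the assumed throughput."""
    return cart_operations(n_samples) / OPS_PER_SECOND


if __name__ == "__main__":
    for exponent in range(5, 11):  # data sizes 10^5 through 10^10
        n = 10 ** exponent
        print(f"ns = 10^{exponent}: ~{estimated_seconds(n):,.1f} s")
```

Under these assumptions each tenfold increase in data size costs somewhat more than tenfold in run time, which is why the larger data sets call for parallel execution once the algorithm is more expensive than a simple scan.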
Breaking down the gains of P13n:
A Controlled Incremental Benchmark on a Workstation grade processor (x500)
Implemented an incremental-SVD (Netflix Prize) predictive model that runs on midsize datasets
‱ Compiled code (vs. interpreted): x30
‱ In memory (vs. disk access): x4
‱ Multithread (vs. single thread): x3.12
‱ Workstation grade processor: x1.3
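The overall x500 figure can be checked by multiplying the four individual gains. The sketch below assumes the factors compound independently, which is what an incremental benchmark implies; the dictionary keys are just labels for this illustration.

```python
import math

# Speedup factors as listed on the slide.
FACTORS = {
    "compiled code (vs. interpreted)": 30.0,
    "in memory (vs. disk access)": 4.0,
    "multithread (vs. single thread)": 3.12,
    "workstation grade processor": 1.3,
}


def combined_speedup(factors: dict[str, float]) -> float:
    """Multiply independent speedup factors into one overall factor."""
    return math.prod(factors.values())


if __name__ == "__main__":
    total = combined_speedup(FACTORS)
    print(f"combined speedup: x{total:.1f}")
```

The product is about 487, which the slide rounds to x500.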
Basic Map Reduce Benchmarks
Chart 1: Impact of overhead as a function of input volume (six input volumes; y-axis 0 to 0.7)
Chart 2: Relative Map Throughput as a function of # Mappers (y-axis: Relative Map CPU time speedup, 0 to 25; x-axis: Number of Maps, 0 to 20; linear trend line; legend values: 0.003351955, 0.032258065, 0.319148936, 1, 2.631578947, 21.12676056)
Chart 3: Tokens per Wall Clock Second by Number of Maps (y-axis 0 to 1600; x-axis 0 to 20; linear trend line)
HAMSTER: Hadoop Multi-signature Search for Text-based Entity Retrieval
‱ Core algorithm: String Edit Distance O(mnk2)
‱ Baseline runs at 100 matches per day
‱ HAMSTER speedup: 33x (5-node speedup) × 60x (Java speedup) ≈ 2000x faster
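For readers unfamiliar with the core operation, here is a minimal sketch of string edit distance for a single pair of strings. It is the textbook Levenshtein dynamic program, not HAMSTER's implementation: the multi-signature search, the Hadoop distribution and the slide's complexity bound are not reproduced, and the function name is chosen for this illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance via dynamic programming: O(m * n) time, O(n) space."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Matching every source item against every target item repeats this computation many times, and the pairs are independent of each other, which is what makes the workload a natural fit for map tasks.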
Source Items | Target Items | Source items per target | Input Size | MAP Records | Cluster Max Map Tasks | Effective Map Tasks | CPU map (secs) | Wall time
34k | 618k | 100 | 4.40GB | 345 | 33 | 33 | 196k | 2h 14 secs
34k | 618k | 50 | 8.8GB | 690 | 40 | 66 | 196k | 1h 47 min
34k | 618k | 30 | 14.6GB | 1,149 | 40 | 110 | 199k | 1h 39 min
Leveraging Global Big Data Global Solutions
Creating Global Big Data solutions
Our goal is to evolve from Big Data algorithms to Big Data Solutions
Example of Advanced Global Solution Matrix
Analytic patterns: Outlier Detection; Multivariate Segmentation; Sequence Matching; Network Analysis; Structured Prediction
Business areas: Customer Action; Contextual Marketing; Risk/Fraud; Clickstream Digital
Example entry: K-Medoids Clustering
Example: Transactional Time Series
Anomalous Behavior
On Demand Simulation: Generate Branches’ DNA
‱ Case Scenario: Unusual number of cash advances by 2 tellers.
Panels: Single day fraud; Multi day fraud; Original branch (August)
Creating Regions of Interest based on On-Demand-Simulation
Minimum-Spanning-Tree based branch association for region of interest generation
Figure labels: Multi-day fraud simulation; Original branch; Region of interest
‱ Numbers shown are randomized indices
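As an illustration of minimum-spanning-tree based association, the sketch below builds a tree over branch profiles with Prim's algorithm and lists the branches directly linked to a given one. The slide does not specify the features, the distance measure or how a region is read off the tree, so the Euclidean distance, the two-feature profiles and the `tree_neighbors` helper are all assumptions made for this example.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm over Euclidean distances; returns (from, to, distance) edges."""
    names = list(points)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the cheapest edge that connects the tree to a branch outside it.
        best = min(
            ((a, b, math.dist(points[a], points[b]))
             for a in in_tree for b in names if b not in in_tree),
            key=lambda edge: edge[2],
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges


def tree_neighbors(edges, branch):
    """Branches directly linked to `branch` in the tree (one hypothetical notion of a region)."""
    linked = set()
    for a, b, _ in edges:
        if a == branch:
            linked.add(b)
        elif b == branch:
            linked.add(a)
    return linked


if __name__ == "__main__":
    # Made-up two-feature profiles for four branches; not data from the talk.
    branches = {
        "B1": (0.0, 0.0),
        "B2": (1.0, 0.0),
        "B3": (1.0, 1.0),
        "B4": (5.0, 5.0),
    }
    tree = minimum_spanning_tree(branches)
    print(tree)
    print(tree_neighbors(tree, "B3"))
```

The tree keeps only the shortest links needed to connect all branches, so a simulated fraud profile that lands near an original branch ends up attached to it, and the attached branches form a candidate region to inspect.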
Conclusion: Lessons Learned
‱ One Size does not fit all
‱ Follow a Hybrid Approach
‱ Leverage Analytic patterns: Global Solutions
‱ Big Data is about Parallelization
‱ The future: expensive Algorithms applied to large datasets
‱ Global Solutions are the combination of algorithmic building blocks applied to specific business problems
Thank You!
