From a single droplet to a full bottle,
our journey to Hadoop at Coca-Cola East Japan
October 27, 2016
Information Systems, Enterprise Architect
& Innovation project manager
Damien Contreras
ダミアン コントレラ
In This Session
• About Coca-Cola East Japan
• Hadoop Journey at CCEJ
• Hadoop Projects
• Hadoop for the manufacturing industry
• Hadoop for CCEJ: What’s Next
• Coca-Cola East Japan was established on Jul. 1, 2013
through the merger of four bottlers.
• On Apr. 1, 2015, it underwent further business integration with
Sendai Coca-Cola Bottling Co., Ltd.
• Announced an MOU with Coca-Cola West on April 26, 2016 to proceed with discussions and a review of business integration opportunities
• Japan's largest Coca-Cola Bottler, with an extensive
local network, selling the most popular beverage
brands in Japan
Data as of December 2015
About Coca-Cola East Japan
CCEJ Data Landscape
DATA IN SILOS
(Datamart, ERP, DWH, Staging, Mainframe,…)
P2P INTERFACES
(No ESB, Multiple ETL & Interface Servers)
NO GOVERNANCE
(Multiple data formats for the same business context, no metadata management)
BATCH ORIENTED
(File, Scheduler, …)
Hadoop Journey: Genesis
[Architecture diagram] Data source: flat files. System: HDFS, MapReduce, YARN, Ambari on CentOS. Processing: Hive, Tez. Analytics: KNIME, WEKA.
July 2015
• Pilot phase
• 5 nodes
• Azure A1 → A4
• 100GB
• 70GB of RAM
• Team: 1 person
Hadoop Journey: Stability
[Architecture diagram] Data source: flat files. Integration: NiFi. System: HDFS, MapReduce, YARN, Ambari on CentOS, Active Directory. Processing: Hive, Tez, secured with Ranger. Analytics & restitution: KNIME, Zeppelin, Python notebook.
November 2015
• Pilot phase
• 6 nodes
• Azure A4 → D & DS13
• 1TB of data
• 336GB of RAM
• Team: 2 persons
Hadoop Journey: Production
[Architecture diagram] Data sources: flat files, web services. Integration: NiFi. System: HDFS, MapReduce, YARN, Ambari on CentOS, Active Directory. Processing: Hive, Spark, Tez, secured with Ranger. Analytics & restitution: KNIME, Zeppelin, Python notebook, BW on HANA.
March 2016
• 8 nodes
• Azure D/DS13
• 3TB of Data
• 64 cores
• 448GB Ram
• Team: 2 people
• 13 nodes
• 20TB
• 104 cores
• 728GB RAM
• 1000+ Tables
• 3 Production Systems
Hadoop Eco-system at CCEJ
[Architecture diagram] Data sources: flat files, web services, SAP ECC (via Boomi), MySQL. Integration: NiFi. System: HDFS, MapReduce, YARN, Ambari, Active Directory on CentOS. Processing: Hive, Spark, Tez, Presto, Drill, secured with Ranger. Analytics: KNIME, Zeppelin, Python notebook, Sparkling Water, TensorFlow. Restitution: BW on HANA, HTML report, AirPal.
Three roles: a data hub (past and forecast data, aggregated data for visualization), analytics, and master data (centralize, lineage, governance).
Timeline, May 2015 – Oct 2016:
• Hadoop / NiFi platform: platform POC, then flow implementation
• VM analytics: POC (VM placement), then forecast implementation
• BW report integration
• SAP integration & MDM (1, 3)
• Write-off report (2)
20TB
Vending Replenishment: The Business Case
• High number of machines: 550,000 VMs, online and offline
• SKUs per VM: 25 SKUs, hot & cold
• External factors (weather, city data, geo-location, events)
• Vending routes (visit list per truck, logistics dependence)
How to:
• Reduce the number of visits
• Optimize truck stock
• Avoid out-of-stocks
Vending replenishment forecast: The Project
The Challenge:
• Deployment in 3 months
• 1½ hours to generate the forecast
• +20% accuracy versus the previous version
• 120 steps in the program
[Daily flow] Online and offline VM data → forecast generation → arbitration (yes/no) → visit plan → picking list
Hadoop Has Delivered:
• Feeds 5GB+ of new data every day
• Processes high volumes of data in-memory (300GB+)
• Integrates data from different sources
• Generates more sophisticated forecasts than the legacy systems (14 million items)
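For illustration, a minimal PySpark sketch of the shape of such a nightly job, assuming hypothetical table names (vm.sales_daily, vm.forecast) and a naive 28-day moving average in place of the real model; the production program has roughly 120 steps.

```python
# Minimal sketch of a nightly per-VM / per-SKU forecast job in PySpark.
# Table names, columns, and the moving-average model are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = (SparkSession.builder
         .appName("vm-replenishment-forecast")
         .enableHiveSupport()
         .getOrCreate())

sales = spark.table("vm.sales_daily")  # one row per (vm_id, sku, dt)

# 28-day moving average of units sold per vending machine and SKU.
w = Window.partitionBy("vm_id", "sku").orderBy("dt").rowsBetween(-27, 0)
with_avg = sales.withColumn("avg_daily_qty", F.avg("qty_sold").over(w))

# Keep only the latest day's value as a naive demand forecast.
latest = with_avg.agg(F.max("dt").alias("dt"))
forecast = (with_avg.join(latest, "dt")
            .select("vm_id", "sku", F.col("avg_daily_qty").alias("forecast_qty")))

forecast.write.mode("overwrite").saveAsTable("vm.forecast")
```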
Staging: The Case of “write-off” report
[Flow, on Azure] Data from 7 systems plus master data → combine → Drill generates the SQL query → JSON → web server → HTML interface (verify & check) → report
Challenges:
• Data set harmonization (sales, billing, inventory)
• Data volume from the source systems
• Complex computation logic
• Unclear functional requirements
Objectives:
• Aggregate a large number of datasets: 40+ flows, 4GB of data every day
• A single view of the data, anywhere, for Finance, Supply Chain & Commercial
• Dynamic transactions vs. static Excel
• Reduce manual work to zero
Processing building blocks: comparison, aggregation, enrichment, analytics, transformation (conversion)
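As a sketch of the restitution step: Drill exposes a REST endpoint (POST /query.json) that returns rows as JSON, which the HTML interface can render directly. The host and the report table name below are assumptions.

```python
# Query Drill over its REST API and get JSON back for the HTML interface.
# Host and the report table name are illustrative assumptions.
import requests

DRILL_URL = "http://drill-host:8047/query.json"  # hypothetical Drill web server

sql = """
SELECT plant, sku, write_off_qty, write_off_value
FROM hive.`default`.`t_writeoff_report_orc_p`
WHERE dt = '20161024'
"""

resp = requests.post(DRILL_URL,
                     json={"queryType": "SQL", "query": sql},
                     timeout=300)
resp.raise_for_status()
rows = resp.json()["rows"]  # list of dicts, ready for the web front end
print(len(rows), "rows")
```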
MDM: Centralization and Dispatch
[Flow] 1. MDM creation → 2. MDM registration in the MDM repository (lineage) → 3. consistency check by the rule engine → 4. event-driven data replication to external systems via the replication engine
Challenges:
• Rule engine definition and
implementation
• MDM on Hadoop & ESB
integration
• MDM & SAP Synchronization
Objectives:
• Single MDM repository
• Centralized bridge tables & mapping tables
• Standardization of MDM across
data landscape
• Targeted distribution / replication
of MDM to external systems
Realization:
• MySQL and Hadoop synchronization (300+ tables)
• Replication engine with ESB
• MDM-Tool: Pilot with Customer
Master
• Full go-live: April 2017
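A minimal sketch of how the consistency check (step 3) might look as a small rule engine; the rules and field names are illustrative assumptions, not the production rule set.

```python
# Sketch of a rule engine for MDM consistency checks before replication.
# Field names and rules are illustrative assumptions.
from typing import Callable, Optional

Rule = Callable[[dict], Optional[str]]  # returns an error message, or None if OK

def require(field: str) -> Rule:
    return lambda rec: None if rec.get(field) else f"missing {field}"

def max_len(field: str, n: int) -> Rule:
    return lambda rec: (f"{field} longer than {n} chars"
                        if len(str(rec.get(field, ""))) > n else None)

CUSTOMER_RULES = [
    require("customer_id"),
    require("name_kanji"),
    max_len("customer_id", 10),
]

def check(record: dict, rules: list) -> list:
    return [err for rule in rules if (err := rule(record)) is not None]

record = {"customer_id": "C000123", "name_kanji": "株式会社サンプル"}
if not check(record, CUSTOMER_RULES):
    pass  # consistent: hand the record to the replication engine (step 4)
```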
Use case – SAP Integration / sales interface report
Objectives:
• Leverage the most granular data already
in Hadoop
• Leverage the processing power of
Hadoop
[Flow, on Azure] Vending sales data from Company 1, 2 and 3 (9, 4 and 7 flows; legacy-format and CCEJ-format data) plus MD & bridge data (9 flows) → combine with bridge table & master → calculate → 9 output tables
Challenges:
• Many data formats requiring complex data transformations
• Wide variety of data sources &
technologies to transfer data
• Data mapping between systems
Realization:
• Data structure in Hadoop
• Logic for one type of sales
channel implemented
• Full go-live: April 2017
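A minimal sketch of the combine step in HiveQL, issued here through PyHive: legacy-format sales rows are mapped to CCEJ codes via a bridge table. All host, table, and column names are assumptions for illustration.

```python
# Map legacy sales data to CCEJ codes through a bridge table in Hive.
# Host, schema, table, and column names are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hadoop-edge", port=10000, username="etl")
cur = conn.cursor()

cur.execute("""
INSERT OVERWRITE TABLE sales.t_sales_ccej_orc_p PARTITION (dt='20161024')
SELECT b.ccej_customer_id,
       b.ccej_sku,
       s.qty,
       s.amount
FROM sales.t_sales_legacy_txt_p s
JOIN mdm.t_customer_bridge_txt_p b
  ON s.legacy_customer_id = b.legacy_customer_id
 AND s.legacy_sku = b.legacy_sku
WHERE s.dt = '20161024'
""")
```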
Hadoop: What’s Next
Increase data velocity & Create a true Data Lake
Improve data collection, quality, profiling, and metadata & propose a catalog of curated data to end users
Toward a Data-Driven Decision Process
Develop Support & Operational Excellence
I thank CCEJ management, who had the courage to believe in an Agile approach.
Thanks to my teammate and comrade Vinay Mahadev for all the long hours we've put in together to make this project a reality.
Your turn: let's share ideas & a Coke!
Damien Contreras
Email: damien.contreras@ccej.co.jp
LinkedIn: Damien Contreras
Twitter: @dvolute
The inside of Hadoop
Integration Landscape Overview
[Diagram] Acquisition: SAP ECC (IDocs via Boomi), Oracle (JDBC), MySQL (JDBC), flat files (FTP), and other systems feed NiFi Prod (on-premise and Azure). Transformation: Hadoop Prod, with Hive and Drill over JDBC. Restitution: BW on HANA, and an HTML interface over HTTP for power users.
Acquisition → Transformation → Restitution: naming conventions
• Acquisition: daily files such as My_file_20161024.csv and My_file_20161025.csv land in the flow's folder (Myflow-data)
• Transformation: external text tables (t_my_table_txt_p, t_my_bridge_table_txt_p) in the Myflow-data database, partitioned by day (dt=20161024, dt=20161025)
• Restitution: ORC tables (t_my_report_orc_p), as sketched below
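A sketch of what the convention implies in Hive, issued via PyHive: an external text table (_txt_p suffix) over the daily folders, with each NiFi delivery registered as a dt partition and later compacted into an ORC table (_orc_p). Paths and columns are assumptions.

```python
# External text table over daily dt= partitions, compacted into ORC.
# HDFS path and column names are illustrative assumptions.
from pyhive import hive

cur = hive.Connection(host="hadoop-edge", port=10000).cursor()

cur.execute("""
CREATE EXTERNAL TABLE IF NOT EXISTS t_my_table_txt_p (col1 STRING, col2 STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/myflow-data/t_my_table_txt_p'
""")

# Register the folder the NiFi flow just dropped, e.g. .../dt=20161024/
cur.execute("ALTER TABLE t_my_table_txt_p "
            "ADD IF NOT EXISTS PARTITION (dt='20161024')")

cur.execute("""
CREATE TABLE IF NOT EXISTS t_my_report_orc_p (col1 STRING, col2 STRING)
PARTITIONED BY (dt STRING) STORED AS ORC
""")
cur.execute("""
INSERT OVERWRITE TABLE t_my_report_orc_p PARTITION (dt='20161024')
SELECT col1, col2 FROM t_my_table_txt_p WHERE dt = '20161024'
""")
```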
Guidelines around NiFi flows
• Triggers: the source system makes a web call to a NiFi listener, which launches the JDBC extraction (see the sketch below)
• Groups: flows are organized into processing groups, e.g. master data vs. transaction data, mirrored across the Dev and Prod environments (on-premise and on Azure)
• Encryption: traffic between environments uses encrypted site-to-site communication
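A minimal sketch of the trigger, seen from the source system's side: once its nightly batch finishes, it calls the NiFi HTTP listener, which starts the extraction for that flow. The endpoint path, port, payload, and CA path are assumptions (NiFi's ListenHTTP processor serves /contentListener by default).

```python
# Source system notifying NiFi that data is ready for extraction.
# URL, payload fields, and certificate path are illustrative assumptions.
import requests

resp = requests.post(
    "https://nifi-prod:9443/contentListener/trigger",
    json={"flow": "sales_daily", "business_date": "20161024"},
    verify="/etc/pki/nifi-root-ca.pem",  # the shared root CA certificate
    timeout=30,
)
resp.raise_for_status()
```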
Guidelines around NiFi flows
• Error handling per flow: each processor routes Success downstream and OnError to a handler that writes to an error log
• A retry processor runs every 5 minutes: read from the error log, re-process, update the error log, then send the data (sketched below)
• The same pattern applies to master data and transaction data flows
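A sketch of the retry pattern in plain Python, assuming the error log is a JSON-lines file; in reality this logic lives in NiFi processors, so treat this as pseudocode for the loop.

```python
# Every 5 minutes: read the error log, re-process, update the log, send data.
# The JSON-lines file and the process() body are illustrative assumptions.
import json
import time

ERROR_LOG = "error_log.jsonl"

def process(record: dict) -> None:
    ...  # re-send the data to the target system

def retry_cycle() -> None:
    try:
        with open(ERROR_LOG) as f:
            pending = [json.loads(line) for line in f]
    except FileNotFoundError:
        return  # nothing to retry
    still_failing = []
    for rec in pending:
        try:
            process(rec)                          # Re-Process
        except Exception:
            rec["retries"] = rec.get("retries", 0) + 1
            still_failing.append(rec)             # keep for the next cycle
    with open(ERROR_LOG, "w") as f:               # Update Error log
        f.writelines(json.dumps(r) + "\n" for r in still_failing)

if __name__ == "__main__":
    while True:
        retry_cycle()
        time.sleep(300)  # every 5 mins
```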
NiFi enhancement: example
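As a sketch of the behavior the customized ExecuteSQL processor implements (see speaker note #26): stream results through a cursor 10,000 rows at a time into a UTF-8, LF-terminated CSV named with the extraction timestamp. The real enhancement is a Java processor inside NiFi; the DB-API connection here is an assumption.

```python
# Stream a query result 10,000 rows at a time into a UTF-8, LF-terminated CSV
# named <yyyy-MM-dd_hh:mm:ss.SSS>.csv. The DB-API connection is an assumption;
# the real enhancement is a Java processor inside NiFi.
import csv
from datetime import datetime

def extract_to_csv(conn, sql: str) -> str:
    filename = datetime.now().strftime("%Y-%m-%d_%H:%M:%S.%f")[:-3] + ".csv"
    cur = conn.cursor()
    cur.execute(sql)
    with open(filename, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        while True:
            batch = cur.fetchmany(10000)  # stream instead of loading everything
            if not batch:
                break
            writer.writerows(batch)
    return filename
```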
Technical Architecture
[Diagram] On-premise Prod and Dev environments (RDBMS, FTP server, SAP ECC) connect through NiFi to Azure. In Azure, the Hadoop production environment runs Node 0 through Node 11 plus AD and its own NiFi; the Hadoop dev environment runs Node 0 through Node 3 with its own NiFi.
Editor's Notes

• #4: Coca-Cola East Japan is responsible for producing and distributing all the products you love, from coffee to carbonated drinks like Fanta or Sprite, plus vending machines. 5 different bottlers, and soon 6.
• #6: Data in silos: 3 layers of technology: the mainframe and its ecosystem; highly customized SAP instances per bottler; a single SAP instance for CCEJ. Redundant capabilities at each level; duplicated data, no single source of truth; replication going both ways. Point to point: 2 endpoints, one or several interfaces; multiple ETL/ESB tools and servers that stage data. No governance: data structures based on system requirements and vendor convenience (e.g. 2 different vendors working on the same project → 2 IDs for the same object); requirements → solution developed by the vendors → knowledge of the meaning of the data kept on the vendor side; master data managed in each system → no completeness of the master data, no single source of truth. Batch oriented: data transferred through flat / fixed-width files orchestrated by schedulers (once a day); little to no event-driven flows; multiple intermediate systems that receive the data, load it, then send it on to another system.
• #7: Started humbly: focused on analytics around vending machines & data exploration. HDP on Azure with CentOS VMs (a first in Japan); HDInsight, without Spark, was difficult to install. Data manually uploaded to the cluster through FTP.
• #8: Focused on analytics around vending machine data exploration. Teradata: datamart. NiFi to integrate between our on-premise environment and the cloud environment, leveraging the site-to-site connection and using certificates.
• #9: First release and production use of Hadoop. A Spark program runs every night to generate the forecast for all vending machines. Integration with NiFi on multiple systems to retrieve transactional data and master data. First attempt to integrate with BW using JDBC (not Vora, as BW was not in the same data center and we had no requirement to push data back from BW to Hadoop). NiFi: ExecuteSQL modified to leverage setFetchSize to stream data (10,000 records); re-encoding to UTF-8 in the processor; a fixed-width parser. Governance around data: partition data based on delta extraction; naming convention to easily identify data source systems; Atlas and AD for access rights. NiFi: guidelines around processing groups, e.g. master data / transactional data. NiFi: trigger-oriented with web services; error handling: bubbling errors to the top, retry logic to ensure extraction.
• #10: Scheduling & orchestration can be implemented through NiFi. Most of our data is linked to a date, therefore we partition tables, keeping their tabular structure. Hive was the de-facto solution; many data transformations can be implemented directly in HQL. Presto to integrate heterogeneous systems, with good performance when querying data. Drill to easily format JSON and rapidly integrate with web interfaces. NiFi (250 flows) & Boomi (700 flows) integration to have full access to SAP ECC functional modules & IDocs. Program profiling: Fire. Defined the use cases: analytics; data hub for aggregating & processing data; central repository for master data. 20TB in HDFS, pulling 20GB per day; 250 daily extractions in prod in NiFi (ExecuteSQL), others in dev for CokeOne. Per-system daily pull (GB): DM 0.25, HHT 3.57, CMOS 0.66, SC 4.14, DME 10.45, MM 0.26. 104 cores, 9 datanodes, 486GB of RAM. Hive tables: Prod 402 (BI 148, default 123, DemandForecast 42, dm 7, mdm 26, sc 40, vm 16); Dev 604 (bi 36, fusion 18, mdm 306, rtr 7, vm 237).
  • #11: NiFi in production in October
• #12: CCEJ intro. Hadoop: not only a data lake but an integration platform and a business enabler: 1 MDM, 2 write-off, 3 ERP integration. Landscape.
• #13: 1: High number of vending machines, online and offline. 2: Line-ups of 50 products on average, 25 products per vending machine; hot and cold; product seasonality; new products every 3 months. 3: External factors: vending inside or outside (in offices / stations / sports centers); ease of access (on top of a mountain, on top of a building, in an underground station); wide variety of customers: regulars, events such as baseball. 4: Vending routes: limited number of trucks & fillers → go replenish the 20 VMs that really need it, and decide what product to put on the truck.
• #14: Processing 14M items (VMs x products) every day. 5GB: today's sales information, stock levels, settlement information, … 300GB as we look into the past.
  • #16: Producing that report took 6 to 8 hours and had errors
• #18: Instead of duplicating data (silos), we can reuse the same data to generate those reports. IDocs, JDBC, flat files, fixed-width files. Complex transformations, as we are combining legacy data with the new numbering and definitions of master data to aggregate data during the rollout period of the new systems.
• #24: Whenever possible we implement extractions triggered by a web service call from the source system. We group extractions by flow under logical "Process Groups". We aim for logical subdivisions as well, to help readability and maintainability (e.g. extractors for master data / transactional data, or monthly / daily extractors). Site-to-site communication is encrypted using a single root CA certificate shared across all keystores.
• #25: We always implement failover that retries a processing step, and we always implement error management. In each processing group we implement an output port called "OnError", linked with the parent "OnError" output port; this ensures notifications propagate up to the root canvas. The top "OnError" is a remote output port available on the NiFi Azure side that implements a common handler to send errors. Each processor is linked with that output port; at the beginning of each flow branch its own set of parameters is defined (Service, Process, Priority), which the error-handling process uses to send notifications to administrators.
• #26: Key custom features: leverage cursors on tables to stream batches of data (10,000 rows at a time); comma-separated CSV output, by default named <yyyy-MM-dd_hh:mm:ss.SSS>.csv with LF as the line separator; UTF-8 file encoding by default; option to save the extracted file to a folder location (integrated PutFile processor functionality); a boolean for the user to choose to keep the source file or remove it from the folder location; UPDATE SQL functionality; a boolean option to output the CSV in Windows format (CRLF line separator) or Unix format (LF). Also a fixed-width-to-CSV processor.
• #27: Dev environment accessible to anyone (after requesting an account).