Presented by Kriangkrai Chaonithi @spicydog
14/11/2019 | KMUTT | Applied Computer Science
Introduction to Data Engineer and Data Pipeline at Credit OK
Hello! My name is Gap
Education
● BS Applied Computer Science (KMUTT)
● MS Computer Engineering (KMUTT)
Work Experience
● Former Android, iOS & PHP Developer at Longdo.COM
● Former R&D Manager at Insightera
● CTO & co-founder at Credit OK
Fields of Interest
● Software Engineering
● Cloud Architecture & Distributed Computing
● Computer Security
● Machine Learning & NLP
https://spicydog.me
Agenda
● What is Big Data?
○ Why is data big?
○ Structured vs Unstructured Data
● Data Engineering
○ Data technology careers
○ What do data engineers do?
○ Skills for data engineers
○ Knowledge & technologies for data engineers
● What is a Data Pipeline?
○ ETL - Extract, Transform, Load
○ Batch vs streaming
● Data Pipeline at Credit OK
○ Introduction to GCP technologies
○ Problems and solutions in the data pipeline
○ Data pipeline architecture in detail
● Summary
https://medium.com/@smartrac/the-deep-web-the-dark-web-and-simple-things-2e601ec980ac
What is Big Data?
https://unsplash.com/photos/LqKhnDzSF-8
Why is data big?
● Faster internet and better infrastructure
● Business digitization
● Social networks
● IoT & embedded systems
● Automated software
● Etc.
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/KiH2-tdGQRY
Structured vs. Unstructured Data
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
Data Engineering
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
Data Technology Careers
https://www.springboard.com/blog/data-science-career-paths-different-roles-industry/
What do Data Engineers do?
https://medium.com/@info_46914/data-engineer-บุคคลที่องคกรไมควรมองขาม-e863b37af79
Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
■ Local Storage
■ Network Attached Storage
■ Object Storage
○ Database Architecture
■ SQL (RDBMS)
■ NoSQL: document-oriented, columnar, graph, and key-value databases
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming for Data Manipulation
○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Task Scheduler (Crontab)
https://unsplash.com/photos/Z9AU36chmQI
What is a Data Pipeline?
https://unsplash.com/photos/9AxFJaNySB8
ETL - Extract, Transform, Load
https://www.astera.com/type/blog/etl-process-and-steps/
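The three ETL stages can be illustrated with a minimal sketch. It assumes a hypothetical in-memory CSV source and a plain Python list as the load target; neither comes from the slides, and a real pipeline would read from files or APIs and write to a warehouse.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = "customer,amount\nalice, 120 \nbob,not-a-number\ncarol,75\n"


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, convert types, and drop unparsable rows."""
    cleaned = []
    for row in rows:
        try:
            amount = int(row["amount"].strip())
        except ValueError:
            continue  # skip rows whose amount is not an integer
        cleaned.append({"customer": row["customer"].strip(), "amount": amount})
    return cleaned


def load(rows, target):
    """Load: append the cleaned rows to the destination (a list here)."""
    target.extend(rows)
    return len(rows)


warehouse = []  # stand-in for a real data warehouse table
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```

Keeping each stage as its own function makes it easy to test the transform step in isolation and to swap the source or target later.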
Batch vs Streaming Processing
Batch | Streaming
Multiple-record processing | Per-record processing
Scheduled / manual | Real-time
Longer processing time | Shorter processing time
Large-window data processing | Small-window data processing
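The contrast can be sketched in a few lines. This is an illustration only, with hypothetical event values: the batch function sees the whole window at once, while the streaming function emits an updated result after every record.

```python
from typing import Iterable, Iterator, List


def process_batch(records: List[int]) -> int:
    """Batch: the whole collected window is processed in one scheduled run."""
    return sum(records)


def process_stream(records: Iterable[int]) -> Iterator[int]:
    """Streaming: each record updates the result as soon as it arrives."""
    running_total = 0
    for record in records:
        running_total += record
        yield running_total


events = [5, 3, 7]  # hypothetical event values
batch_result = process_batch(events)
stream_results = list(process_stream(events))
print(batch_result, stream_results)
```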
Credit Scoring Platform on Big Data Analytics
creditok.co
Introduction to Data Engineer and Data Pipeline at Credit OK
GCP Storage & Databases
Non-serverless
Serverless
GCP Data Analytics
Pipeline | Analytics | Visualization
Why do we use serverless for big data?
● No server maintenance
● Scalable & high performance
● Easier to optimize
● Pay only for what you use
Requirements
● Have a HUGE data warehouse for batch processing
● Our customers have on-premise data at >400 sites
● A data ingestor app must be installed at every site
● Data ingestor app must be able to run on
● Data ingestor app must be super robust and easy to install
● Must run automatically every day via a task scheduler
When >400 sites upload large files to your server at the same time...
This is kind of a DDoS!
We use Cloud Functions
● Auto scale
● Almost zero maintenance!
● But they only accept a request body <10 MB
For larger files, we use
● Google Cloud Run
● Google Kubernetes Engine
● Google Compute Engine
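One way an ingestor app could act on this size limit is to pick its upload endpoint by file size. The sketch below is an assumption for illustration: the slides do not say how routing is done, and both endpoint URLs are hypothetical.

```python
MAX_FUNCTION_BODY_BYTES = 10 * 1024 * 1024  # the <10 MB limit mentioned above

# Hypothetical endpoints; the real URLs are not given in the slides.
SMALL_UPLOAD_URL = "https://example.com/upload-small"  # e.g. a Cloud Function
LARGE_UPLOAD_URL = "https://example.com/upload-large"  # e.g. a Cloud Run service


def choose_upload_url(file_size_bytes: int) -> str:
    """Send small files to the function endpoint and everything else to the larger service."""
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    if file_size_bytes < MAX_FUNCTION_BODY_BYTES:
        return SMALL_UPLOAD_URL
    return LARGE_UPLOAD_URL


print(choose_upload_url(2 * 1024 * 1024))
print(choose_upload_url(50 * 1024 * 1024))
```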
Data Pipeline Architecture
Raw Data Sources
GCF (Google Cloud Functions) - Receive zipped data files over HTTPS
GCF - Save the zipped files to the GCS (Google Cloud Storage) INPUT bucket
GCS - Automatically triggers GCF when a zipped file is put into the INPUT bucket
GCF - (data cleansing) Normalize text encoding (TIS-620, UTF-8)
GCF - (data cleansing) Check and clean the CSV format, repairing it as far as possible
GCF - Save the output CSV to the GCS OUTPUT bucket
GCF - Log all results for file ingestion reports
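The two cleansing steps can be sketched with the Python standard library. This is not the code used at Credit OK; it assumes a UTF-8-first decode with a TIS-620 fallback, and it assumes the header row defines the expected column count.

```python
import csv
import io


def decode_text(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to TIS-620 (a legacy Thai encoding)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("tis_620")


def clean_csv(raw: bytes) -> str:
    """Decode the input and rewrite it as CSV text with a fixed column count."""
    reader = csv.reader(io.StringIO(decode_text(raw), newline=""))
    # Drop rows that are empty or contain only whitespace.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return ""
    width = len(rows[0])  # assumption: the header row defines the column count
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        cells = [cell.strip() for cell in row]
        # Pad short rows and truncate long ones so every row matches the header.
        writer.writerow((cells + [""] * width)[:width])
    return out.getvalue()


# Hypothetical TIS-620 input with stray spaces, a blank line, and a short row.
sample = "ชื่อ,ยอด\nสมชาย, 100 \n\nสมหญิง\n".encode("tis_620")
print(clean_csv(sample))
```

The returned string can then be encoded as UTF-8 before it is written to the OUTPUT bucket.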
Cron - Runs periodically to load CSV data from the OUTPUT bucket
GBQ (Google BigQuery) - Load data from the OUTPUT bucket into the RAW STAGING table, with every column as a string
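Loading every column as a string postpones type parsing to the later SQL cleansing steps, so a malformed value cannot fail the load. The sketch below only builds a load configuration as a plain dictionary, using field names from BigQuery's REST load job configuration as I understand them; it does not call any API, and the bucket, project, dataset, and table names are hypothetical.

```python
import csv
import io
import json


def all_string_schema(header_line: str) -> list:
    """Declare every CSV column as a nullable STRING."""
    columns = next(csv.reader(io.StringIO(header_line)))
    return [{"name": c.strip(), "type": "STRING", "mode": "NULLABLE"} for c in columns]


def build_load_config(source_uri, project, dataset, table, header_line):
    """Build a load job configuration that appends a CSV file to a staging table."""
    return {
        "load": {
            "sourceUris": [source_uri],
            "sourceFormat": "CSV",
            "skipLeadingRows": 1,  # the header row is described by the schema instead
            "writeDisposition": "WRITE_APPEND",
            "destinationTable": {
                "projectId": project,
                "datasetId": dataset,
                "tableId": table,
            },
            "schema": {"fields": all_string_schema(header_line)},
        }
    }


# All names below are hypothetical placeholders.
config = build_load_config(
    "gs://example-output-bucket/site-001.csv",
    "example-project",
    "staging",
    "raw_staging",
    "id, name,amount",
)
print(json.dumps(config, indent=2))
```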
GBQ - Cron runs data cleansing SQL from the RAW STAGING table into the CLEANED STAGING table
GBQ - Cron runs SQL that appends data from the CLEANED STAGING table to the MAIN table
GBQ - Cron runs data processing SQL from the MAIN table through intermediate tables until the FINAL tables are ready
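The staged SQL flow can be tried locally with SQLite as a stand-in for BigQuery. Everything here is an assumption for illustration: the table and column names are hypothetical, and SQLite's SQL dialect differs from BigQuery's, so only the staging pattern carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for BigQuery here
conn.executescript(
    """
    CREATE TABLE raw_staging (customer TEXT, amount TEXT);
    CREATE TABLE cleaned_staging (customer TEXT, amount INTEGER);
    CREATE TABLE main (customer TEXT, amount INTEGER);
    CREATE TABLE final_totals (customer TEXT, total INTEGER);

    INSERT INTO raw_staging VALUES (' alice ', '120'), ('bob', 'n/a'), ('alice', ' 30');
    """
)

# Step 1: RAW STAGING -> CLEANED STAGING (trim text, keep only all-digit amounts).
conn.execute(
    """
    INSERT INTO cleaned_staging
    SELECT TRIM(customer), CAST(TRIM(amount) AS INTEGER)
    FROM raw_staging
    WHERE TRIM(amount) <> '' AND TRIM(amount) NOT GLOB '*[^0-9]*'
    """
)

# Step 2: CLEANED STAGING -> MAIN (append the cleaned rows).
conn.execute("INSERT INTO main SELECT customer, amount FROM cleaned_staging")

# Step 3: MAIN -> FINAL (aggregate into a table that is ready to serve).
conn.execute(
    "INSERT INTO final_totals SELECT customer, SUM(amount) FROM main GROUP BY customer"
)

final_rows = conn.execute(
    "SELECT customer, total FROM final_totals ORDER BY customer"
).fetchall()
print(final_rows)
```

Keeping the raw strings in their own table means a cleansing rule can be changed and re-run without re-ingesting the source files.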
Frequently Used Data
Lumen - Cron dumps frequently used data from the FINAL tables into a real-time database
Laravel - Loads data from Lumen's real-time database via an internal REST API
Vue - Uses the data processed by Laravel
Rarely Used Data
Lumen - Loads data from BigQuery directly
Laravel - Loads and processes data from Lumen
Vue - Uses the data processed by Laravel
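The two serving paths can be pictured as one lookup with a fallback. This is a language-neutral sketch in Python rather than the Lumen (PHP) service described above; the store contents, the key names, and the `fake_warehouse_query` helper are all hypothetical.

```python
from typing import Callable, Dict


def make_reader(realtime_store: Dict[str, dict], query_warehouse: Callable[[str], dict]):
    """Return a lookup that prefers the real-time store and falls back to the warehouse."""

    def read(key: str) -> dict:
        if key in realtime_store:  # frequently used data, refreshed by a cron dump
            return realtime_store[key]
        return query_warehouse(key)  # rarely used data, queried on demand

    return read


warehouse_calls = []


def fake_warehouse_query(key: str) -> dict:
    """Hypothetical stand-in for a direct warehouse query."""
    warehouse_calls.append(key)
    return {"key": key, "source": "warehouse"}


read = make_reader({"score:123": {"key": "score:123", "source": "realtime"}}, fake_warehouse_query)
print(read("score:123"))
print(read("history:123"))
```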
Summary
● Big data is possible because of technological advancement
● Storing and processing big data requires specialized technology and knowledge
● Data engineers are the geeks who work on processing data for the team
● A data pipeline is all about automating the data processing workflow
● Understanding the data you are going to process is crucial
● Don't forget to log the data pipeline to a monitoring system
● Data engineers are in high demand in Thailand: we have dirty data and we have data scientists, but no one to process the data, so data scientists do everything. THAT'S WRONG!
Data engineers are needed
Question & Answer
Time is short, so let's make use of our networks.
Feel free to connect with me via spicydog.me
How analogue intelligence complements AI
Paul Rowe
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 

Introduction to Data Engineer and Data Pipeline at Credit OK

  • 1. Presented by Kriangkrai Chaonithi @spicydog 14/11/2019 | KMUTT | Applied Computer Science Introduction to Data Engineer and Data Pipeline at
  • 2. Hello! My name is Gap Education ● BS Applied Computer Science (KMUTT) ● MS Computer Engineering (KMUTT) Work Experience ● Former Android, iOS & PHP Developer at Longdo.COM ● Former R&D Manager at Insightera ● CTO & co-founder at Credit OK Fields of Interest ● Software Engineering ● Cloud Architecture & Distributed Computing ● Computer Security ● Machine Learning & NLP https://spicydog.me
  • 3. Agenda ● What is Big Data? ○ Why is data big? ○ Structured vs Unstructured Data ● Data Engineering ○ Data technology careers ○ What do data engineers do? ○ Skills for data engineers ○ Knowledge & technologies for data engineers ● What is Data Pipeline? ○ ETL - Extract, Transform, Load ○ Batch vs streaming ● Data Pipeline at Credit OK ○ Introduction to GCP technologies ○ Problem and solution on data pipeline ○ Data pipeline architecture in detail ● Summary
  • 5. What is Big Data? https://unsplash.com/photos/LqKhnDzSF-8
  • 6. Why is data big? ● Faster internet, better infrastructure ● Business digitization ● Social networks ● IoT & embedded systems ● Automated software ● Etc. https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/KiH2-tdGQRY
  • 7. Structured vs. Unstructured Data https://unsplash.com/photos/QBpZGqEMsKg https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
  • 10. What do Data Engineers do? https://medium.com/@info_46914/data-engineer-บุคคลที่องคกรไมควรมองขาม-e863b37af79
  • 11. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/Z9AU36chmQI
  • 12. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ■ Local Storage ■ Network Attached Storage ■ Object Storage ○ Databases Architecture ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/Z9AU36chmQI
  • 13. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ○ Databases Architecture ■ SQL (RDBMS) ■ NoSQL ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/Z9AU36chmQI
  • 14. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ○ Databases Architecture ■ SQL (RDBMS) ■ NoSQL ● Document-oriented Database ● Columnar Database ● Graph Database ● Key-value Database ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation ○ Data Ingestion ○ Data Cleaning ○ Data Manipulation & Data Pipeline ○ Crontab (Task Scheduler) https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/Z9AU36chmQI
  • 15. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ○ Databases Architecture ■ SQL (RDBMS) ■ NoSQL ● Document-oriented Database ● Columnar Database ● Graph Database ● Key-value Database ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation ○ Data Ingestion ○ Data Cleaning ○ Data Manipulation & Data Pipeline ○ Crontab (Task Scheduler) https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/Z9AU36chmQI
  • 16. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation
  • 17. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation ○ Data Ingestion ○ Data Cleaning ○ Data Manipulation & Data Pipeline ○ Task Scheduler (Crontab) https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/Z9AU36chmQI
  • 18. What is Data Pipeline? https://unsplash.com/photos/9AxFJaNySB8
  • 19. ETL - Extract, Transform, Load https://unsplash.com/photos/QBpZGqEMsKghttps://www.astera.com/type/blog/etl-process-and-steps/
  • 20. Batch vs Streaming Processing https://unsplash.com/photos/QBpZGqEMsKg ● Batch: multiple-record processing; scheduled / manual; longer processing time; large-window data processing ● Streaming: per-record processing; real-time; shorter processing time; small-window data processing
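To make the batch vs streaming contrast concrete, here is a minimal Python sketch that is not taken from the talk. The same records are totalled once as a whole batch and once record by record. The function names and the sample values are illustrative assumptions.

```python
from typing import Iterable, Iterator, List


def process_batch(records: List[int]) -> int:
    """Batch: wait until the whole window is collected, then process it in one run."""
    return sum(records)


def process_stream(records: Iterable[int]) -> Iterator[int]:
    """Streaming: update the result as soon as each record arrives."""
    running_total = 0
    for record in records:
        running_total += record
        yield running_total


# Hypothetical sample data, for illustration only.
daily_amounts = [120, 80, 200]

batch_total = process_batch(daily_amounts)           # one result for the whole window
stream_totals = list(process_stream(daily_amounts))  # one result per record
```

The batch function produces a single answer after all the data is available, while the streaming generator emits an up-to-date answer after every record, which is the trade-off between window size and latency listed on the slide.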
  • 21. Credit Scoring Platform on Big Data Analytics creditok.co
  • 23. GCP Storages & Databases Non-serverless Serverless
  • 24. GCP Data Analytics Pipeline Analytics Visualization
  • 26. Why do we use serverless on big data? ● No server maintenance ● Scalable & high performance ● Easier to optimize ● Only pay per use
  • 27. Requirements ● Have a HUGE data warehouse for batch processing ● Our customers have on-premise data at >400 sites ● A data ingestor app must be installed at every site ● Data ingestor app must be able to run on ● Data ingestor app must be super robust and easy to install ● Must run automatically every day via a task scheduler
  • 28. When >400 sites upload large files to your server at the same time.. This is kind of a DDoS!
  • 29. We use cloud functions ● Auto scale ● Almost zero maintenance! ● But they only accept a <10 MB body size. For larger files, we use Google Cloud Run, Google Kubernetes Engine, Google Compute Engine
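One way an ingestor app could respect the body-size limit mentioned on this slide is to pick the upload endpoint by file size. The sketch below is only an assumption about how such routing might look: both URLs are hypothetical placeholders, the talk does not say how the limit is counted (10 MB is taken here as 10 * 1024 * 1024 bytes), and it does not say which of the listed services receives the large files.

```python
# Limit stated on the slide: cloud functions only accept bodies under 10 MB.
# Interpreting "10 MB" as 10 * 1024 * 1024 bytes is an assumption.
MAX_FUNCTION_BODY_BYTES = 10 * 1024 * 1024

# Hypothetical endpoints; the real URLs are not given in the talk.
SMALL_UPLOAD_URL = "https://example.com/upload-small"
LARGE_UPLOAD_URL = "https://example.com/upload-large"


def choose_upload_url(file_size_bytes: int) -> str:
    """Pick the upload endpoint for a zipped file based on its size."""
    if file_size_bytes < MAX_FUNCTION_BODY_BYTES:
        return SMALL_UPLOAD_URL
    return LARGE_UPLOAD_URL
```

Keeping the decision in one small function means the threshold can be changed in a single place if the platform limit changes.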
  • 31. Data Pipeline Architecture (diagram labels: Raw Data Source, Raw Data Source)
  • 32. ● GCF - Load zipped file data via HTTPS protocol ● GCF - Save zipped file data to GCS INPUT bucket
  • 33. ● GCS - Auto trigger GCF when a zipped file is put into the INPUT bucket ● GCF - (data cleansing) Process text encoding (tis620, utf8) ● GCF - (data cleansing) Check and clean the CSV format, making it as well-formed as possible ● GCF - Save output CSV to the GCS OUTPUT bucket ● GCF - Log all the results for file ingestion reports
  • 34. ● Cron - Run automatically at a fixed interval to load CSV data from the OUTPUT bucket ● GBQ - Load data from the OUTPUT bucket into the RAW STAGING table in string format
  • 35. ● GBQ - Cron to run data cleansing SQL from the RAW STAGING table to the CLEANED STAGING table ● GBQ - Cron to run SQL that appends data from the CLEANED STAGING table to the MAIN table ● GBQ - Cron to run data processing SQL tasks from the MAIN table to other tables until the FINAL tables are ready
  • 36. Frequently Used Data ● Lumen - Cron to dump frequently used FINAL table data to a real-time database ● Laravel - Load data from Lumen's real-time database via internal REST API ● Vue - Use data processed by Laravel Rarely Used Data ● Lumen - Load data from BQ directly ● Laravel - Load and process data from Lumen ● Vue - Use data processed by Laravel
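The two data cleansing steps on this slide (text encoding, then CSV format) could look roughly like the Python sketch below. It is an illustration under assumptions, not the code used at Credit OK: the function names, the "try UTF-8 first, then fall back to TIS-620" heuristic, and the fixed-width padding rule are all hypothetical choices.

```python
import csv
import io
from typing import List


def decode_thai_text(raw: bytes) -> str:
    """Decode as UTF-8 first; fall back to TIS-620 for legacy Thai files.

    This is a heuristic: a few TIS-620 byte sequences are also valid UTF-8.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("tis-620")


def clean_csv(text: str, expected_columns: int) -> str:
    """Trim whitespace, drop blank rows, and pad or truncate rows to a fixed width."""
    cleaned_rows: List[List[str]] = []
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are empty after trimming
        cells = (cells + [""] * expected_columns)[:expected_columns]
        cleaned_rows.append(cells)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(cleaned_rows)
    return out.getvalue()
```

Normalizing every file to UTF-8 and a fixed column count before it reaches the OUTPUT bucket keeps the later load step simple, since the warehouse then sees one encoding and one shape per file type.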
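The staging flow on this slide (RAW STAGING as strings, then CLEANED STAGING, then append to MAIN) can be sketched with plain SQL. The example below uses Python's built-in `sqlite3` purely as a stand-in for BigQuery so that it runs anywhere; the table columns, the sample rows, and the crude "starts with a digit" check are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for BigQuery, illustration only

conn.executescript("""
    -- RAW STAGING: every column is loaded as a string.
    CREATE TABLE raw_staging (customer_id TEXT, amount TEXT);
    CREATE TABLE cleaned_staging (customer_id TEXT, amount REAL);
    CREATE TABLE main (customer_id TEXT, amount REAL);
""")

# Hypothetical rows as they might arrive from the OUTPUT bucket.
conn.executemany(
    "INSERT INTO raw_staging VALUES (?, ?)",
    [(" C001 ", "100.50"), ("C002", "not-a-number"), ("", "20")],
)

# Step 1: cleansing SQL, RAW STAGING -> CLEANED STAGING.
# Rows with an empty id or an amount that does not start with a digit are dropped.
conn.execute("""
    INSERT INTO cleaned_staging
    SELECT TRIM(customer_id), CAST(amount AS REAL)
    FROM raw_staging
    WHERE TRIM(customer_id) <> ''
      AND amount GLOB '[0-9]*'
""")

# Step 2: append SQL, CLEANED STAGING -> MAIN.
conn.execute("INSERT INTO main SELECT * FROM cleaned_staging")

main_rows = conn.execute("SELECT customer_id, amount FROM main").fetchall()
```

Loading everything as strings first means a malformed value never makes the load itself fail; bad rows are filtered or converted later in SQL, where the rule is visible and easy to change. In BigQuery, `SAFE_CAST` would likely be a more natural fit for the conversion step than the digit check used here.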
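The split between frequently and rarely used data amounts to a simple read-routing rule: serve a table from the fast store when a copy has been dumped there, otherwise query the warehouse directly. The Python sketch below illustrates that idea only; the class, the in-memory dictionary standing in for the real-time database, and the fake warehouse function are hypothetical and do not reflect the actual Lumen or Laravel code.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ReportReader:
    """Serve frequently used tables from a fast store, the rest from the warehouse."""

    def __init__(
        self,
        fast_store: Dict[str, List[Row]],
        query_warehouse: Callable[[str], List[Row]],
    ) -> None:
        self.fast_store = fast_store
        self.query_warehouse = query_warehouse

    def read(self, table: str) -> List[Row]:
        if table in self.fast_store:
            return self.fast_store[table]   # frequently used: pre-dumped copy
        return self.query_warehouse(table)  # rarely used: query directly


# Hypothetical usage with a fake warehouse that records which tables it was asked for.
calls: List[str] = []


def fake_warehouse(table: str) -> List[Row]:
    calls.append(table)
    return [{"table": table, "source": "warehouse"}]


reader = ReportReader(
    fast_store={"daily_scores": [{"table": "daily_scores", "source": "fast"}]},
    query_warehouse=fake_warehouse,
)
frequent = reader.read("daily_scores")
rare = reader.read("yearly_audit")
```

Only the rarely used table reaches the warehouse, which keeps per-query warehouse costs and latency away from the pages users open most often.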
  • 37. Summary ● Big data is possible because of technological advancement ● Storing and processing big data requires special technology and knowledge ● Data engineers are the geeks who process data for the team ● A data pipeline is all about automating data processing ● Understanding the data to be processed is crucial ● Don’t forget to log the data pipeline to a monitoring system ● Data engineers are in high demand in Thailand: we have dirty data, we have data scientists, but we have no one to process the data => data scientists do everything! THAT’S WRONG! Data engineers are needed
  • 39. Time is short, let’s utilize the networks. Feel free to connect with me via spicydog.me