Introduction to Data Engineer and Data Pipeline at Credit OK

Presented by Kriangkrai Chaonithi @spicydog
14/11/2019 | KMUTT | Applied Computer Science
Introduction to
Data Engineer
and
Data Pipeline
at

Hello! My name is Gap
Education
● BS Applied Computer Science (KMUTT)
● MS Computer Engineering (KMUTT)
Work Experience
● Former Android, iOS & PHP Developer at Longdo.COM
● Former R&D Manager at Insightera
● CTO & co-founder at Credit OK
Fields of Interests
● Software Engineering
● Cloud Architecture & Distributed Computing
● Computer Security
● Machine Learning & NLP https://spicydog.me

Agenda
● What is Big Data?
○ Why data is big?
○ Structured vs Unstructured Data
● Data Engineering
○ Data technology careers
○ What do data engineers do?
○ Skills for data engineers
○ Knowledages & technologies for data engineer
● What is Data Pipeline?
○ ETL - Extract, Transform, Load
○ Batch vs streaming
● Data Pipeline at Credit OK
○ Introduction to GCP technologies
○ Problem and solution on data pipeline
○ Data pipeline architecture in details
● Summary

https://medium.com/@smartrac/the-deep-web-the-dark-web-and-simple-things-2e601ec980ac

What is Big Data?
https://unsplash.com/photos/LqKhnDzSF-8

Why data is big?
● Faster internet better infrastructure
● Business digitization
● Social network
● IoT & embedded systems
● Automated software
● Etc.
https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/KiH2-tdGQRY

Structured vs. Unstructured Data
https://unsplash.com/photos/QBpZGqEMsKg
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4

Data Engineering
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4

Data Technology Careers
https://unsplash.com/photos/QBpZGqEMsKghttps://www.springboard.com/blog/data-science-career-paths-different-roles-industry/

What do Data Engineers do?
https://medium.com/@info_46914/data-engineer-บุคคลที่องคกรไมควรมองขาม-e863b37af79

Skills for Data Engineers
● Data Architecture
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
https://unsplash.com/photos/QBpZGqEMsKghttps://unsplash.com/photos/Z9AU36chmQI

○ File Storage Architecture
■ Local Storage
■ Network Attached Storage
■ Object Storage
○ Databases Architecture
○ Data Warehouse

■ SQL (RDBMS)
■ NoSQL
○ Data Warehouse

■ SQL (RDBMS)
■ NoSQL
● Document-oriented Database
● Columnar Database
● Graph Database
● Key-value Database
○ Data Warehouse
○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Crontab (Task Scheduler)

○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Task Scheduler (Crontab)

What is Data Pipeline?
https://unsplash.com/photos/9AxFJaNySB8

ETL - Extract, Transform, Load
https://unsplash.com/photos/QBpZGqEMsKghttps://www.astera.com/type/blog/etl-process-and-steps/

Batch vs Streaming Processing
https://unsplash.com/photos/QBpZGqEMsKg
Batch Streaming
Multiple record processing Per record processing
Scheduled / manual Real-time
Longer processing time Shorter processing time
Large window data processing Small window data processing

Credit Scoring Platform on Big Data Analytics
creditok.co

Introduction to Data Engineer and Data Pipeline at Credit OK

GCP Storages & Databases
Non-serverless
Serverless

GCP Data Analytics
Pipeline Analytics Visualization

Why do we use serverless on big data?
● No server maintenance
● Scalable & high performance
● Easier to optimize
● Only pay per use

Requirements
● Have a HUGE data warehouse for batch processing
● Our customer have on-premise data on >400 sites
● Data ingestor app is needed to install to every site
● Data ingestor app must be able to run on
● Data ingestor app must be super robust and easy to install
● Must work automatically everyday, task scheduler

When >400 sites upload large ﬁles
to your server at the same time..
This is kinna DDoS!

We use cloud functions
● Auto scale
● Almost zero maintenance!
● But only accept <10 MB body size
For the larger ﬁles,
we use
Google Cloud Run
Google Kubernetes Engine
Google Compute Engine

Raw Data
Source
Raw Data
Source
Data Pipeline Architecture

Raw Data
Source
Raw Data
Source
GCF - Load zipped file data via HTTPS protocol
GCF - Save zipped file data to GCS INPUT bucket

Raw Data
Source
Raw Data
Source
GCS - Auto trigger GCF when zipped file is put to the INPUT bucket
GCF - (data cleansing) Process text encoding (tis602, utf8)
GCF - (data cleansing) Check and clean CSV format, make it in the best possible one
GCF - Save output CSV to GCD the OUTPUT bucket
GCF - Log all the results for file ingestion reports

Raw Data
Source
Raw Data
Source
Cron - Auto run every some period to load CSV data from OUTPUT bucket
GBQ - Load data from OUTPUT bucket into RAW STAGING table in string format

Raw Data
Source
Raw Data
Source
GBQ - Cron to run data cleansing SQL from RAW STAGING table to CLEANED STAGING table
GBQ - Cron to run append data with SQL from CLEANED STAGING table to MAIN table
GBQ - Cron to run data processing SQL task from MAIN table to another tables til ready to FINAL tables

Raw Data
Source
Raw Data
Source
Frequently Used Data
Lumen - Cron to dump FINAL tables data to real-time database on frequently used data
Laravel - Load data from real-time database of Lumen via internal REST API
Vue - Use data processed from Laravel
Rarely Used Data
Lumen - Load data from BQ directly
Laravel - Load and process data from Lumen
Vue - Use data processed from Laravel

Summary
● Big data is possible because of technology advancement
● Store and process big data requires special technology and knowledge
● Data engineers are the geeks who work on processing data for the team
● Data pipeline is all about automation about data processing process
● Understanding about data going to process is crucial
● Don’t forget to log data pipeline to monitoring system
● Data engineer is in high demand in Thailand, we have dirty data, we have data scientist, we have
no one to process data => data scientist do everything! THAT’S WRONG!
Data Engineer is in need

Time is short, let’s utilize the networks.
Feel free to connect with me via spicydog.me

Introduction to Data Engineer and Data Pipeline at Credit OK

Recommended

More Related Content

What's hot (20)

Similar to Introduction to Data Engineer and Data Pipeline at Credit OK (20)

More from Kriangkrai Chaonithi (6)

Recently uploaded (20)

Introduction to Data Engineer and Data Pipeline at Credit OK