DATA SCIENTIST’S
DAILY LIFE
BRYAN YANG 2015.09
A B O U T M E
• Blog
Bryan的行銷研究及資料分析筆記 (Bryan's Marketing Research and Data Analysis Notes)
http://bryannotes.blogspot.tw
• Group
Spark.TW
A G E N D A
• Data scientist?
• Big data and data scientist
• Data scientist’s Toolbox
• Data is the biggest
Derive Knowledge from Big Data, Efficiently and Intelligently
F R O M B A C K E N D T O F R O N T E N D
https://doubleclix.wordpress.com/2012/12/15/what-or-who-is-a-data-scientist/
WHAT IS BIG DATA?
WHERE DOES THE DATA COME FROM?
• Web log data
• Machine data
• Transactional data
• Social media data
• …
https://plus.google.com/+DigitalStrategyIE
Data Scientist's Daily Life
A WEB SERVICE RECEIVES MORE THAN 50 GB OF LOG DATA PER DAY
TOTAL SPACE USED IN THE LAST THREE MONTHS: 4,500 GB
TOTAL SPACE USED IN THE LAST YEAR: 18,000 GB (17.6 TB)
• Data Storage / Backup
• 2 TB per HDD
• How do you store MORE than 2 TB of data?
• $0.3 USD per gigabyte
• Pay 900 USD just for KEEPING the data, without doing anything else with it.
• Read/Write Speed
• Read: 131.6 MB/s / Write: 131.4 MB/s
• Reading just ONE day's data takes 393 s (over 6 min).
• A large number of transactions arriving at once
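The figures above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes exactly 50 GB per day, a 90-day quarter, and binary units (1 GB = 1024 MB, 1 TB = 1024 GB), none of which the slides state. Under those assumptions the read time comes out near 389 s, close to the 393 s quoted on the slide.

```python
# Back-of-the-envelope figures for the log-storage problem.
# Assumptions (not from the slides): exactly 50 GB/day, a 90-day quarter,
# 1 GB = 1024 MB and 1 TB = 1024 GB.
DAILY_GB = 50
READ_MB_PER_S = 131.6
HDD_TB = 2


def storage_gb(days, daily_gb=DAILY_GB):
    """Total space needed to keep `days` days of logs."""
    return days * daily_gb


def read_seconds(gigabytes, mb_per_s=READ_MB_PER_S):
    """Time to read `gigabytes` sequentially from a single disk."""
    return gigabytes * 1024 / mb_per_s


def disks_needed(gigabytes, hdd_tb=HDD_TB):
    """Number of whole disks needed to hold `gigabytes` (ceiling division)."""
    hdd_gb = hdd_tb * 1024
    return -(-gigabytes // hdd_gb)


print(storage_gb(90))                 # 4500 GB for three months
print(round(read_seconds(DAILY_GB)))  # 389 s to read one day of logs
print(disks_needed(18000))            # 9 disks of 2 TB for one year
```

A single disk is therefore too small for a year of logs and too slow to scan even one day quickly, which is the motivation for spreading both storage and reads across many machines.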
HADOOP AND
MAPREDUCE
H A D O O P A N D H D F S
http://www.fraudtechwire.com/f-level-guide-to-hadoop-hdfs/
– D I S T R I B U T E D A L G O R I T H M
“The world will change when data is distributed”
M A P R E D U C E
http://www.milanor.net/blog/?p=853
https://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
http://blog.agro-know.com/?p=3810
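The MapReduce idea can be sketched in plain Python, without Hadoop: a map step emits key/value pairs, a shuffle step groups values by key, and a reduce step aggregates each group. This is a single-process illustration only, not the Hadoop API. The log format ("<path> <status>") and all function names are made up for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    # Hypothetical log format: "<path> <status>"
    _path, status = line.split()
    yield status, 1


def count_reducer(_key, values):
    return sum(values)


log_lines = ["/index 200", "/cart 200", "/missing 404", "/index 200"]
counts = reduce_phase(shuffle(map_phase(log_lines, status_mapper)), count_reducer)
print(counts)  # {'200': 3, '404': 1}
```

In a real cluster the mapper runs on the nodes that hold each block of the file, and only the grouped intermediate pairs travel over the network.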
P E R F O R M A N C E O F H A D O O P ?
• Not great, but at least the job runs.
• Counting one day's 86,389,084 rows takes 39 sec.
(10 nodes, each with 64 GB RAM and 2 × 8-core E5 CPUs)
• How about 39 sec × 30 days?
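A quick calculation answers the question, under the assumption (not stated on the slide) that each day is counted one after another and the run time scales linearly with the number of days.

```python
# Figures from the slide; sequential, linear scaling is an assumption.
ROWS_PER_DAY = 86_389_084
SECONDS_PER_DAY_JOB = 39


def rows_per_second(rows=ROWS_PER_DAY, seconds=SECONDS_PER_DAY_JOB):
    """Cluster-wide counting throughput for one day of data."""
    return rows / seconds


def sequential_seconds(days, seconds_per_day=SECONDS_PER_DAY_JOB):
    """Total time if each day's count runs back to back."""
    return days * seconds_per_day


month_seconds = sequential_seconds(30)
print(month_seconds)        # 1170 seconds
print(month_seconds / 60)   # 19.5 minutes
print(rows_per_second())    # roughly 2.2 million rows per second
```

Nearly twenty minutes for a simple row count over one month is usable for batch reports but too slow for interactive analysis.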
BEFORE ANALYTICS…
E X T R A C T T R A N S F O R M L O A D
m/e-university/data-warehouse-etl-toolkit-tutorial-201/surrounding-the-requirements-1
e.net/capgemini/emc-world-2014-breakout-move-to-the-business-data-lake-not-a
/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-h
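An extract-transform-load job can be sketched as three small functions chained together. This is a minimal illustration, not the pipeline used in the talk: the log format, the table name `access_log`, and the sample lines are all made up, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw log lines: "<day> <path> <status>"
RAW_LINES = [
    "2015-09-01 /index 200",
    "2015-09-01 /cart 500",
    "malformed line",
    "2015-09-02 /index 200",
]


def extract(lines):
    """Extract: read raw records from the source."""
    for line in lines:
        yield line.strip()


def transform(lines):
    """Transform: parse, validate and type-convert; drop malformed records."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        day, path, status = parts
        yield day, path, int(status)


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS access_log (day TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_LINES)), conn)
loaded = conn.execute("SELECT COUNT(*) FROM access_log").fetchone()[0]
print(loaded)  # 3 (the malformed line is dropped)
```

Because each stage is a generator, rows stream through one at a time instead of being held in memory all at once.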
DATA SCIENTIST’S
TOOLBOX
L I N U X
• The best server choice
• Free of charge and free to modify (open source)
• Easy to control the system
• Easy data processing
• Hadoop runs primarily on Linux
P O W E R F U L S H E L L S C R I P T
S Q L D A T A B A S E
• MySQL, PostgreSQL, Hive, MongoDB (NoSQL)
• Standard SQL language
• Store and manage data
R E L A T I O N A L D A T A B A S E
T A B L E R E L A T I O N
https://cloudant.com/blog/foundbites-data-model-relational-db-vs-nosql-on-cloudant/
http://ghtorrent.org/relational.html
S Q L S Y N T A X
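As a small illustration of table relations and basic SQL syntax, the sketch below joins two related tables and aggregates the result. The `users` and `orders` tables and their contents are hypothetical. SQLite (through Python's standard `sqlite3` module) is used only because it needs no server; the same query shape applies to MySQL or PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables related through orders.user_id -> users.id
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    amount REAL
);
INSERT INTO users VALUES (1, 'amy'), (2, 'bob');
INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 5.0);
""")

# SELECT ... FROM ... JOIN ... GROUP BY ... HAVING ... ORDER BY
query = """
SELECT u.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
FROM users AS u
JOIN orders AS o ON o.user_id = u.id
GROUP BY u.name
HAVING SUM(o.amount) > 10
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('amy', 2, 50.0)]
```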
R & P Y T H O N
• Basic Analysis Tools
• Easy to Learn
• Many Packages
• Example
• http://bryannotes.blogspot.tw/2014/08/r-ptt-wantedsocial-network-analysis.html
• http://bryannotes.blogspot.tw/2014/10/python-k-means-script.html
E T C …
• Excel
• Google Analytics
• Visualisation tools (Tableau)
• Web crawler
• Version control management (Git)
• ETL and job scheduling tools (Jenkins)
• …
D A T A I S T H E B I G G E S T
– J O S H W I L L S
“Person who is better at statistics than any software
engineer and better at software engineering than
any statistician.”
S T A T I S T I C S
W H Y D O W E N E E D M A C H I N E
L E A R N I N G ?
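• Clustering
How many groups can these people be divided into?
• Classification
Which person belongs to which class?
• Regression
What is the probability that an event occurs, or that a person belongs to a given class?
• Dimensionality reduction
Reduce the number of dimensions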
C L U S T E R I N G
http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/
source http://humble-developer.blogspot.tw/2011/01/kmeans-clustering-algorithm-part-1.html
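The k-means loop shown in the linked animations can be sketched in a few lines: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This is a minimal pure-Python sketch on made-up 2-D points. Real implementations usually pick the initial centroids at random; here they are passed in explicitly so the result is reproducible.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_centroids, iterations=20):
    """Plain k-means; stops early once the centroids no longer move."""
    centroids = list(initial_centroids)
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Made-up data: two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, clusters = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters[0])  # the three points near (1, 1)
print(clusters[1])  # the three points near (8, 8)
```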
C L A S S I F I C A T I O N
http://letsmakerobots.com/content/tcs3200-color-sensor-with-k-nearest-neighbor-classification-algorithm
http://www.astroml.org/sklearn_tutorial/
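In the spirit of the colour-sensor example linked above, here is a minimal k-nearest-neighbour classifier: a new reading gets the label that is most common among its k closest training examples. The RGB readings and labels are invented for illustration, and the function names are not taken from any library.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority label."""
    neighbours = sorted(train, key=lambda item: squared_distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical RGB sensor readings with known colours.
train = [
    ((250, 10, 10), "red"), ((230, 30, 20), "red"), ((200, 0, 40), "red"),
    ((10, 240, 20), "green"), ((30, 220, 10), "green"), ((0, 200, 50), "green"),
]

print(knn_predict(train, (240, 20, 30)))  # red
print(knn_predict(train, (20, 230, 30)))  # green
```

Unlike clustering, classification needs labelled examples: the algorithm only decides which existing class a new observation falls into.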
L O G I S T I C R E G R E S S I O N
https://www.coursera.org/instructor/andrewng
C O S T F U N C T I O N
https://www.coursera.org/instructor/andrewng
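The slide images are not preserved here, so the following assumes the standard formulation: logistic regression predicts a probability with the sigmoid function and is trained by minimising the cross-entropy cost J = -(1/m) Σ [y·log(h) + (1 - y)·log(1 - h)]. The sketch below implements that cost and a plain batch gradient descent on made-up one-feature data. The learning rate and epoch count are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def cost(weights, bias, xs, ys):
    """Mean cross-entropy cost over the data set."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict_proba(weights, bias, x), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def fit(xs, ys, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the cross-entropy cost."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Made-up data: small feature values are class 0, large ones class 1.
xs = [(0.5,), (1.0,), (1.5,), (3.0,), (3.5,), (4.0,)]
ys = [0, 0, 0, 1, 1, 1]

initial_cost = cost([0.0], 0.0, xs, ys)  # log(2): every prediction is 0.5
weights, bias = fit(xs, ys)
final_cost = cost(weights, bias, xs, ys)
print(initial_cost, final_cost)  # the cost drops as the model fits the data
```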
O V E R F I T T I N G
https://www.coursera.org/instructor/andrewng
OH MY GOD!
HOW DO YOU CHOOSE ONE?
M A C H I N E L E A R N I N G A L G O R I T H M
http://amueller.github.io/sklearn_tutorial/
S T A T I S T I C S V S M L
• Statistics
Focus on understanding data in terms of models
Interpretability, hypothesis testing
• Machine learning
Focus on the analysis of learning algorithms
Greater focus on prediction
S Y S T E M A T I C S A N D A U T O M A T I O N
http://www.slideshare.net/CetasAnalytics/cetas-e-baymeetupprezofinal
http://mlg.postech.ac.kr/projects/
SHOW YOUR DATA AND
FINDINGS
http://hortonworks.com/wp-content/uploads/2012/06/Tableau2.png
http://www.tableau.com
THE REAL CASE
HOW TO START?
• Codecademy http://www.codecademy.com/
Covers many programming languages, e.g. Python,
JavaScript, even shell script and SQL
• Coursera https://www.coursera.org/
A well-known MOOC site for self-paced learning.
http://nirvacana.com/thoughts/becoming-a-data-scientist/
