Big Data Analytics: Finding diamonds in the rough with Azure
Christos Charmatzis
@T.A. Geoforce
Athens Global Azure Bootcamp
2019
DATA TEAM
Agenda
• Introduction
• When do we have a Big Data problem?
• Finding the best solution for our Big Data problem
• Working inside the Data Team
• Extract the true value of our data
Introduction
What is Big Data?
"Big Data" is a field that treats ways to analyze, systematically extract
information from, or otherwise deal with data sets that are too large or
complex to be dealt with by traditional data-processing application
software.
Source: Wikipedia
The concept gained momentum in the early 2000s when industry analyst
Doug Laney articulated the now-mainstream definition of big data as the
three Vs:
1. Volume
2. Velocity
3. Variety
And because everything is relative:
What is small today (1TB) was big yesterday...
And what is big today (100TB) will be small tomorrow...
(We use 100TB because it is the dataset size of the Sort Benchmark competition,
http://sortbenchmark.org/ )
When do we have a Big Data problem?
Example 1
• 450GB Datasets
• Machine (M32ls instance, 32 vCPU, 256 GiB RAM, 1,024 GiB storage, ~€2,122.3736/month)
• Enterprise Database (e.g. SQL Server)
• Aggregation, Statistics, Summaries
Example 2
• 3TB Datasets
• Machine (M32ls instance, 32 vCPU, 256 GiB RAM, 1,024 GiB storage, ~€2,122.3736/month)
• Enterprise Database (e.g. SQL Server)
• Aggregation, Statistics, Summaries
Example 3
• 10TB Dataset
• Aggregation, Statistics, Summaries, Transformations etc.
Example 4
• 450GB Dataset
• Machine (M32ls instance, 32 vCPU, 256 GiB RAM, 1,024 GiB storage, ~€2,122.3736/month)
• Enterprise Database (e.g. SQL Server)
• Transformations
STAY WHERE YOU ARE | UPGRADE STORAGE | GO TO THE CLOUD | GO TO THE CLOUD
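A quick way to sanity-check the examples above is to compare each dataset size with the RAM and storage of the machine quoted on the slide. The sketch below is only that arithmetic. The helper names `to_gib` and `fits` are hypothetical, and it assumes GB/TB are decimal units (powers of ten) while GiB is binary (powers of two); the talk does not say how the sizes were measured.

```python
GIB = 2**30  # bytes in one gibibyte

# Machine quoted on the slide: 256 GiB RAM, 1,024 GiB storage.
MACHINE_RAM_GIB = 256
MACHINE_STORAGE_GIB = 1024


def to_gib(size_gb: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to gibibytes (2**30 bytes)."""
    return size_gb * 10**9 / GIB


def fits(size_gb: float, capacity_gib: float) -> bool:
    """Return True when a dataset of size_gb fits inside capacity_gib."""
    return to_gib(size_gb) <= capacity_gib


if __name__ == "__main__":
    # Dataset sizes from the four examples (assumed decimal GB).
    for name, size_gb in [("450GB", 450), ("3TB", 3_000), ("10TB", 10_000)]:
        print(
            f"{name}: fits on disk={fits(size_gb, MACHINE_STORAGE_GIB)}, "
            f"fits in RAM={fits(size_gb, MACHINE_RAM_GIB)}"
        )
```

Under these assumptions only the 450GB datasets fit on the machine's disk, and none of the datasets fits entirely in RAM, so the type of workload matters as much as the raw size.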
Big Data Infrastructure comparison
Spark Cluster
• Initial release > 2014
• Supported programming languages: Java, Scala, Python, R, Julia
• Performance keys: Partitioning
!==
DB on premises
• Initial release < 2014
• Supported programming languages: almost every programming language
• Performance keys: Indexes
Big Data Performance
In Spark always:
• use “df.explain(true)”
• Or check the DAG!
Every time a block changes, the data are repartitioned!!!
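Why repartitioning is costly can be illustrated without a cluster. The following is a plain-Python sketch, not Spark code: `partition_of`, `partition_by` and `rows_moved` are hypothetical helpers that mimic hash partitioning on integer keys and count how many rows would have to move to a different partition when the partitioning key changes. The column names `customer_id` and `store_id` are made up for the example.

```python
from collections import defaultdict


def partition_of(value: int, num_partitions: int) -> int:
    """Map an integer key to a partition index (toy stand-in for hashing)."""
    return value % num_partitions


def partition_by(rows, key: str, num_partitions: int):
    """Group rows into partitions according to the given key column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_of(row[key], num_partitions)].append(row)
    return partitions


def rows_moved(rows, old_key: str, new_key: str, num_partitions: int) -> int:
    """Count rows whose partition index changes when the key changes."""
    return sum(
        1
        for row in rows
        if partition_of(row[old_key], num_partitions)
        != partition_of(row[new_key], num_partitions)
    )


if __name__ == "__main__":
    rows = [{"customer_id": i, "store_id": i // 3} for i in range(12)]
    print("rows moved:", rows_moved(rows, "customer_id", "store_id", 4))
```

In this toy data most rows change partition as soon as the key changes, which is the kind of data movement that shows up as an exchange in a query plan. Keeping consecutive steps on the same key avoids it.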
Finding the best solution for our Big Data problem
• Hadoop on a cluster of Azure Virtual Machines
• Azure HDInsight (Clusters as-a-service)
• Azure Databricks
• Azure Data Factory (New & Improved!!!!)
• Azure Data Lake Analytics (Queries as-a-service)
Big Data in Azure: Storage
Azure Blob Storage
• Object Storage
• General purpose (files & workloads)
Azure Data Lake
• Hierarchical file system
• Optimized for analytics workloads
Azure Data Lake (Gen.2)
• Multi-modal storage
• Optimized for analytics workloads
Big Data in Azure: Storage
Azure Blob Storage
wasb[s]://containername@accountname.blob.core.windows.net/file.csv
Azure Data Lake
abfs[s]://filesystemname@accountname.dfs.core.windows.net/file.csv
(abfs[s]:// is the Gen.2 driver; Gen.1 accounts use adl://)
Azure Data Lake (Gen.2)
• Endpoint: object store access, Blob API using wasb[s]://
• Endpoint: file system access, ADLS Gen 2 API using abfs[s]://
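The two URI patterns above are easy to mistype, so it can help to build them in one place. This is a minimal sketch using only string formatting; the function names `blob_uri` and `adls_gen2_uri` are hypothetical, and the `secure` flag simply chooses between the plain and the TLS variant of each scheme.

```python
def blob_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Build a wasb[s]:// URI for the Blob endpoint of a storage account."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adls_gen2_uri(filesystem: str, account: str, path: str, secure: bool = True) -> str:
    """Build an abfs[s]:// URI for the file system (dfs) endpoint."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Placeholder names taken from the slide.
    print(blob_uri("containername", "accountname", "file.csv"))
    print(adls_gen2_uri("filesystemname", "accountname", "file.csv"))
```

The container, file system and account names are the placeholders from the slide; real values depend on your storage account.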
Azure Data Factory
A managed cloud service for building & operating data
pipelines.
Azure Data Factory
Source: https://channel9.msdn.com
Azure Data Factory (ADF)
DEMO
Why do we need tools like ADF?
85% of working time is spent on data wrangling!!!
ADF Pricing
https://azure.microsoft.com/en-us/pricing/details/data-factory/
Tip: Look out!!! Data reads and writes are the most
expensive part of Big Data analytics
Working inside the Data Team
Rembrandt (1662). The Sampling Officials (Dutch: De Staalmeesters)
"We must compare the results with last year's..."
"I will run the whole thing, again"
"Don't we have that report somewhere?"
"Yes, they're in a folder, inside a VM, inside John's PC..."
"No, we uploaded them to blob storage... I don't remember"
"Somewhere, inside a meeting room…."
Metadata area
With Data come problems….
With Big Data come Bigger Problems!!!
Like….
• Many datasets
• Frequent updates
• Many fields
• Many users
Where do I keep the metadata?
• Azure Data Catalog
• Databricks Delta Lake
• Create your own meta-portal
Be aware: always use metadata standards (ISO, Dublin Core, MPEG-7 …)
More info:
https://en.wikipedia.org/wiki/Metadata_standard#Available_metadata_standards
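As an illustration of what "use a metadata standard" can mean in practice, here is a small sketch that checks a dataset record against the fifteen element names of the Dublin Core Metadata Element Set. The record contents and the helper name `unknown_elements` are hypothetical; a real catalog would also validate the values, not only the keys.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


def unknown_elements(record: dict) -> list:
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(key for key in record if key not in DUBLIN_CORE_ELEMENTS)


if __name__ == "__main__":
    # Hypothetical record for a dataset kept in blob storage.
    record = {
        "title": "Yearly sales summary",
        "creator": "Data Team",
        "date": "2019-04-27",
        "format": "text/csv",
        "owner": "John",  # not a Dublin Core element
    }
    print("non-standard keys:", unknown_elements(record))
```

Flagging non-standard keys early keeps a home-made meta-portal from drifting into a private vocabulary that nobody else can read.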
Azure Data Factory Metadata
The Get Metadata activity collects metadata about any data in Azure Data
Factory.
Get Metadata activity supports:
• itemName
• itemType
• size
• created
• lastModified
• childItems
• contentMD5
• structure
• columnCount
• exists
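The field list above can be turned into an activity definition. The sketch below builds the JSON for a Get Metadata activity as a Python dictionary and rejects field names that are not in the list. The JSON shape (`"type": "GetMetadata"`, `fieldList`, a `DatasetReference`) reflects my understanding of the ADF pipeline format and should be treated as an assumption to check against the current documentation. The activity and dataset names are hypothetical, and field names are written in the lowercase camelCase form.

```python
import json

# Field names from the list above, in camelCase.
SUPPORTED_FIELDS = (
    "itemName", "itemType", "size", "created", "lastModified",
    "childItems", "contentMD5", "structure", "columnCount", "exists",
)


def get_metadata_activity(name: str, dataset_name: str, fields: list) -> dict:
    """Build a Get Metadata activity definition as a plain dictionary."""
    unsupported = [field for field in fields if field not in SUPPORTED_FIELDS]
    if unsupported:
        raise ValueError(f"unsupported metadata fields: {unsupported}")
    return {
        "name": name,
        "type": "GetMetadata",
        "typeProperties": {
            "fieldList": list(fields),
            "dataset": {
                "referenceName": dataset_name,
                "type": "DatasetReference",
            },
        },
    }


if __name__ == "__main__":
    # "GetFileInfo" and "InputDataset" are made-up names.
    activity = get_metadata_activity(
        "GetFileInfo", "InputDataset", ["size", "lastModified"]
    )
    print(json.dumps(activity, indent=2))
```

Validating the field list locally catches typos such as `Size` before the pipeline is deployed.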
Extract real value from the data
Visualize data | Write good experiments | Share results
And we just scratched the surface of that…
Conclusions
• For ETL projects from on premises to the cloud, use Azure Data
Factory
• Size isn’t always the problem in your case
• Velocity isn’t only on the code side, you HAVE to know your
data
• Create METADATA
Thank U
Qs+As
Please evaluate:
http://bit.ly/AAB2019Evaluation
