Analyzing Semi-Structured Data at Volume in the Cloud
Kevin Bair
Solution Architect
Kevin.Bair@snowflake.net
Topics this presentation will cover
1.  Structured vs. Semi-Structured
2.  ETL / Data Pipeline Architecture
3.  Analytics on Semi-Structured Data (Clickstream Demo)
4.  Analyzing Structured with Semi-Structured Data (Twitter Feed Demo)
5.  Time permitting: Cloud Big Data / Data Warehousing
Today’s data: big, complex, moving to cloud
•  Surge in cloud spending and supporting technology (IDC)
•  … of workloads will be processed in cloud data centers (Cisco)
•  Data in the cloud today is expected to grow in the next two years (Gigaom)
Structured data and Semi-Structured data
Structured data
•  Transactional data
•  Relational
•  Fixed schema
•  OLTP / OLAP
Semi-structured data
•  Machine-generated
•  Non-relational
•  Varying schema
•  Most common in cloud environments
What does Semi-Structured mean?
•  Data that may be of any type
•  Data that is variable in length (arrays)
•  Structure that can change rapidly and unpredictably
•  Usually self-describing
•  Examples: XML, Avro, JSON
XML Example
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
    <calories>650</calories>
  </food>
  <food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>Light Belgian waffles covered with strawberries and whipped cream</description>
    <calories>900</calories>
  </food>
  <food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>$8.95</price>
    <description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
    <calories>900</calories>
  </food>
</breakfast_menu>
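A minimal sketch, assuming Python and its standard-library xml.etree.ElementTree module (neither is part of the slides), of how the self-describing tags above could be read into typed rows. The name menu_rows and the abbreviated MENU_XML string are illustrative only.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the menu above; descriptions and the XML declaration
# are left out to keep the sketch short.
MENU_XML = """
<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price><calories>650</calories></food>
  <food><name>Strawberry Belgian Waffles</name><price>$7.95</price><calories>900</calories></food>
  <food><name>Berry-Berry Belgian Waffles</name><price>$8.95</price><calories>900</calories></food>
</breakfast_menu>
"""


def menu_rows(xml_text):
    """Turn each <food> element into a dict with typed values."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for food in root.findall("food"):
        rows.append({
            "name": food.findtext("name"),
            # Prices carry a leading "$" in the source, so strip it before converting.
            "price": float(food.findtext("price").lstrip("$")),
            "calories": int(food.findtext("calories")),
        })
    return rows


if __name__ == "__main__":
    for row in menu_rows(MENU_XML):
        print(row)
```

The tag names travel with every record, which is what makes the format self-describing; the cost is that types (price, calories) have to be imposed by the reader.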
JSON Example
{
  "custkey": "450002",
  "useragent": {
    "devicetype": "pc",
    "experience": "browser",
    "platform": "windows"
  },
  "pagetype": "home",
  "productline": "television",
  "customerprofile": {
    "age": 20,
    "gender": "male",
    "customerinterests": [
      "movies",
      "fashion",
      "music"
    ]
  }
}
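As an illustration of navigating this kind of nested document, here is a small sketch in Python (an assumption; the slides do not use Python). The helper get_path and its dotted-path syntax are hypothetical and only meant to show that nested attributes can be addressed by path.

```python
import json

RECORD = """
{
  "custkey": "450002",
  "useragent": {"devicetype": "pc", "experience": "browser", "platform": "windows"},
  "pagetype": "home",
  "productline": "television",
  "customerprofile": {
    "age": 20,
    "gender": "male",
    "customerinterests": ["movies", "fashion", "music"]
  }
}
"""


def get_path(doc, path, default=None):
    """Walk a dotted path such as 'useragent.platform' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # attribute absent in this record
        current = current[key]
    return current


doc = json.loads(RECORD)

if __name__ == "__main__":
    print(get_path(doc, "useragent.platform"))
    print(get_path(doc, "customerprofile.customerinterests"))
```

Returning a default for a missing path, instead of raising, reflects the varying-schema nature of the data: a record without a given attribute is normal, not an error.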
  
Avro Example
Schema: JSON
Data: Binary
Why is this so hard for a traditional Relational DBMS?
•  Pre-defined schema
•  Stored in a Character Large Object (CLOB) data type
•  Constantly changing
•  Inefficient to query
Current architectures can’t keep up
Data Warehousing
•  Complex: manage hardware, data distribution, indexes, …
•  Limited elasticity: forklift upgrades, data redistribution, downtime
•  Costly: overprovisioning, significant care & feeding
Hadoop
•  Complex: specialized skills, new tools
•  Limited elasticity: data redistribution, resource contention
•  Not a data warehouse: batch-oriented, limited optimization, incomplete security
Data Pipeline / Data Lake Architecture – “ETL”
Source: Website Logs, Operational Systems, External Providers, Stream Data
Stage: S3 (10 TB)
Data Lake: Hadoop (30 TB)
Stage: S3 (5 TB, summary)
EDW: MPP (10 TB disk)
ETL vs ELT for Big Data
•  Think more strategically about file formats, size, storage methods, standards
•  Processing Power – Tools vs “Services”
•  Pipeline – Where should the analysis occur?
•  Platform
•  Unlimited Processing Power
•  Contention for resources
•  Support SQL for both Schema-on-write and Schema-on-read with full “indexing” for Structured / Semi-Structured
•  Compress, Clone metadata, don’t replicate…
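To make the schema-on-read idea from the list above concrete, here is a hedged Python sketch (Python, the sample records, and the name read_with_schema are all assumptions, not part of the talk). Raw records are loaded untouched, as in ELT, and a schema is applied only when a query asks for specific columns.

```python
import json

# Hypothetical raw records landed as-is; note that their attributes differ.
RAW_LINES = [
    '{"custkey": "450002", "pagetype": "home", "productline": "television"}',
    '{"custkey": "450003", "pagetype": "product", "useragent": {"devicetype": "phone"}}',
]


def read_with_schema(lines, columns):
    """Schema-on-read: project each raw record onto the requested columns.

    Attributes a record does not have come back as None, so a new or
    missing attribute never blocks the load step.
    """
    for line in lines:
        record = json.loads(line)
        yield {column: record.get(column) for column in columns}


if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, ["custkey", "productline"]):
        print(row)
```

With schema-on-write, the second record would have to be reshaped or rejected before loading; here the decision is deferred to query time.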
Data Pipeline / Snowflake Architecture – “ELT”
Source: Website Logs, Operational Systems, External Providers, Stream Data
Stage: S3 (10 TB)
EDW: Snowflake (2 TB disk)
Demo Scenarios
•  Clickstream Analysis (load JSON, multi-table insert)
    •  Which product category is most clicked on?
    •  Which product line does the customer self-identify as having the most interest in?
•  Twitter Feed (join structured and semi-structured)
    •  From our Twitter campaign, is there a correlation between Twitter volume and sales?
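The Twitter scenario joins semi-structured tweets with a structured sales table. A rough Python sketch of that idea follows; the tweet documents, the sales figures, and every function name are hypothetical, and the demo itself was run in SQL, not Python.

```python
import json
from collections import Counter
from math import sqrt

# Hypothetical semi-structured tweets, one JSON document per line.
TWEETS = [
    '{"created_at": "2015-06-01", "text": "love the new TV"}',
    '{"created_at": "2015-06-02", "text": "great deal"}',
    '{"created_at": "2015-06-02", "text": "just ordered", "lang": "en"}',
    '{"created_at": "2015-06-03", "text": "arrived today"}',
    '{"created_at": "2015-06-03", "text": "picture is sharp"}',
    '{"created_at": "2015-06-03", "text": "recommended"}',
]

# Hypothetical structured sales table: (day, revenue).
SALES = [("2015-06-01", 100.0), ("2015-06-02", 200.0), ("2015-06-03", 300.0)]


def daily_tweet_counts(lines):
    """Aggregate the semi-structured side: tweets per day."""
    return Counter(json.loads(line)["created_at"] for line in lines)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def tweets_vs_sales(tweets, sales):
    """Join tweet volume to sales on the day key, then correlate."""
    counts = daily_tweet_counts(tweets)
    joined = [(counts[day], revenue) for day, revenue in sales if day in counts]
    volumes, revenues = zip(*joined)
    return pearson(volumes, revenues)


if __name__ == "__main__":
    print(tweets_vs_sales(TWEETS, SALES))
```

The sample data is deliberately linear, so it says nothing about real campaigns; it only shows the shape of the join and the calculation.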
Clickstream Example
{
  "custkey": "450002",
  "useragent": {
    "devicetype": "pc",
    "experience": "browser",
    "platform": "windows"
  },
  "pagetype": "home",
  "productline": "none",
  "customerprofile": {
    "age": 20,
    "gender": "male",
    "customerinterests": [
      "movies",
      "fashion",
      "music"
    ]
  }
}
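The two clickstream questions from the demo scenarios can be sketched over records of this shape. The following Python example is an assumption-laden illustration: the events are made up, the function names are hypothetical, and treating "none" as "no product line" is a guess based on the record above.

```python
from collections import Counter

# Hypothetical clickstream events shaped like the record above (fields trimmed).
EVENTS = [
    {"productline": "television",
     "customerprofile": {"customerinterests": ["movies", "fashion", "music"]}},
    {"productline": "television",
     "customerprofile": {"customerinterests": ["music"]}},
    {"productline": "audio",
     "customerprofile": {"customerinterests": ["movies", "music"]}},
    {"productline": "none", "customerprofile": {}},
]


def top(counter):
    """Return the (value, count) pair with the highest count, or None if empty."""
    return counter.most_common(1)[0] if counter else None


def most_clicked_productline(events):
    # Assumption: "none" marks an event that is not tied to a product line.
    return top(Counter(
        event["productline"] for event in events
        if event.get("productline") not in (None, "none")
    ))


def top_interest(events):
    # Each entry of the nested customerinterests array counts once per event.
    return top(Counter(
        interest
        for event in events
        for interest in event.get("customerprofile", {}).get("customerinterests", [])
    ))


if __name__ == "__main__":
    print(most_clicked_productline(EVENTS))
    print(top_interest(EVENTS))
```

The second question needs the nested array expanded into one value per element before counting, which is the same step FLATTEN() performs in SQL.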
  
Relational Processing of Semi-Structured Data
1. Variant data type compresses storage of semi-structured data
2. Data is analyzed during load to discern repetitive attributes within the hierarchy
3. Repetitive attributes are columnar compressed and statistics are collected for relational query optimization
4. SQL extensions enable relational queries against both semi-structured and structured data
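Steps 2 and 3 above can be pictured with a conceptual Python sketch. This is not Snowflake's implementation, which the slides do not describe in detail; the function names, the min_share threshold, and the sample records are all assumptions. The sketch scans records at load time, finds attribute paths that repeat across most of them, and stores those paths as columns.

```python
from collections import Counter


def leaf_paths(value, prefix=""):
    """Yield (path, leaf) pairs for nested dicts; lists are kept as leaf values."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, value


def shred(records, min_share=0.5):
    """Columnarize every path present in at least min_share of the records."""
    flat = [dict(leaf_paths(record)) for record in records]
    frequency = Counter(path for row in flat for path in row)
    common = sorted(
        path for path, count in frequency.items()
        if count / len(records) >= min_share
    )
    # One list per repetitive path; None marks records lacking the attribute.
    return {path: [row.get(path) for row in flat] for path in common}


# Hypothetical records with partly overlapping attributes.
RECORDS = [
    {"custkey": "450002", "useragent": {"devicetype": "pc"}, "pagetype": "home"},
    {"custkey": "450003", "useragent": {"devicetype": "phone"}, "promo": "x"},
    {"custkey": "450004", "pagetype": "cart"},
]

if __name__ == "__main__":
    for path, values in shred(RECORDS).items():
        print(path, values)
```

Once the repetitive paths sit in columns, per-column statistics such as minimum and maximum values become cheap to collect, which is what a relational optimizer needs.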
FLATTEN() in Snowflake SQL
(Removing one level of nesting)
FLATTEN() converts a repeated field into a set of rows:
SELECT s.fullrow:fullName, t.value:name, t.value:age
FROM json_data_table AS s, TABLE(FLATTEN(s.fullrow, 'children')) t
WHERE s.fullrow:fullName = 'Mike Jones'
AND t.value:age::integer > 6;
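For readers who want to see the row expansion outside SQL, here is a rough Python analogue of the FLATTEN() query. It is a sketch under assumptions: the sample rows are invented (the slides show no data for json_data_table), and the flatten helper only mimics the one-level expansion, not the full behavior of the Snowflake function.

```python
def flatten(row, path):
    """Yield one (row, element) pair per element of the array stored at row[path]."""
    for element in row.get(path) or []:
        yield row, element


# Hypothetical contents of the fullrow column.
ROWS = [
    {"fullName": "Mike Jones",
     "children": [{"name": "Ann", "age": 9}, {"name": "Bo", "age": 4}]},
    {"fullName": "Sara Lee",
     "children": [{"name": "Cy", "age": 12}]},
    {"fullName": "No Kids"},
]

# Same shape as the SQL: expand children, then filter on parent and child attributes.
result = [
    (s["fullName"], t["name"], t["age"])
    for row in ROWS
    for s, t in flatten(row, "children")
    if s["fullName"] == "Mike Jones" and int(t["age"]) > 6
]

if __name__ == "__main__":
    print(result)
```

Each parent row is repeated once per child, so parent attributes (fullName) and child attributes (name, age) can be filtered and selected side by side.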
What makes Snowflake unique for handling Semi-Structured Data?
•  Compression
•  Encryption / Role Based Authentication
•  Shredding
•  History/Results
•  Clone
•  Time Travel
•  Flatten
•  Regexp
•  No Contention
•  No Tuning
•  Infinitely scalable
•  SQL based with extremely high performance
One Platform for all Business Data
Snowflake
✓ One System
✓ One Common Skillset
✓ Faster/Less Costly Data Conversion
✓ For both Structured and Semi-Structured Business Data
Other Systems
✓ Multiple Systems
✓ Specialized Skillset
✓ Slower/More Costly Data Conversion
Diagram labels: Map-Reduce Jobs, Data Sink, Structured Storage, HDFS, Relational Databases
Structured data
Apple 101.12 250 FIH-2316
Pear 56.22 202 IHO-6912
Orange 98.21 600 WHQ-6090
Semi-structured data
{ "firstName": "John",
  "lastName": "Smith",
  "height_cm": 167.64,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
THANK YOU!
Ad

More Related Content

What's hot (17)

Cloud Data Warehousing presentation by Rogier Werschkull, including tips, bes...
Cloud Data Warehousing presentation by Rogier Werschkull, including tips, bes...Cloud Data Warehousing presentation by Rogier Werschkull, including tips, bes...
Cloud Data Warehousing presentation by Rogier Werschkull, including tips, bes...
Patrick Van Renterghem
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on Snowflake
Kent Graziano
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
Snowflake Computing
 
AWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summit Singapore 2019 | Snowflake: Your Data. No LimitsAWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summits
 
Sydney: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud
Sydney: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud Sydney: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud
Sydney: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud
Certus Solutions
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret
 
SLC Snowflake User Group - Mar 12, 2020
SLC Snowflake User Group - Mar 12, 2020SLC Snowflake User Group - Mar 12, 2020
SLC Snowflake User Group - Mar 12, 2020
Nathan Skousen
 
Snowflake essentials
Snowflake essentialsSnowflake essentials
Snowflake essentials
qureshihamid
 
Launching a Data Platform on Snowflake
Launching a Data Platform on SnowflakeLaunching a Data Platform on Snowflake
Launching a Data Platform on Snowflake
KETL Limited
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Databricks
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for Dinner
Kent Graziano
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
Snowflake Computing
 
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
HOW TO SAVE  PILEs of $$$BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...HOW TO SAVE  PILEs of $$$BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
Kent Graziano
 
SQL vs NoSQL
SQL vs NoSQLSQL vs NoSQL
SQL vs NoSQL
Naseeba P P
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
Adaryl "Bob" Wakefield, MBA
 
KSnow: Getting started with Snowflake
KSnow: Getting started with SnowflakeKSnow: Getting started with Snowflake
KSnow: Getting started with Snowflake
Knoldus Inc.
 
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Certus Solutions
 
Cloud Data Warehousing presentation by Rogier Werschkull, including tips, bes...
Cloud Data Warehousing presentation by Rogier Werschkull, including tips, bes...Cloud Data Warehousing presentation by Rogier Werschkull, including tips, bes...
Cloud Data Warehousing presentation by Rogier Werschkull, including tips, bes...
Patrick Van Renterghem
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on Snowflake
Kent Graziano
 
AWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summit Singapore 2019 | Snowflake: Your Data. No LimitsAWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summits
 
Sydney: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud
Sydney: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud Sydney: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud
Sydney: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud
Certus Solutions
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret
 
SLC Snowflake User Group - Mar 12, 2020
SLC Snowflake User Group - Mar 12, 2020SLC Snowflake User Group - Mar 12, 2020
SLC Snowflake User Group - Mar 12, 2020
Nathan Skousen
 
Snowflake essentials
Snowflake essentialsSnowflake essentials
Snowflake essentials
qureshihamid
 
Launching a Data Platform on Snowflake
Launching a Data Platform on SnowflakeLaunching a Data Platform on Snowflake
Launching a Data Platform on Snowflake
KETL Limited
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Databricks
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for Dinner
Kent Graziano
 
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
HOW TO SAVE  PILEs of $$$BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...HOW TO SAVE  PILEs of $$$BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
Kent Graziano
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
Adaryl "Bob" Wakefield, MBA
 
KSnow: Getting started with Snowflake
KSnow: Getting started with SnowflakeKSnow: Getting started with Snowflake
KSnow: Getting started with Snowflake
Knoldus Inc.
 
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Certus Solutions
 

Viewers also liked (14)

Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
net2-project
 
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Faysal Shaarani (MBA)
 
NT sf-bjq
NT sf-bjqNT sf-bjq
NT sf-bjq
sofia Ferrando
 
The Snowflake Effect: open learning without barriers
The Snowflake Effect: open learning without barriersThe Snowflake Effect: open learning without barriers
The Snowflake Effect: open learning without barriers
Erik Duval
 
Difference between data warehouse and data mining
Difference between data warehouse and data miningDifference between data warehouse and data mining
Difference between data warehouse and data mining
maxonlinetr
 
eCloud newspapers
eCloud newspaperseCloud newspapers
eCloud newspapers
Erik Duval
 
Introduction Of Artificial neural network
Introduction Of Artificial neural networkIntroduction Of Artificial neural network
Introduction Of Artificial neural network
Nagarajan
 
Distributed blood bank management system database
Distributed blood bank management system databaseDistributed blood bank management system database
Distributed blood bank management system database
Saimunur Rahman
 
Neural networks
Neural networksNeural networks
Neural networks
Slideshare
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016
Kent Graziano
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKS
REHMAT ULLAH
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications
Ahmed_hashmi
 
IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data
IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured DataIBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data
IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data
Perficient, Inc.
 
Amazon.com Business Model
Amazon.com Business ModelAmazon.com Business Model
Amazon.com Business Model
Raveena Balani
 
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
net2-project
 
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Faysal Shaarani (MBA)
 
The Snowflake Effect: open learning without barriers
The Snowflake Effect: open learning without barriersThe Snowflake Effect: open learning without barriers
The Snowflake Effect: open learning without barriers
Erik Duval
 
Difference between data warehouse and data mining
Difference between data warehouse and data miningDifference between data warehouse and data mining
Difference between data warehouse and data mining
maxonlinetr
 
eCloud newspapers
eCloud newspaperseCloud newspapers
eCloud newspapers
Erik Duval
 
Introduction Of Artificial neural network
Introduction Of Artificial neural networkIntroduction Of Artificial neural network
Introduction Of Artificial neural network
Nagarajan
 
Distributed blood bank management system database
Distributed blood bank management system databaseDistributed blood bank management system database
Distributed blood bank management system database
Saimunur Rahman
 
Neural networks
Neural networksNeural networks
Neural networks
Slideshare
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016
Kent Graziano
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKS
REHMAT ULLAH
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications
Ahmed_hashmi
 
IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data
IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured DataIBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data
IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data
Perficient, Inc.
 
Amazon.com Business Model
Amazon.com Business ModelAmazon.com Business Model
Amazon.com Business Model
Raveena Balani
 
Ad

Similar to Analyzing Semi-Structured Data At Volume In The Cloud (20)

Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
MongoDB
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
George Stathis
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
Nisha Talagala
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
Martin Bém
 
Database Management System Processing.ppt
Database Management System Processing.pptDatabase Management System Processing.ppt
Database Management System Processing.ppt
HajarMeseehYaseen
 
Modèles de données et langages de description ouverts 6 - 2021-2022
Modèles de données et langages de description ouverts   6 - 2021-2022Modèles de données et langages de description ouverts   6 - 2021-2022
Modèles de données et langages de description ouverts 6 - 2021-2022
François-Xavier Boffy
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
Ike Ellis
 
Middle Tier Scalability - Present and Future
Middle Tier Scalability - Present and FutureMiddle Tier Scalability - Present and Future
Middle Tier Scalability - Present and Future
dfilppi
 
Lecture1
Lecture1Lecture1
Lecture1
Manish Singh
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
IdontKnow66967
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Taming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsTaming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI Options
Kellyn Pot'Vin-Gorman
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dataconomy Media
 
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
confluent
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
Dr Pradhan PL Pradhan
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
Ike Ellis
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
MongoDB
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
George Stathis
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
Nisha Talagala
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
Martin Bém
 
Database Management System Processing.ppt
Database Management System Processing.pptDatabase Management System Processing.ppt
Database Management System Processing.ppt
HajarMeseehYaseen
 
Modèles de données et langages de description ouverts 6 - 2021-2022
Modèles de données et langages de description ouverts   6 - 2021-2022Modèles de données et langages de description ouverts   6 - 2021-2022
Modèles de données et langages de description ouverts 6 - 2021-2022
François-Xavier Boffy
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
Ike Ellis
 
Middle Tier Scalability - Present and Future
Middle Tier Scalability - Present and FutureMiddle Tier Scalability - Present and Future
Middle Tier Scalability - Present and Future
dfilppi
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Taming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsTaming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI Options
Kellyn Pot'Vin-Gorman
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dataconomy Media
 
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
confluent
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
Ike Ellis
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Ad

More from Robert Dempsey (20)

Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning Pipeline
Robert Dempsey
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of Data
Robert Dempsey
 
Practical Predictive Modeling in Python
Practical Predictive Modeling in PythonPractical Predictive Modeling in Python
Practical Predictive Modeling in Python
Robert Dempsey
 
Creating Your First Predictive Model In Python
Creating Your First Predictive Model In PythonCreating Your First Predictive Model In Python
Creating Your First Predictive Model In Python
Robert Dempsey
 
Growth Hacking 101
Growth Hacking 101Growth Hacking 101
Growth Hacking 101
Robert Dempsey
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
Robert Dempsey
 
DC Python Intro Slides - Rob's Version
DC Python Intro Slides - Rob's VersionDC Python Intro Slides - Rob's Version
DC Python Intro Slides - Rob's Version
Robert Dempsey
 
Content Marketing Strategy for 2013
Content Marketing Strategy for 2013Content Marketing Strategy for 2013
Content Marketing Strategy for 2013
Robert Dempsey
 
Creating Lead-Generating Social Media Campaigns
Creating Lead-Generating Social Media CampaignsCreating Lead-Generating Social Media Campaigns
Creating Lead-Generating Social Media Campaigns
Robert Dempsey
 
Goal Writing Workshop
Goal Writing WorkshopGoal Writing Workshop
Goal Writing Workshop
Robert Dempsey
 
Google AdWords Introduction
Google AdWords IntroductionGoogle AdWords Introduction
Google AdWords Introduction
Robert Dempsey
 
20 Tips For Freelance Success
20 Tips For Freelance Success20 Tips For Freelance Success
20 Tips For Freelance Success
Robert Dempsey
 
How To Turn Your Business Into A Media Powerhouse
How To Turn Your Business Into A Media PowerhouseHow To Turn Your Business Into A Media Powerhouse
How To Turn Your Business Into A Media Powerhouse
Robert Dempsey
 
Agile Teams as Innovation Teams
Agile Teams as Innovation TeamsAgile Teams as Innovation Teams
Agile Teams as Innovation Teams
Robert Dempsey
 
Introduction to kanban
Introduction to kanbanIntroduction to kanban
Introduction to kanban
Robert Dempsey
 
Get The **** Up And Market
Get The **** Up And MarketGet The **** Up And Market
Get The **** Up And Market
Robert Dempsey
 
Introduction To Inbound Marketing
Introduction To Inbound MarketingIntroduction To Inbound Marketing
Introduction To Inbound Marketing
Robert Dempsey
 
Writing Agile Requirements
Writing  Agile  RequirementsWriting  Agile  Requirements
Writing Agile Requirements
Robert Dempsey
 
Twitter For Business
Twitter For BusinessTwitter For Business
Twitter For Business
Robert Dempsey
 
Introduction To Scrum For Managers
Introduction To Scrum For ManagersIntroduction To Scrum For Managers
Introduction To Scrum For Managers
Robert Dempsey
 
Building A Production-Level Machine Learning Pipeline
Analyzing Semi-Structured Data At Volume In The Cloud

  • 1. Analyzing Semi-Structured Data at Volume in the Cloud. Kevin Bair, Solution Architect, Kevin.Bair@snowflake.net
  • 2. Topics this presentation will cover
      1. Structured vs. Semi-Structured
      2. ETL / Data Pipeline Architecture
      3. Analytics on Semi-Structured Data (Clickstream Demo)
      4. Analyzing Structured with Semi-Structured Data (Twitter feed Demo)
      5. Time permitting: Cloud Big Data / Data Warehousing
  • 3. Today’s data: big, complex, moving to cloud
      • Surge in cloud spending and supporting technology (IDC)
      • […] of workloads will be processed in cloud data centers (Cisco)
      • Data in the cloud today is expected to grow in the next two years (Gigaom)
  • 4. Structured data and Semi-Structured data
      Structured data:
      • Transactional data
      • Relational
      • Fixed schema
      • OLTP / OLAP
      Semi-Structured data:
      • Machine-generated
      • Non-relational
      • Varying schema
      • Most common in cloud environments
  • 5. What does Semi-Structured mean?
      • Data that may be of any type
      • Data that is variable in length (arrays)
      • Structure that can change rapidly and unpredictably
      • Usually self-describing
      • Examples: XML, Avro, JSON
  • 6. XML Example
      <?xml version="1.0" encoding="UTF-8"?>
      <breakfast_menu>
        <food>
          <name>Belgian Waffles</name>
          <price>$5.95</price>
          <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
          <calories>650</calories>
        </food>
        <food>
          <name>Strawberry Belgian Waffles</name>
          <price>$7.95</price>
          <description>Light Belgian waffles covered with strawberries and whipped cream</description>
          <calories>900</calories>
        </food>
        <food>
          <name>Berry-Berry Belgian Waffles</name>
          <price>$8.95</price>
          <description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
          <calories>900</calories>
        </food>
      </breakfast_menu>
  • 7. JSON Example
      {
        "custkey": "450002",
        "useragent": {
          "devicetype": "pc",
          "experience": "browser",
          "platform": "windows"
        },
        "pagetype": "home",
        "productline": "television",
        "customerprofile": {
          "age": 20,
          "gender": "male",
          "customerinterests": [
            "movies",
            "fashion",
            "music"
          ]
        }
      }
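A minimal Python sketch of what "self-describing" and "variable in length" mean in practice for the record above. The `referrer` key is hypothetical (it does not appear on the slide) and is only there to show how an attribute that is missing from a record might be handled.

```python
import json

# The record from the slide, as a raw JSON string.
raw = """
{
    "custkey": "450002",
    "useragent": {"devicetype": "pc", "experience": "browser", "platform": "windows"},
    "pagetype": "home",
    "productline": "television",
    "customerprofile": {
        "age": 20,
        "gender": "male",
        "customerinterests": ["movies", "fashion", "music"]
    }
}
"""

record = json.loads(raw)

# Nested objects are reached by chaining keys.
platform = record["useragent"]["platform"]

# Arrays vary in length from record to record, so iterate instead of
# assuming a fixed number of elements.
interests = record["customerprofile"]["customerinterests"]

# A key may be absent in some records; .get() returns None instead of raising.
referrer = record.get("referrer")  # hypothetical key, not in the slide

print(platform, len(interests), referrer)
```

The field names travel with the data, so no schema has to be declared before the record can be read.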
  • 9. Why is this so hard for a traditional Relational DBMS?
      • Pre-defined schema
      • Semi-structured data is stored in a Character Large Object (CLOB) data type
      • Structure is constantly changing
      • Inefficient to query
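The CLOB problem can be illustrated with a small sketch, assuming SQLite (from the Python standard library) as a stand-in for a traditional relational database and a `TEXT` column as a stand-in for a CLOB. The table name, the second and third events, and the `coupon` attribute are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# The relational schema knows nothing about the JSON structure:
# each document is one opaque text value.
conn.execute("CREATE TABLE clicks (id INTEGER PRIMARY KEY, payload TEXT)")

events = [
    {"custkey": "450002", "pagetype": "home", "useragent": {"platform": "windows"}},
    # Hypothetical records whose shape differs from the first one.
    {"custkey": "450003", "pagetype": "cart", "useragent": {"platform": "ios"}, "coupon": "SPRING"},
    {"custkey": "450004", "pagetype": "home"},
]
conn.executemany(
    "INSERT INTO clicks (payload) VALUES (?)",
    [(json.dumps(e),) for e in events],
)


def count_platform(connection, platform):
    """Count clicks for one platform by parsing every stored document."""
    matches = 0
    # No index or column statistics exist for the nested attribute, so every
    # row has to be read and parsed: a full scan plus a JSON parse per row.
    for (payload,) in connection.execute("SELECT payload FROM clicks"):
        doc = json.loads(payload)
        if doc.get("useragent", {}).get("platform") == platform:
            matches += 1
    return matches


windows_clicks = count_platform(conn, "windows")
print(windows_clicks)
```

Adding a new attribute requires no schema change here, but every query pays the cost of reading and parsing all documents.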
  • 10. Current architectures can’t keep up
      Data Warehousing
      • Complex: manage hardware, data distribution, indexes, …
      • Limited elasticity: forklift upgrades, data redistribution, downtime
      • Costly: overprovisioning, significant care & feeding
      Hadoop
      • Complex: specialized skills, new tools
      • Limited elasticity: data redistribution, resource contention
      • Not a data warehouse: batch-oriented, limited optimization, incomplete security
  • 11. Data Pipeline / Data Lake Architecture – “ETL”
      • Source: Website Logs, Operational Systems, External Providers, Stream Data
      • Stage: S3, 10 TB
      • Data Lake: Hadoop, 30 TB
      • Stage: S3, 5 TB, Summary
      • EDW: MPP, 10 TB Disk
  • 12. ETL vs ELT for Big Data
      • Think more strategically about file formats, size, storage methods, standards
      • Processing power: tools vs. “services”
      • Pipeline: where should the analysis occur?
      • Platform
      • Unlimited processing power
      • Contention for resources
      • Support SQL for both schema-on-write and schema-on-read with full “indexing” for Structured / Semi-Structured
      • Compress, clone metadata, don’t replicate…
  • 14. Demo Scenarios
      • Clickstream Analysis (load JSON, multi-table insert)
          • Which product category is most clicked on?
          • Which product line does the customer self-identify as having the most interest in?
      • Twitter Feed (join Structured and Semi-Structured)
          • From our Twitter campaign, is there a correlation between Twitter volume and sales?
  • 15. Clickstream Example
      {
        "custkey": "450002",
        "useragent": {
          "devicetype": "pc",
          "experience": "browser",
          "platform": "windows"
        },
        "pagetype": "home",
        "productline": "none",
        "customerprofile": {
          "age": 20,
          "gender": "male",
          "customerinterests": [
            "movies",
            "fashion",
            "music"
          ]
        }
      }
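The two clickstream questions from the demo scenarios can be sketched in plain Python over records shaped like the one above. This is only an illustration of the aggregation logic, not the demo's actual SQL. The first record follows the slide; the other three records and the helper names are hypothetical.

```python
from collections import Counter

clicks = [
    # Shaped like the slide's record (productline "none" on the home page).
    {"custkey": "450002", "productline": "none",
     "customerprofile": {"customerinterests": ["movies", "fashion", "music"]}},
    # Hypothetical additional events.
    {"custkey": "450003", "productline": "television",
     "customerprofile": {"customerinterests": ["movies", "sports"]}},
    {"custkey": "450004", "productline": "television",
     "customerprofile": {"customerinterests": ["music", "movies"]}},
    {"custkey": "450005", "productline": "audio",
     "customerprofile": {"customerinterests": ["fashion"]}},
]


def most_clicked_product_line(events):
    """Return (product line, clicks) for the most-clicked line, ignoring 'none'."""
    counts = Counter(
        e["productline"] for e in events
        if e.get("productline", "none") != "none"
    )
    return counts.most_common(1)[0] if counts else None


def top_interest(events):
    """Return (interest, mentions) across all customers' self-declared interests."""
    counts = Counter(
        interest
        for e in events
        # Missing profile or interest list is treated as "no interests".
        for interest in e.get("customerprofile", {}).get("customerinterests", [])
    )
    return counts.most_common(1)[0] if counts else None


print(most_clicked_product_line(clicks))
print(top_interest(clicks))
```

The second helper has to expand the variable-length `customerinterests` array into one item per element before counting, which is the same step FLATTEN() performs in SQL.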
  • 16. Relational Processing of Semi-Structured Data
      1. Variant data type compresses storage of semi-structured data
      2. Data is analyzed during load to discern repetitive attributes within the hierarchy
      3. Repetitive attributes are columnar compressed and statistics are collected for relational query optimization
      4. SQL extensions enable relational queries against both semi-structured and structured data
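Steps 2 and 3 can be pictured with a toy sketch. This is not Snowflake's actual algorithm, which the slides do not describe. It assumes that an attribute path counts as "repetitive" when it appears in at least half of the loaded records. The `min_fraction` threshold, the function names, and the sample records are all hypothetical.

```python
from collections import defaultdict


def leaf_paths(value, prefix=""):
    """Yield (dotted path, value) for every leaf; arrays are kept whole as one leaf."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, value


def shred(records, min_fraction=0.5):
    """Split records into columns for common paths plus a per-row remainder."""
    seen = defaultdict(int)
    flat_rows = []
    for rec in records:
        row = dict(leaf_paths(rec))
        flat_rows.append(row)
        for path in row:
            seen[path] += 1

    # Assumed rule: a path is "repetitive" if enough records contain it.
    threshold = min_fraction * len(records)
    common = [p for p, n in seen.items() if n >= threshold]

    # One list per common path, with None where a record lacks the attribute.
    columns = {p: [row.get(p) for row in flat_rows] for p in common}
    # Rare attributes stay with their row instead of becoming a column.
    remainder = [
        {p: v for p, v in row.items() if p not in columns} for row in flat_rows
    ]
    return columns, remainder


records = [
    {"custkey": "450002", "useragent": {"platform": "windows"}, "customerprofile": {"age": 20}},
    {"custkey": "450003", "useragent": {"platform": "ios"}, "coupon": "SPRING"},
    {"custkey": "450004", "useragent": {"platform": "windows"}, "customerprofile": {"age": 35}},
]

columns, remainder = shred(records)
print(sorted(columns))
print(remainder)
```

Once common attributes sit in their own columns, a query on `useragent.platform` can read one list instead of parsing every document, and per-column statistics become possible.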
  • 17. FLATTEN() in Snowflake SQL (removing one level of nesting)
      FLATTEN() converts a repeated field into a set of rows:
      SELECT s.fullrow:fullName, t.value:name, t.value:age
      FROM json_data_table AS s, TABLE(FLATTEN(s.fullrow, 'children')) t
      WHERE s.fullrow:fullName = 'Mike Jones'
        AND t.value:age::integer > 6;
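To make the row expansion concrete, here is a rough Python analogue of the query above. It is a sketch of the semantics only, not of how Snowflake executes FLATTEN(). The `flatten` helper and all sample data other than the name 'Mike Jones' are hypothetical.

```python
def flatten(rows, column, path):
    """Yield (row, element) pairs, one per element of the array at row[column][path]."""
    for row in rows:
        # A document without the array contributes no output rows.
        for element in row[column].get(path, []):
            yield row, element


# Hypothetical contents of json_data_table: one JSON document per row.
json_data_table = [
    {"fullrow": {"fullName": "Mike Jones", "children": [
        {"name": "Anna", "age": 9},
        {"name": "Ben", "age": 4},
    ]}},
    {"fullrow": {"fullName": "Sara Lee", "children": [
        {"name": "Cody", "age": 12},
    ]}},
    {"fullrow": {"fullName": "Pat Kim"}},  # no "children" array at all
]

# Same shape as the SQL: join each row to its flattened children, then filter.
result = [
    (s["fullrow"]["fullName"], t["name"], t["age"])
    for s, t in flatten(json_data_table, "fullrow", "children")
    if s["fullrow"]["fullName"] == "Mike Jones" and int(t["age"]) > 6
]

print(result)
```

Each parent row is repeated once per child, which is why one level of nesting disappears and ordinary relational filters can then be applied to `name` and `age`.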
  • 18. What makes Snowflake unique for handling Semi-Structured Data?
      • Compression
      • Encryption / Role Based Authentication
      • Shredding
      • History/Results
      • Clone
      • Time Travel
      • Flatten
      • Regexp
      • No Contention
      • No Tuning
      • Infinitely scalable
      • SQL based with extremely high performance
  • 19. One Platform for all Business Data
      Diagram labels: Map-Reduce Jobs, Data Sink, Structured Storage, HDFS, Relational Databases
      Snowflake
      ✓ One System
      ✓ One Common Skillset
      ✓ Faster/Less Costly Data Conversion
      ✓ For both Structured and Semi-Structured Business Data
      Other Systems
      ✓ Multiple Systems
      ✓ Specialized Skillset
      ✓ Slower/More Costly Data Conversion
      Structured data
          Apple    101.12   250   FIH-2316
          Pear      56.22   202   IHO-6912
          Orange    98.21   600   WHQ-6090
      Semi-structured data
          {
            "firstName": "John",
            "lastName": "Smith",
            "height_cm": 167.64,
            "address": {
              "streetAddress": "21 2nd Street",
              "city": "New York",
              "state": "NY",
              "postalCode": "10021-3100"
            },