Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with BigQuery
Fast analysis of Big Data

                            Jordan Tigani, Software Engineer
01000001011011100111001101110111011001010111001
00010000001110100011011110010000001110100011010
00011001010010000001010101011011000111010001101
00101101101011000010111010001100101001000000101
00010111010101100101011100110111010001101001011
01111011011100010000001101111011001100010000001
00110001101001011001100110010100101100001000000
11101000110100001100101001000000101010101101110
01101001011101100110010101110010011100110110010
10010110000100000011000010110111001100100001000
00010001010111011001100101011100100111100101110
100101110011001000000011010000110010...........
Big Data at Google




      72 hours

      100 million gigabytes
SELECT
  kick_ass_product_plan AS strategy,
  AVG(kicking_factor) AS awesomeness
FROM
  lots_of_data
GROUP BY
  strategy
+-------------+----------------+
| strategy    | awesomeness    |
+-------------+----------------+
| "Forty-two" | 1000000.01     |
+-------------+----------------+
1 row in result set (10.2 s)
Scanned 100GB
Regular expressions on 13 billion rows...
13 Billion rows
1 TB of data in 4 tables
FAST!
Google's Internal Technology:
Dremel
MapReduce is Flexible but Heavy

[Diagram: a Master coordinating two Mappers and a Reducer, all reading from and writing to Distributed Storage]

•   Master constructs the plan and begins spinning up workers
•   Mappers read and write to distributed storage
•   Map => Shuffle => Reduce
•   Reducers read and write to distributed storage
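The Map => Shuffle => Reduce flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Google's implementation: the function names and the sample records are assumptions, and a real MapReduce writes the output of each phase to distributed storage, which is where much of the latency comes from.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group all values by key so each key reaches exactly one reducer call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Run the reducer once per key over that key's values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_by_state(record):
    # Hypothetical mapper: emit one count per record, keyed by state.
    yield record["state"], 1


births = [{"state": "CA"}, {"state": "TX"}, {"state": "CA"}]
result = reduce_phase(
    shuffle_phase(map_phase(births, count_by_state)),
    lambda key, values: sum(values),
)
print(result)
```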
MapReduce is Flexible but Heavy

[Diagram: two chained MapReduce stages (Stage 1, Stage 2), each with its own Master, two Mappers, and a Reducer, passing data through Distributed Storage]
Dremel vs MapReduce

•   MapReduce
    o Flexible batch processing
    o High overall throughput
    o High latency

•   Dremel
    o Optimized for interactive SQL queries
    o Very low latency
Dremel Architecture

[Diagram: Mixer 0 at the root, two Mixer 1 nodes below it, four Leaf nodes below those, all reading from Distributed Storage]

•   Partial Reduction
•   Diskless data flow
•   Long lived shared serving tree
•   Columnar Storage
Simple Query
SELECT
    state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
    year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10
[Diagram: the simple query executed on the serving tree, top to bottom]

Mixer 0 (root):        COUNT(*) GROUP BY state; ORDER BY count_babies DESC; LIMIT 10
   ^ O(50 states)
Mixer 1 (x2):          COUNT(*) GROUP BY state
   ^ O(50 states)
Leaf (x4):             COUNT(*) GROUP BY state; WHERE year >= 1980 AND year < 1990
   ^ O(Rows ~140M)
Distributed Storage:   SELECT state, year
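The partial reduction shown in the serving tree can be sketched as follows. This is an illustrative Python model under assumed names (`leaf`, `mixer`, `root`) and made-up sample rows, not Dremel's actual code: each leaf filters and pre-aggregates its rows, so the mixers only merge per-state counts rather than raw rows.

```python
from collections import Counter


def leaf(rows):
    """Leaf: scan rows, apply the WHERE filter, emit partial counts per state."""
    return Counter(r["state"] for r in rows if 1980 <= r["year"] < 1990)


def mixer(partials):
    """Mixer: merge partial counts; the output is bounded by the number of states."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def root(partials, limit=10):
    """Root mixer: final merge, then ORDER BY count DESC and LIMIT."""
    merged = mixer(partials)
    ordered = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:limit]


# Four hypothetical leaf shards of the natality table.
shards = [
    [{"state": "CA", "year": 1985}, {"state": "TX", "year": 1979}],
    [{"state": "CA", "year": 1989}, {"state": "NY", "year": 1980}],
    [{"state": "TX", "year": 1990}, {"state": "NY", "year": 1981}],
    [{"state": "CA", "year": 1982}],
]
leaf_results = [leaf(s) for s in shards]
level_one = [mixer(leaf_results[:2]), mixer(leaf_results[2:])]
top = root(level_one)
print(top)
```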
Modeling Data
Example: Daily Weather Station Data


                            weather_station_data
station  lat    long    mean_temp  humidity  timestamp   year  month  day
9384     33.57  86.75   89.3       .35       1351005129  2011  04     19
2857     36.77  119.72  78.5       .24       1351005135  2011  04     19
3475     40.77  73.98   68         .35       1351015930  2011  04     19
etc...
Example: Daily Weather Station Data

station,   lat,     long,     mean_temp,   year,   mon,   day
999999,    36.624,  -116.023, 63.6,        2009,   10,    9
911904,    20.963,  -156.675, 83.4,        2009,   10,    9
916890,    -18.133, 178.433,  76.9,        2009,   10,    9
943320,    -20.678, 139.488,  73.8,        2009,   10,    9




                            CSV
Organizing BigQuery Tables

[Diagram: Your Source Data is split into one table per day: October 22, October 23, October 24]
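One way to work with per-day tables is to generate their names from dates and list them in a single FROM clause. The sketch below assumes the naming pattern of the `logs.oct_24_2012_song_activities` table used in this deck; the helper names are hypothetical.

```python
from datetime import date, timedelta

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def daily_table(day, dataset="logs", suffix="song_activities"):
    """Return the bracketed table name for one day, e.g. [logs.oct_24_2012_song_activities]."""
    name = "{}_{:02d}_{}_{}".format(MONTHS[day.month - 1], day.day, day.year, suffix)
    return "[{}.{}]".format(dataset, name)


def tables_between(first_day, last_day):
    """List the daily tables covering an inclusive date range."""
    count = (last_day - first_day).days + 1
    return [daily_table(first_day + timedelta(days=n)) for n in range(count)]


tables = tables_between(date(2012, 10, 22), date(2012, 10, 24))
# Comma-separated tables in FROM are scanned together, so a query only
# touches the days it names.
query = "SELECT COUNT(*) FROM\n  " + ",\n  ".join(tables)
print(query)
```

Because each day lives in its own table, a query over three days scans three tables instead of the full history.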
Modeling Event Data: Social Music Store


                    logs.oct_24_2012_song_activities
USERNAME   ACTIVITY       Cost    SONG            ARTIST       TIMESTAMP
Michael    LISTEN                 Too Close       Alex Clare   1351065562
Michael    LISTEN                 Gangnam Style   PSY          1351105150
Jim        LISTEN                 Complications   Deadmau5     1351075720
Michael    PURCHASE       0.99    Gangnam Style   PSY          1351115962
Users Who Listened to More than 10 Songs/Day
SELECT
  UserId, COUNT(*) as ListenActivities
FROM
  [logs.oct_24_2012_song_activities]
GROUP EACH BY
  UserId
HAVING
  ListenActivities > 10
How Many Songs Listened to Total by Listeners of PSY?
SELECT
  UserId, count(*) as ListenActivities
FROM
  [logs.oct_24_2012_song_activities]
WHERE UserId IN (
     SELECT
       UserId
     FROM
       [logs.oct_24_2012_song_activities]
     WHERE artist = 'PSY')
GROUP EACH BY UserId
HAVING
  ListenActivities > 10
Modeling Event Data: Nested and Repeated Values
{"UserID" : "Michael",
 "Listens":   [
     {"TrackId":1234,"Title":"Gangnam Style",
        "Artist":"PSY","Timestamp":1351075700},
     {"TrackId":1234,"Title":"Alex Clare",
        "Artist":"Alex Clare","Timestamp":1351075700}
  ],
  "Purchases": [
     {"Track":2345,"Title":"Gangnam Style",
        "Artist":"PSY","Timestamp":1351075700,"Cost":0.99}
  ]}


                         JSON
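To show how the flat event table relates to the nested model, here is a small Python sketch that folds flat rows into one record per user and writes them as newline-delimited JSON. The function name `nest` is an assumption, and track IDs are left out because the flat table in this deck does not carry them.

```python
import json

# Rows from the flat song-activity table:
# (username, activity, cost, song, artist, timestamp)
flat_events = [
    ("Michael", "LISTEN", None, "Too Close", "Alex Clare", 1351065562),
    ("Michael", "LISTEN", None, "Gangnam Style", "PSY", 1351105150),
    ("Jim", "LISTEN", None, "Complications", "Deadmau5", 1351075720),
    ("Michael", "PURCHASE", 0.99, "Gangnam Style", "PSY", 1351115962),
]


def nest(events):
    """Fold flat events into one record per user with repeated Listens/Purchases."""
    users = {}
    for username, activity, cost, song, artist, timestamp in events:
        record = users.setdefault(
            username, {"UserID": username, "Listens": [], "Purchases": []}
        )
        item = {"Title": song, "Artist": artist, "Timestamp": timestamp}
        if activity == "PURCHASE":
            item["Cost"] = cost
            record["Purchases"].append(item)
        else:
            record["Listens"].append(item)
    return list(users.values())


records = nest(flat_events)
# Newline-delimited JSON: one complete record per line.
ndjson = "\n".join(json.dumps(record) for record in records)
print(ndjson)
```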
Which Users Have Listened to Beyonce?
SELECT
  UserID,
  COUNT(ListenActivities.artist) WITHIN RECORD
    AS song_count
FROM
  [logs.oct_24_2012_songactivities]
WHERE
  UserID IN (SELECT UserID,
             FROM [logs.oct_24_2012_songactivities]
             WHERE ListenActivities.artist = 'Beyonce');
What Position are PSY songs in our Users' Daily Playlists?
SELECT
  UserID,
  POSITION(ListenActivities.artist)
FROM
  [sample_music_logs.oct_24_2012_songactivities]
WHERE
  ListenActivities.artist = 'PSY';
Average Position of Songs by PSY in All Daily Playlists?
SELECT
  AVG(POSITION(ListenActivities.artist))
FROM
  [sample_music_logs.oct_24_2012_songactivities],
  [sample_music_logs.oct_23_2012_songactivities],
  /* etc... */
WHERE
  ListenActivities.artist = 'PSY';
Summary: Choosing a BigQuery Data Model
• "Shard" your Data Using Multiple Tables
• Source Data Files
  • CSV format
  • Newline-delimited JSON
• Using Nested and Repeated Records
  • Simplify Some Types of Queries
  • Often Matches Document Database Models
Developing with BigQuery
Upload Your Data




[Diagram: Google Cloud Storage -> BigQuery]
Load your Data into BigQuery
"jobReference":{
   "projectId":"605902584318"},
"configuration":{
   "load":{
      "destinationTable":{
         "projectId":"605902584318",
         "datasetId":"my_dataset",
         "tableId":"widget_sales"},
      "sourceUris":[
         "gs://widget-sales-data/2012080100.csv"],
      "schema":{
         "fields":[{
               "name":"widget",
               "type":"string"},
                                         ...

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
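The load request above can be assembled programmatically. The following Python sketch only builds the JSON body and the URL using the field names shown on the slide; the helper names are assumptions, the schema lists only the one field the slide shows, and the authenticated HTTP call itself is deliberately left out.

```python
import json


def load_job_body(project_id, dataset_id, table_id, source_uris, schema_fields):
    """Build a load-job body mirroring the fields shown on the slide."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "sourceUris": list(source_uris),
                "schema": {"fields": list(schema_fields)},
            }
        },
    }


def jobs_url(project_id):
    """URL the slide POSTs the job to."""
    return "https://www.googleapis.com/bigquery/v2/projects/{}/jobs".format(project_id)


body = load_job_body(
    "605902584318",
    "my_dataset",
    "widget_sales",
    ["gs://widget-sales-data/2012080100.csv"],
    [{"name": "widget", "type": "string"}],  # remaining fields are elided on the slide
)
payload = json.dumps(body, indent=2)
print("POST", jobs_url("605902584318"))
print(payload)
```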
Query Away!


"jobReference":{
    "projectId":"605902584318",
    "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales",
    "maxResults":100,
    "apiVersion":"v2"
}



POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
Libraries


•   Python   •   JavaScript
•   Java     •   Go
•   .NET     •   PHP
•   Ruby     •   Objective-C
Libraries - Example JavaScript Query

var request = gapi.client.bigquery.jobs.query({
    'projectId': project_id,
    'timeoutMs': '30000',
    // Concatenated because a quoted JavaScript string cannot span lines.
    'query': 'SELECT state, AVG(mother_age) AS theav ' +
             'FROM [publicdata:samples.natality] ' +
             'WHERE year=2000 AND ever_born=1 ' +
             'GROUP BY state ' +
             'ORDER BY theav DESC;'
});

request.execute(function(response) {
    console.log(response);
    $.each(response.result.rows, function(i, item) {
    ...
Custom Code and the Google Chart Tools API
Google Spreadsheets
Commercial Visualization Tools
Demo: Using BigQuery on BigQuery
BigQuery - Aggregate Big Data Analysis in Seconds

• Full table scans FAST
• Aggregate Queries on Massive Datasets
• Supports Flat and Nested/Repeated Data Models
• It's an API

      Get started now:
      http://developers.google.com/bigquery/
SELECT questions FROM audience

SELECT 'Thank You!'
FROM jordan

http://developers.google.com/bigquery
Schema definition




           birth_record         parents
         parent_id_mother   id
         parent_id_father   race
         plurality          age
         is_male            cigarette_use
         race               state
         weight
Schema definition

                         birth_record
                    mother_race
                    mother_age
                    mother_cigarette_use
                    mother_state
                    father_race
                    father_age
                    father_cigarette_use
                    father_state
                    plurality
                    is_male
                    race
                    weight
Tools to prepare your data

• App Engine MapReduce
• Commercial ETL tools
  • Pervasive
  • Informatica
  • Talend
• UNIX command-line
Schema definition - sharding
Tables: birth_record_2011, birth_record_2012, birth_record_2013, birth_record_2014, birth_record_2015, birth_record_2016

Each table has the same columns:
mother_race
mother_age
mother_cigarette_use
mother_state
father_race
father_age
father_cigarette_use
father_state
plurality
is_male
race
weight
Visualizing your Data
BigQuery architecture
“ If you do a table scan over a 1TB table,
  you're going to have a bad time. ”


 Anonymous
 16th century Italian Philosopher-Monk
Goal: Perform a 1 TB table scan in 1 second
Parallelize Parallelize Parallelize!


• Reading 1 TB / second from disk:
  • 10k+ disks
• Processing 1 TB / second:
  • 5k processors
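The figures above imply a per-device rate, which a few lines of Python make explicit. This assumes decimal units (1 TB = 10^12 bytes) and that the work is spread evenly across devices; both are simplifying assumptions, not statements from the talk.

```python
TB = 10**12  # bytes, decimal units assumed
MB = 10**6


def per_unit_mb_per_sec(total_bytes_per_sec, units):
    """Throughput each unit must sustain if the load is spread evenly."""
    return total_bytes_per_sec / units / MB


disk_rate = per_unit_mb_per_sec(1 * TB, 10_000)
cpu_rate = per_unit_mb_per_sec(1 * TB, 5_000)
print("each disk must read    ", disk_rate, "MB/s")
print("each processor handles ", cpu_rate, "MB/s")
```

So 10k disks corresponds to roughly 100 MB/s per disk, and 5k processors to roughly 200 MB/s per processor.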
Data access: Column Store




 Record Oriented Storage    Column Oriented Storage
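The difference between the two layouts can be illustrated by counting how many values a query has to read. This is a toy Python model with made-up rows and hypothetical helper names: a record-oriented scan touches every field of every row, while a column-oriented scan touches only the columns the query names.

```python
rows = [
    {"state": "CA", "year": 1985, "weight": 7.5},
    {"state": "TX", "year": 1986, "weight": 6.9},
    {"state": "NY", "year": 1987, "weight": 8.1},
]


def to_columns(records):
    """Pivot row records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_row_store(records):
    """A record-oriented scan reads every field of every row."""
    return sum(len(record) for record in records)


def values_read_column_store(columns, wanted):
    """A column-oriented scan reads only the requested columns."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columns(rows)
row_cost = values_read_row_store(rows)
col_cost = values_read_column_store(columns, ["state"])
print("row store reads", row_cost, "values; column store reads", col_cost)
```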
BigQuery Architecture
[Diagram: serving tree, top to bottom]

Mixer 0
Mixer 1 (Shard 0-8)    Mixer 1 (Shard 9-16)    Mixer 1 (Shard 17-24)
Shard 0    Shard 10    Shard 12    Shard 20    Shard 24
Distributed Storage (e.g. GFS)
Running your Queries
BigQuery SQL Example: Simple aggregates




SELECT COUNT(foo), MAX(foo), STDDEV(foo)
FROM ...
BigQuery SQL Example: Complex Processing




SELECT ... FROM ....
WHERE REGEXP_MATCH(url, r"\.com$")
  AND user CONTAINS 'test'
BigQuery SQL Example: Nested SELECT

SELECT COUNT(*) FROM
  (SELECT foo ..... )
GROUP BY foo
BigQuery SQL Example: Small JOIN



SELECT huge_table.foo
FROM huge_table
JOIN small_table
ON small_table.foo = huge_table.foo
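A common way to run a join where one side is small is a broadcast join: the small table is turned into a lookup and handed to every shard of the large table, so each shard joins independently. The Python sketch below assumes that strategy; the talk does not spell out the mechanism, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def build_lookup(small_table, key):
    """Index the small table by join key; this is what gets sent to every shard."""
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)
    return lookup


def join_shard(shard, lookup, key):
    """Inner join one shard of the huge table against the broadcast lookup."""
    return [huge[key] for huge in shard for _ in lookup.get(huge[key], ())]


def broadcast_join(huge_shards, small_table, key="foo"):
    lookup = build_lookup(small_table, key)
    results = []
    for shard in huge_shards:  # in a real system each shard runs in parallel
        results.extend(join_shard(shard, lookup, key))
    return results


huge_shards = [[{"foo": 1}, {"foo": 2}], [{"foo": 3}, {"foo": 1}]]
small_table = [{"foo": 1}, {"foo": 3}]
joined = broadcast_join(huge_shards, small_table)
print(joined)
```

The approach only works while the lookup fits in each worker's memory, which is why it applies to small tables.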
BigQuery Architecture: Small Join
[Diagram: serving tree for a small join, top to bottom]

Mixer 0
Mixer 1 (Shard 0-8)    Mixer 1 (Shard 17-24)
Shard 0    Shard 20    Shard 24
Distributed Storage (e.g. GFS)
Other new features!
Batch queries!

• Don't need interactive queries for some jobs?
  • priority: "BATCH"
That's it

• API
• Column-based datastore
• Full table scans FAST
• Aggregates
• Commercial tool support
• Use cases
SELECT questions FROM audience

SELECT 'Thank You!'
FROM ryan

http://developers.google.com/bigquery

@ryguyrg          http://profiles.google.com/ryan.boyd
A Little Later ...
Row   wp_namespace   Revs
1     0              53697002
2     1              6151228
3     3              5519859
4     4              4184389
5     2              3108562
6     10             1052044
7     6              877417
8     14             838940
9     5              651749
10    11             192534
11    100            148135

Underlying table:
 • Wikipedia page revision records
 • Rows: 314 million
 • Byte size: 35.7 GB
Query Stats:
 • Scanned 7 GB of data
 • <5 seconds
 • ~ 100M rows scanned / second
[Diagram: the Wikipedia query executed on the serving tree, top to bottom]

Mixer 0 (root):        COUNT (revision_id) GROUP BY wp_namespace; ORDER BY Revs DESC
Mixer 1 (x2):          COUNT (revision_id) GROUP BY wp_namespace
Leaf (x4):             COUNT (revision_id) GROUP BY wp_namespace; WHERE timestamp > CUTOFF
   ^ 10 GB / s
Distributed Storage:   SELECT wp_namespace, revision_id
"Multi-stage" Query
SELECT
  LogEdits, COUNT(contributor_id) Contributors
FROM (
  SELECT
    contributor_id,
    INTEGER(LOG10(COUNT(revision_id))) LogEdits
  FROM [publicdata:samples.wikipedia]
  GROUP EACH BY contributor_id)
GROUP BY LogEdits
ORDER BY LogEdits DESC
[Diagram: the multi-stage query executed on the serving tree, top to bottom]

Mixer 0 (root):            COUNT(contributor_id) GROUP BY LogEdits; ORDER BY LogEdits DESC
Mixer 1 (x2):              COUNT(contributor_id) GROUP BY LogEdits
Leaf (x2), Shuffler (x2):  COUNT(contributor_id) GROUP BY LogEdits; SELECT LE, Id; COUNT(*) GROUP BY contributor_id
                           Shuffle by contributor_id (N^2 GB/s)
Distributed Storage:       SELECT contributor_id
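The shuffle step can be sketched in Python as hash-partitioning by contributor_id, so that every row for one contributor reaches the same shuffler before it is counted. This is a toy model with assumed function names and made-up data, not the production mechanism; `int(math.log10(n))` stands in for `INTEGER(LOG10(...))`.

```python
import math
from collections import Counter


def shuffle(leaf_outputs, num_shufflers):
    """Route each contributor_id to one shuffler chosen by its hash."""
    buckets = [[] for _ in range(num_shufflers)]
    for output in leaf_outputs:
        for contributor_id in output:
            buckets[hash(contributor_id) % num_shufflers].append(contributor_id)
    return buckets


def shuffler_stage(bucket):
    """Inner query, then a partial outer aggregate, on one shuffler."""
    edits = Counter(bucket)  # COUNT(*) GROUP BY contributor_id
    return Counter(int(math.log10(n)) for n in edits.values())  # partial GROUP BY LogEdits


def mixer_stage(partials):
    """Merge partial LogEdits counts and order by LogEdits DESC."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return sorted(total.items(), reverse=True)


# Contributor 1 has 10 edits, contributor 2 has 1, contributor 3 has 2,
# spread across two hypothetical leaves.
leaf_outputs = [[1] * 6 + [2], [1] * 4 + [3, 3]]
buckets = shuffle(leaf_outputs, num_shufflers=2)
histogram = mixer_stage(shuffler_stage(b) for b in buckets)
print(histogram)
```

Without the shuffle, contributor 1's edits would be counted separately on each leaf and could not be bucketed correctly until merged, which is the case GROUP EACH BY addresses.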
When to use EACH

•   Shuffle definitely adds some overhead
•   Poor query performance if used incorrectly

•   GROUP BY
    o Groups << Rows => Unbalanced load
    o Example: GROUP BY state

•   GROUP EACH BY
    o Groups ~ Rows
    o Example: GROUP EACH BY user_id

Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

  • 2. Crunching Data with BigQuery Fast analysis of Big Data Jordan Tigani, Software Engineer
  • 4. Big Data at Google 72 hours 100 million gigabytes
  • 5. SELECT kick_ass_product_plan AS strategy, AVG(kicking_factor) AS awesomeness FROM lots_of_data GROUP BY strategy
  • 6. +-------------+----------------+ | strategy | awesomeness | +-------------+----------------+ | "Forty-two" | 1000000.01 | +-------------+----------------+ 1 row in result set (10.2 s) Scanned 100GB
  • 9. Regular expressions on 13 billion rows...
  • 10. 13 Billion rows 1 TB of data in 4 tables FAST! AST
  • 12. MapReduce is Flexible but Heavy • Master constructs the plan and Mapper Mapper begins spinning up workers • Mappers read and write to distributed storage Master Distributed Storage • Map => Shuffle => Reduce Reducer • Reducers read and write to distributed storage
  • 13. MapReduce is Flexible but Heavy Stage 1 Stage 2 Mapper Mapper Mapper Mapper Master Distributed Storage Master Reducer Reducer
  • 14. Dremel vs MapReduce • MapReduce o Flexible batch processing o High overall throughput o High latency • Dremel o Optimized for interactive SQL queries o Very low latency
  • 15. Mixer 0 Dremel Architecture • Partial Reduction Mixer 1 Mixer 1 • Diskless data flow • Long lived shared serving tree Leaf Leaf Leaf Leaf • Columnar Storage Distributed Storage
  • 16. Simple Query SELECT state, COUNT(*) count_babies FROM [publicdata:samples.natality] WHERE year >= 1980 AND year < 1990 GROUP BY state ORDER BY count_babies DESC LIMIT 10
  • 17. LIMIT 10 ORDER BY count_babies DESC Mixer 0 COUNT(*) GROUP BY state O(50 states) O(50 states) Mixer 1 Mixer 1 COUNT(*) GROUP BY state O(50 states) COUNT(*) Leaf Leaf Leaf Leaf GROUP BY state WHERE year >= 1980 and year < 1990 O(Rows ~140M) Distributed Storage SELECT state, year
  • 19. Example: Daily Weather Station Data weather_station_data station lat long mean_temp humidity timestamp year month day 9384 33.57 86.75 89.3 .35 1351005129 2011 04 19 2857 36.77 119.72 78.5 .24 1351005135 2011 04 19 3475 40.77 73.98 68 .35 1351015930 2011 04 19 etc...
  • 20. Example: Daily Weather Station Data station, lat, long, mean_temp, year, mon, day 999999, 36.624, -116.023, 63.6, 2009, 10, 9 911904, 20.963, -156.675, 83.4, 2009, 10, 9 916890, -18133, 178433, 76.9, 2009, 10, 9 943320, -20678, 139488, 73.8, 2009, 10, 9 CSV
  • 21. Organizing BigQuery Tables October 22 October 23 Your Source Data October 24
  • 23. Modeling Event Data: Social Music Store logs.oct_24_2012_song_activities USERNAME ACTIVITY Cost SONG ARTIST TIMESTAMP Michael LISTEN Too Close Alex Clare 1351065562 Michael LISTEN Gangnam Style PSY 1351105150 Jim LISTEN Complications Deadmau5 1351075720 Michael PURCHASE 0.99 Gangnam Style PSY 1351115962
  • 24. Users Who Listened to More than 10 Songs/Day SELECT UserId, COUNT(*) as ListenActivities FROM [logs.oct_24_2012_song_activities] GROUP EACH BY UserId HAVING ListenActivites > 10
  • 25. How Many Songs Listened to Total by Listeners of PSY? SELECT UserId, count(*) as ListenActivities FROM [logs.oct_24_2012_song_activities] WHERE UserId IN ( SELECT UserId FROM [logs.oct_24_2012_song_activities] WHERE artist = 'PSY') GROUP EACH BY UserId HAVING ListenActivites > 10
  • 26. Modeling Event Data: Nested and Repeated Values {"UserID" : "Michael", "Listens": [ {"TrackId":1234,"Title":"Gangnam Style", {"TrackId":1234,"Title":"Gangam Style", "Artist":"PSY","Timestamp":1351075700}, {"TrackId":1234,"Title":"Alex Clare", "Artist":"Alex Clare",'Timestamp":1351075700} ] "Purchases": [ {"Track":2345,"Title":"Gangnam Style", {"Track":2345,"Title":"Gangam Style", "Artist":"PSY","Timestamp":1351075700,"Cost":0.99} ]} JSON
  • 27. Which Users Have Listened to Beyonce? SELECT UserID, COUNT(ListenActivities.artist) WITHIN RECORD AS song_count FROM [logs.oct_24_2012_songactivities] WHERE UserID IN (SELECT UserID, FROM [logs.oct_24_2012_songactivities] WHERE ListenActivities.artist = 'Beyonce');
  • 28. What Position are PSY songs in our Users' Daily Playlists? SELECT UserID, POSITION(ListenActivities.artist) FROM [sample_music_logs.oct_24_2012_songactivities] WHERE ListenActivities.artist = 'PSY';
  • 29. Average Position of Songs by PSY in All Daily Playlists? SELECT AVG(POSITION(ListenActivities.artist)) FROM [sample_music_logs.oct_24_2012_songactivities], [sample_music_logs.oct_23_2012_songactivities], /* etc... */ WHERE ListenActivities.artist = 'PSY';
  • 30. Summary: Choosing a BigQuery Data Model • "Shard" your Data Using Multiple Tables • Source Data Files • CSV format • Newline-delimited JSON • Using Nested and Repeated Records • Simplify Some Types of Queries • Often Matches Document Database Models
  • 33. Upload Your Data Google Cloud BigQuery Storage
  • 34. Load your Data into BigQuery "jobReference":{ "projectId":"605902584318"}, "configuration":{ "load":{ "destinationTable":{ "projectId":"605902584318", "datasetId":"my_dataset", "tableId":"widget_sales"}, "sourceUris":[ "gs://widget-sales-data/2012080100.csv"], "schema":{ "fields":[{ "name":"widget", "type":"string"}, ... POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
  • 35. Query Away! "jobReference":{ "projectId":"605902584318", "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales", "maxResults":100, "apiVersion":"v2" } POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
  • 36. Libraries • Python • JavaScript • Java • Go • .NET • PHP • Ruby • Objective-C
  • 37. Libraries - Example JavaScript Query var request = gapi.client.bigquery.jobs.query({ 'projectId': project_id, 'timeoutMs': '30000', 'query': 'SELECT state, AVG(mother_age) AS theav FROM [publicdata:samples.natality] WHERE year=2000 AND ever_born=1 GROUP BY state ORDER BY theav DESC;' }); request.execute(function(response) { console.log(response); $.each(response.result.rows, function(i, item) { ...
  • 38. Custom Code and the Google Chart Tools API
  • 41. Demo: Using BigQuery on BigQuery
  • 42. BigQuery - Aggregate Big Data Analysis in Seconds • Full table scans FAST • Aggregate Queries on Massive Datasets • Supports Flat and Nested/Repeated Data Models • It's an API Get started now: http://developers.google.com/bigquery/
  • 43. SELECT questions FROM audience SELECT 'Thank You!' FROM jordan http://developers.google.com/bigquery
  • 45. Schema definition birth_record parents parent_id_mother id parent_id_father race plurality age is_male cigarette_use race state weight
  • 46. Schema definition birth_record mother_race mother_age mother_cigarette_use mother_state father_race father_age father_cigarette_use father_state plurality is_male race weight
  • 47. Tools to prepare your data • App Engine MapReduce • Commercial ETL tools • Pervasive • Informatica • Talend • UNIX command-line
  • 48. Schema definition - sharding. Tables: birth_record_2011, birth_record_2012, birth_record_2013, birth_record_2014, birth_record_2015, birth_record_2016. Fields in each table: mother_race mother_age mother_cigarette_use mother_state father_race father_age father_cigarette_use father_state plurality is_male race weight
  • 51. “ If you do a table scan over a 1TB table, you're going to have a bad time. ” Anonymous 16th century Italian Philosopher-Monk
  • 52. Goal: Perform a 1 TB table scan in 1 second. Parallelize Parallelize Parallelize! • Reading 1 TB/second from disk: 10k+ disks • Processing 1 TB/second: 5k processors
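The per-unit rates implied by these numbers are easy to check. A back-of-the-envelope sketch, assuming 1 TB means 10^12 bytes and that work is spread evenly across units; the function name is hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte, in bytes
MB = 10**6


def per_unit_mb_per_sec(total_bytes_per_sec, units):
    """Throughput each unit must sustain if the load is split evenly."""
    return total_bytes_per_sec / units / MB


if __name__ == "__main__":
    # 10k disks and 5k processors are the slide's figures.
    print("per disk (MB/s):", per_unit_mb_per_sec(TB, 10_000))
    print("per processor (MB/s):", per_unit_mb_per_sec(TB, 5_000))
```

Under these assumptions each disk has to deliver about 100 MB/s and each processor has to process about 200 MB/s.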
  • 53. Data access: Column Store Record Oriented Storage Column Oriented Storage
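The difference between the two layouts can be sketched in a few lines. This is a toy in-memory illustration, not BigQuery's storage format: it only counts how many stored values an aggregate over one field has to touch in each layout. The sample rows and function names are hypothetical; the field names are borrowed from the natality examples in this deck.

```python
# Hypothetical sample rows.
ROWS = [
    {"state": "CA", "mother_age": 30, "weight": 7.2},
    {"state": "NY", "mother_age": 20, "weight": 6.9},
    {"state": "TX", "mother_age": 40, "weight": 8.1},
]


def to_columns(rows):
    """Pivot record-oriented rows into one list of values per field."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def avg_row_oriented(rows, field):
    """Average one field; every value of every row is read along the way."""
    values_read = sum(len(row) for row in rows)
    average = sum(row[field] for row in rows) / len(rows)
    return average, values_read


def avg_column_oriented(columns, field):
    """Average one field; only that field's column is read."""
    column = columns[field]
    return sum(column) / len(column), len(column)


if __name__ == "__main__":
    print(avg_row_oriented(ROWS, "mother_age"))
    print(avg_column_oriented(to_columns(ROWS), "mother_age"))
```

Both calls return the same average, but the column layout touches one value per row instead of one value per field per row.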
  • 54. BigQuery Architecture Mixer 0 Mixer 1 Mixer 1 Mixer 1 Shard 0-8 Shard 9-16 Shard 17-24 Shard 0 Shard 10 Shard 12 Shard 20 Shard 24 Distributed Storage (e.g. GFS)
  • 56. BigQuery SQL Example: Simple aggregates SELECT COUNT(foo), MAX(foo), STDDEV(foo) FROM ...
  • 57. BigQuery SQL Example: Complex Processing SELECT ... FROM .... WHERE REGEXP_MATCH(url, ".com$") AND user CONTAINS 'test'
  • 58. BigQuery SQL Example: Nested SELECT SELECT COUNT(*) FROM (SELECT foo ..... ) GROUP BY foo
  • 59. BigQuery SQL Example: Small JOIN SELECT huge_table.foo FROM huge_table JOIN small_table ON small_table.foo = huge_table.foo
  • 60. BigQuery Architecture: Small Join Mixer 0 Mixer 1 Mixer 1 Shard 0-8 Shard 17-24 Shard 0 Shard 20 Shard 24 Distributed Storage (e.g. GFS)
  • 62. Batch queries! • Don't need interactive queries for some jobs? • priority: "BATCH"
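As a rough illustration, a batch query can be requested by setting the priority in the job configuration. The sketch below only builds the request body; it assumes the v2 `jobs` resource layout shown on the load slide, with the query under `configuration.query`. The function name `build_query_job` is hypothetical, and sending the request (authentication, HTTP POST) is left out.

```python
import json


def build_query_job(project_id, query, batch=False):
    """Build a query job body; BATCH priority is for non-interactive work."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "query": {
                "query": query,
                "priority": "BATCH" if batch else "INTERACTIVE",
            }
        },
    }


if __name__ == "__main__":
    body = build_query_job(
        "605902584318",  # project ID taken from the slides
        "SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales",
        batch=True,
    )
    print(json.dumps(body, indent=2))
```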
  • 63. That's it • API • Column-based datastore • Full table scans FAST • Aggregates • Commercial tool support • Use cases
  • 64. SELECT questions FROM audience SELECT 'Thank You!' FROM ryan http://developers.google.com/bigquery @ryguyrg http://profiles.google.com/ryan.boyd
  • 66. Data access: Column Store Record Oriented Storage Column Oriented Storage
  • 67. A Little Later ... Underlying table: • Wikipedia page revision records • Rows: 314 million • Byte size: 35.7 GB Query Stats: • Scanned 7 GB of data • <5 seconds • ~100M rows scanned / second Result:
      Row | wp_namespace | Revs
      1 | 0 | 53697002
      2 | 1 | 6151228
      3 | 3 | 5519859
      4 | 4 | 4184389
      5 | 2 | 3108562
      6 | 10 | 1052044
      7 | 6 | 877417
      8 | 14 | 838940
      9 | 5 | 651749
      10 | 11 | 192534
      11 | 100 | 148135
  • 68. ORDER BY Revs DESC Mixer 0 COUNT (revision_id) GROUP BY wp_namespace Mixer 1 Mixer 1 COUNT (revision_id) GROUP BY wp_namespace Leaf Leaf Leaf Leaf COUNT (revision_id) GROUP BY wp_namespace WHERE timestamp > CUTOFF 10 GB / s Distributed Storage SELECT wp_namespace, revision_id
  • 69. "Multi-stage" Query SELECT LogEdits, COUNT(contributor_id) Contributors FROM ( SELECT contributor_id, INTEGER(LOG10(COUNT(revision_id))) LogEdits FROM [publicdata:samples.wikipedia] GROUP EACH BY contributor_id) GROUP BY LogEdits ORDER BY LogEdits DESC
  • 70. ORDER BY LogEdits DESC Mixer 0 COUNT(contributor_id) GROUP BY LogEdits Mixer 1 Mixer 1 COUNT(contributor_id) GROUP BY LogEdits COUNT(contributor_id) Leaf Leaf Shuffler Shuffler GROUP BY LogEdits N^2 Shuffle by SELECT LE, Id GB/s contributor_id COUNT(*) GROUP BY contributor_id Distributed Storage SELECT contributor_id
  • 71. When to use EACH • Shuffle definitely adds some overhead • Poor query performance if used incorrectly • GROUP BY o Groups << Rows => Unbalanced load o Example: GROUP BY state • GROUP EACH BY o Groups ~ Rows o Example: GROUP EACH BY user_id
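The trade-off can be sketched with a toy model of the two strategies. This is a simplified illustration, not how Dremel is implemented: the plain `GROUP BY` path has each leaf build a partial table that a mixer merges, while the `GROUP EACH BY` path hashes every key to one shuffler so that no single node has to hold all groups. The function names and the CRC32 partitioning are assumptions made for the example.

```python
import zlib
from collections import Counter


def count_group_by(leaves):
    """Each leaf aggregates its own rows; a mixer merges the partial counts."""
    partials = [Counter(rows) for rows in leaves]
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    # The mixer ends up holding every group at once.
    return dict(merged), len(merged)


def count_group_each_by(leaves, num_shufflers):
    """Rows are shuffled by key hash; each shuffler owns a disjoint key set."""
    shufflers = [Counter() for _ in range(num_shufflers)]
    for rows in leaves:
        for key in rows:
            owner = zlib.crc32(str(key).encode()) % num_shufflers
            shufflers[owner][key] += 1
    merged = {}
    for shuffler in shufflers:
        merged.update(shuffler)  # key sets are disjoint, so nothing is overwritten
    largest = max(len(shuffler) for shuffler in shufflers)
    return merged, largest


if __name__ == "__main__":
    leaves = [["u1", "u2", "u3"], ["u1", "u4", "u5"]]
    print(count_group_by(leaves))
    print(count_group_each_by(leaves, num_shufflers=2))
```

Both functions produce the same counts. The second return value shows the difference: the mixer holds all groups, whereas the largest shuffler holds only its share, at the cost of moving every row through the shuffle.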