Cloud
Introduction to Google BigQuery
Presented by: Csaba Toth
Our sponsors
Disclaimer
Disclaimer – cont.
Goal
• Be able to issue queries
• Preferably in an SQL dialect
• Over Big Data
• With as short a response time as possible
• Preferably through an interactive web interface (so there is nothing to install)
Agenda
• Big Data
• Brief look at Hadoop, Hive and Spark
• Row-based data stores vs. column data stores
• Google BigQuery
• Demo
Big Data
Wikipedia: “collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing
applications”
Examples: (Wikibon - A Comprehensive List of Big Data Statistics)
• 100 Terabytes of data is uploaded to Facebook every day
• Facebook stores, processes, and analyzes more than 30 petabytes of user-generated data
• Twitter generates 12 Terabytes of data every day
• LinkedIn processes and mines Petabytes of user data to power the "People You May
Know" feature
• YouTube users upload 48 hours of new video content every minute of the day
• Decoding of the human genome used to take 10 years. Now it can be done in 7 days
Little Hadoop history
“The Google File System” - October 2003
• http://labs.google.com/papers/gfs.html
• Describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications; it runs on inexpensive commodity hardware and delivers high aggregate performance
“MapReduce: Simplified Data Processing on Large Clusters” - December 2004
• http://queue.acm.org/detail.cfm?id=988408
• Describes a programming model and an implementation for processing large data sets
Hadoop
• Hadoop is an open-source software
framework that supports data-
intensive distributed applications
• A Hadoop cluster is composed of a
single master node and multiple
worker nodes
Hadoop
Has two main services:
1. Storing large amounts of data: HDFS
– Hadoop Distributed File System
2. Processing large amounts of data:
implementing the MapReduce
programming model
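The MapReduce programming model can be illustrated with a tiny in-memory word count. This is only a sketch of the idea in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in a single process, whereas Hadoop distributes the same phases across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into one result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of text documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data big queries", "big clusters"]))
```

Each map call only sees one record and each reduce call only sees one key, which is what lets a framework run them in parallel on different nodes.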
HDFS
[Diagram: a name node with a metadata store sits above three data nodes (Node 1, Node 2, Node 3); each node holds a copy of Block A and Block B]
Job / task management
[Diagram: the name node and the JobTracker exchange heartbeat signals and other communication with three data nodes; each data node runs a TaskTracker with one map and one reduce task (Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3)]
Map-Reduce
Hadoop vs. RDBMS
Aspect | Hadoop / MapReduce | RDBMS
Size of data | Petabytes | Gigabytes
Integrity of data | Low | High (referential, typed)
Data schema | Dynamic | Static
Access method | Batch | Interactive and batch
Scaling | Linear | Nonlinear (worse than linear)
Data structure | Unstructured | Structured
Normalization of data | Not required | Required
Query response time | Has latency (due to batch processing) | Can be near immediate
Apache Hive
[Architecture diagram, from the data sources up to the user interface]
• Data sources: log data, RDBMS
• Data integration layer: Flume, Sqoop
• Storage layer (HDFS): row and columnar data, file data
• Computing layer (MapReduce)
• Advanced query engines (Hive, Pig); data mining (Pegasus, Mahout); index and search (Lucene)
• DB drivers (Hive driver)
• GUI (web interface, RESTful API, JavaScript)
• Client interfaces: JDBC, ODBC, JS
• Alongside the layers: system management; distribution coordination (ZooKeeper)
Apache Hive UI
Beyond Apache Hive
Goal: decrease latency
• YARN: the “next generation Hadoop”, improves performance in many respects (resource management and allocation, …)
• Hadoop-distribution-specific solutions, e.g. Cloudera Impala: an MPP (massively parallel processing) SQL query engine built on Hadoop
Apache Spark
• Cluster computing framework with multi-stage in-memory primitives
• Open source, originated at UC Berkeley
• In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, multi-stage in-memory primitives can provide up to a 100x performance increase for certain applications
• Can run on YARN and read from HDFS, but requires neither: it also has a standalone cluster mode and works with other storage systems
Spark and Hadoop
Storing data: row stores
• Traditional RDBMSs, and often document stores too, are row oriented
• The engine stores and retrieves whole rows from disk (unless indexes help)
• A row is a collection of column cell values stored together
• Rows are materialized on disk
Row stores
Row cells are stored together on disk:
id | scientist | death_by | movie_name
1 | Reinhardt | Maximillian | The Black Hole
2 | Tyrell | Roy Batty | Blade Runner
3 | Hammond | Dinosaur | Jurassic Park
4 | Soong | Lore | Star Trek: TNG
5 | Morbius | His mind | Forbidden Planet
6 | Dyson | Skynet | Terminator 2: Judgment Day
Row stores
• Not so great for wide rows
• If only a small subset of columns is queried, reading the entire row wastes I/O
• (Indexing strategies can help, but I don’t have time to cover them)
Row stores
Bad case scenario:
• select sum(bigint_column) from table
• One million rows in the table
• Average row length is 1 KiB
The select needs only one bigint column (8 bytes per row)
• The entire row must be read anyway
• Reads ~1 GiB of data for ~8 MiB of column data
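The arithmetic behind these figures can be checked with a few lines of Python. This is a minimal sketch using only the numbers from the slide (one million rows, 1 KiB average row length, 8-byte bigint); the function and constant names are made up for illustration.

```python
ROWS = 1_000_000      # rows in the table
ROW_BYTES = 1024      # average row length: 1 KiB
BIGINT_BYTES = 8      # size of one bigint value

MIB = 1024 ** 2
GIB = 1024 ** 3


def bytes_read_whole_rows(rows=ROWS, row_bytes=ROW_BYTES):
    """Bytes read when every row has to be read in full."""
    return rows * row_bytes


def bytes_read_single_column(rows=ROWS, value_bytes=BIGINT_BYTES):
    """Bytes that the one queried column actually occupies."""
    return rows * value_bytes


whole_rows = bytes_read_whole_rows()
one_column = bytes_read_single_column()

print(f"whole rows: {whole_rows / GIB:.2f} GiB")   # whole rows: 0.95 GiB
print(f"one column: {one_column / MIB:.2f} MiB")   # one column: 7.63 MiB
print(f"ratio: {whole_rows // one_column}x")       # ratio: 128x
```

Reading whole rows moves 128 times more data than the query needs, which is the gap a column store closes.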
Column stores
• Data is organized by columns instead of rows
• Rows are often not materialized in storage; an assembled row exists only in memory
• Each row still has some sort of “row id”
Column stores
• A row is a collection of column values that are associated with one another
• Associated: every row has some type of “row id”
• Can still produce row output (assembling a row may be complex, though, under the hood)
Column stores
Stores each COLUMN on disk; each column has a file or segment on disk:
id: 1, 2, 3, 4, 5, 6
title: Mrs. Doubtfire, The Big Lebowski, The Fly, Steel Magnolias, The Birdcage, Erin Brockovich
actor: Robin Williams, Jeff Bridges, Jeff Goldblum, Dolly Parton, Nathan Lane, Julia Roberts
genre: Comedy, Comedy, Horror, Drama, Comedy, Drama
Values at the same position belong to the same row: row id = 1 is the first value of every column, row id = 6 the last
Natural order may be unusual
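A small sketch of the same idea in Python, assuming the position in each column list plays the role of the “row id”. The helper names (to_columns, assemble_row) are hypothetical, and a real column store keeps each column in its own file or segment rather than in a Python list.

```python
rows = [
    {"id": 1, "title": "Mrs. Doubtfire", "actor": "Robin Williams", "genre": "Comedy"},
    {"id": 2, "title": "The Big Lebowski", "actor": "Jeff Bridges", "genre": "Comedy"},
    {"id": 3, "title": "The Fly", "actor": "Jeff Goldblum", "genre": "Horror"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def assemble_row(columns, position):
    """Rebuild one row by reading the same position from every column."""
    return {name: values[position] for name, values in columns.items()}


columns = to_columns(rows)

# A single-column query touches only one list.
comedy_count = sum(1 for genre in columns["genre"] if genre == "Comedy")
print(comedy_count)

# Producing row output means visiting every column.
print(assemble_row(columns, 2))
```

Counting comedies reads one list, while rebuilding a full row has to touch all four, which mirrors the trade-off between the two layouts.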
Column stores
• Column compression can be way more
efficient than row based compression
(sometimes 10:1 to 30:1 ratio)
• Compression: RLE, Integer packing,
dictionaries and lookup, other…
• Reduces both storage and IO (thus
response time)
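Two of the techniques named above, run-length encoding (RLE) and dictionary encoding, are easy to sketch in Python on the genre column from the earlier example. This is an illustration of the principle only; the function names are made up, and real column stores use far more compact binary formats.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: collapse each run of equal values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def dictionary_encode(values):
    """Dictionary encoding: store each distinct value once, plus small integer codes."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]

print(rle_encode(genre))
print(rle_encode(sorted(genre)))   # sorted data gives longer runs
print(dictionary_encode(genre))
```

Both work well on columns because a column holds values of one type with many repeats; sorting the column first makes the runs longer still.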
Column stores
Best case scenario:
• select sum(bigint_column) from table
• One million rows in the table
• Average row length is 1 KiB
The select needs only one bigint column (8 bytes per row)
• Only a single column is read from disk
• Reads ~8 MiB of column data, even less with compression
Column stores
Bad case scenario:
select *
from long_wide_table
where order_line_id = 34653875;
• Accessing all columns doesn’t save anything and could be even more expensive than with a row store
• Not ideal for tables with few columns
Column stores
Updating and deleting rows is expensive
• Some column stores are append only
• Others just strongly discourage writes
• Some split storage into row and column
areas
Column / Row stores
• RDBMSs provide ACID capabilities
• Row stores mainly use tree-style indexes
• A B-tree-derived index structure provides very fast binary search as long as it fits into memory
• Very large datasets end up with unmanageably big indexes
• Column stores: bitmap indexing, which is very expensive to update
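Bitmap indexing can be sketched with Python integers used as bit sets: one bitmap per distinct value, with bit i set when row i holds that value. The names below are made up for illustration, and production systems use compressed bitmap formats rather than plain integers.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose bit i is set when row i has that value."""
    index = {}
    for position, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_rows(bitmap):
    """Return the row positions whose bit is set in the bitmap."""
    positions = []
    position = 0
    while bitmap:
        if bitmap & 1:
            positions.append(position)
        bitmap >>= 1
        position += 1
    return positions


genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]
index = build_bitmap_index(genre)

# A bitwise OR answers: genre = 'Comedy' OR genre = 'Horror'
comedy_or_horror = index["Comedy"] | index["Horror"]
print(matching_rows(comedy_or_horror))   # [0, 1, 2, 4]
```

Predicates combine with cheap bitwise operations, but changing a single cell means rewriting the bitmaps of both the old and the new value, which hints at why updates are costly.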
BigQuery history
“Dremel: Interactive Analysis of Web-Scale Datasets” - 2010, describes a column store / retrieval system
• https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
Presentation with illustration about principles used in Dremel, from Google
• http://www.cs.berkeley.edu/~istoica/classes/cs294/11/notes/12-sameer-dremel.pdf
BigQuery
• A service that enables interactive analysis
of massively large datasets
• Based on Dremel, a scalable, interactive
ad hoc query system for analysis of read-
only nested data
• Works in conjunction with Google Cloud Storage
• Has a RESTful web service interface
BigQuery
• You can issue SQL queries over big
data
• Interactive web interface
• With as short a response time as possible
• Scales automatically under the hood
BigQuery
SaaS (/ PaaS)
Interfacing:
• REST API
• Web console
• Command line tools
• Language libraries
Insert only
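As a rough sketch of the REST interface, the snippet below builds (but does not send) a request for the BigQuery v2 jobs.query method using only the Python standard library. The project id and access token are placeholders, the helper name build_query_request is made up, and dry-run mode is chosen here only so that nothing would be executed or billed if the request were sent. In practice the language libraries and command line tools wrap this call.

```python
import json
import urllib.request

# Endpoint of the BigQuery v2 jobs.query REST method.
QUERY_URL = "https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"


def build_query_request(project_id, sql, access_token, dry_run=True):
    """Build an HTTP POST request for jobs.query without sending it."""
    body = {
        "query": sql,
        "useLegacySql": False,   # use standard SQL
        "dryRun": dry_run,       # validate and estimate only, do not run
    }
    return urllib.request.Request(
        QUERY_URL.format(project_id=project_id),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    project_id="my-project",        # hypothetical project id
    sql="SELECT 1",
    access_token="ACCESS_TOKEN",    # placeholder; a real OAuth 2.0 token is needed
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```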
Demo!
Wikipedia public dataset
Natalities public dataset
Names (uploaded)
Google Genomics
https://cloud.google.com/genomics/
https://cloud.google.com/genomics/v1/public-data
https://cloud.google.com/bigquery/web-ui-quickstart
https://cloud.google.com/bigquery/query-reference
Future thoughts
How to visualize data
• Possibly using Google Charts
• BigQuery alongside Google Maps
Playing with genomics data – requires
some bio-informatics knowledge
Thank you!
Questions?
Resources
• Slides: http://www.slideshare.net/tothc
• Contact: http://www.meetup.com/CCalJUG/
• Csaba Toth: Introduction to Hadoop and MapReduce - http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-reduce
• Justin Swanhart: Introduction to column stores - http://www.slideshare.net/MySQLGeek/intro-to-column-stores
• Jan Steemann: Column-oriented databases - http://www.slideshare.net/arangodb/introduction-to-column-oriented-databases
• https://anonymousbi.wordpress.com/2012/11/02/hadoop-beginners-tutorial-on-ubuntu/
• https://www.capgemini.com/blog/capping-it-off/2012/01/what-is-hadoop
• http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
• https://www.cloudera.com/content/cloudera/en/documentation/core/latest/PDF/cloudera-impala.pdf
• https://www.keithrozario.com/2012/07/google-bigquery-wikipedia-dataset-malaysia-singapore.html
• https://cloud.google.com/bigquery/web-ui-quickstart
• https://cloud.google.com/bigquery/query-reference
• https://github.com/googlegenomics/getting-started-bigquery
• https://github.com/googlegenomics/bigquery-examples
• https://github.com/googlegenomics/readthedocs
Pricing
https://cloud.google.com/bigquery/pricing
Storage: $0.020 per GB per month
Queries: $5 per TB processed
Streaming inserts: $0.01 per 200 MiB (rows counted as at least 1 KiB each)
Columns are compressed, but the price is based on the uncompressed size
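A back-of-the-envelope calculator for the prices on this slide might look like the sketch below. The rates are the ones quoted above and may no longer be current; the decimal unit sizes and the function names are assumptions made for illustration.

```python
STORAGE_USD_PER_GB_MONTH = 0.020   # rate quoted on the slide
QUERY_USD_PER_TB = 5.0             # rate quoted on the slide

GB = 10 ** 9    # assumption: decimal units
TB = 10 ** 12


def monthly_storage_cost(uncompressed_bytes):
    """Storage cost per month, billed on the uncompressed size."""
    return uncompressed_bytes / GB * STORAGE_USD_PER_GB_MONTH


def query_cost(bytes_processed):
    """Cost of one query, based on the bytes it processes."""
    return bytes_processed / TB * QUERY_USD_PER_TB


# Example: 500 GB stored, plus one query that processes 2 TB.
print(f"storage: ${monthly_storage_cost(500 * GB):.2f} per month")
print(f"query:   ${query_cost(2 * TB):.2f}")
```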
Tips
• Storage dominates the costs
• Queries still need restricting: use as few columns as possible
• BigQuery scans the full columns that are involved in the query
• Do not use select *
• LIMIT the result set
• Use result caching
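The effect of selecting fewer columns can be sketched with a simple full-column scan model. The table below and its column sizes are entirely hypothetical, and bytes_scanned is an illustrative helper, not a BigQuery API.

```python
# Hypothetical table: column name -> uncompressed size in bytes.
COLUMN_BYTES = {
    "id": 8 * 10**9,
    "title": 40 * 10**9,
    "body": 900 * 10**9,
    "created": 8 * 10**9,
}


def bytes_scanned(selected_columns, column_bytes=COLUMN_BYTES):
    """Full-column scan model: every column the query references is read in full."""
    if selected_columns == "*":
        selected_columns = list(column_bytes)
    return sum(column_bytes[name] for name in selected_columns)


narrow = bytes_scanned(["id", "title"])
everything = bytes_scanned("*")

print(f"id, title: {narrow / 10**9:.0f} GB scanned")
print(f"select *:  {everything / 10**9:.0f} GB scanned")
```

In this made-up table, naming two columns scans 48 GB while select * scans 956 GB, so the narrow query processes roughly a twentieth of the data.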
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
Andre Hora
 
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Eric D. Schabell
 

Introduction to Google BigQuery

  • 5. Goal • Being able to issue queries • Preferably in an SQL dialect • Over Big Data • With as short a response time as possible • Preferably through an interactive web interface (thus no need to install anything)
  • 6. Agenda • Big Data • Brief look at Hadoop, Hive and Spark • Row based data store vs. Column data store • Google BigQuery • Demo
  • 7. Big Data Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” Examples: (Wikibon - A Comprehensive List of Big Data Statistics) • 100 Terabytes of data are uploaded to Facebook every day • Facebook stores, processes, and analyzes more than 30 Petabytes of user-generated data • Twitter generates 12 Terabytes of data every day • LinkedIn processes and mines Petabytes of user data to power the "People You May Know" feature • YouTube users upload 48 hours of new video content every minute of the day • Decoding of the human genome used to take 10 years. Now it can be done in 7 days
  • 8. Little Hadoop history “The Google File System” - October 2003 • http://labs.google.com/papers/gfs.html – describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications that runs on inexpensive commodity hardware and delivers high aggregate performance “MapReduce: Simplified Data Processing on Large Clusters” - April 2004 • http://queue.acm.org/detail.cfm?id=988408 – describes a programming model and an implementation for processing large data sets.
  • 9. Hadoop • Hadoop is an open-source software framework that supports data-intensive distributed applications • A Hadoop cluster is composed of a single master node and multiple worker nodes
  • 10. Hadoop Has two main services: 1. Storing large amounts of data: HDFS – Hadoop Distributed File System 2. Processing large amounts of data: implementing the MapReduce programming model
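The MapReduce programming model named on this slide can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce steps using the classic word-count task; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this sketch and are not Hadoop APIs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big queries", "big tables"])
print(counts)
```

In a real cluster the map and reduce calls would run on different worker nodes and the shuffle would move data over the network; here everything runs in one process so the data flow stays visible.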
  • 11. HDFS [Diagram: a name node (metadata store) and three data nodes, Node 1 to Node 3, each shown holding Block A and Block B]
  • 12. Job / task management [Diagram: a name node with the JobTracker, and three data nodes each running a TaskTracker with one map and one reduce task (Map 1/Reduce 1, Map 2/Reduce 2, Map 3/Reduce 3); heartbeat signals and communication flow between them. Label: Map-Reduce]
  • 13. Hadoop vs. RDBMS
                              Hadoop / MapReduce                      RDBMS
      Size of data            Petabytes                               Gigabytes
      Integrity of data       Low                                     High (referential, typed)
      Data schema             Dynamic                                 Static
      Access method           Batch                                   Interactive and batch
      Scaling                 Linear                                  Nonlinear (worse than linear)
      Data structure          Unstructured                            Structured
      Normalization of data   Not required                            Required
      Query response time     Has latency (due to batch processing)   Can be near immediate
  • 14. Apache Hive [Diagram of the Hadoop stack. Sources: log data, RDBMS. Data integration layer: Flume, Sqoop. Storage layer (HDFS): row and columnar data, file data. Computing layer (MapReduce). Above it: advanced query engines (Hive, Pig), data mining (Pegasus, Mahout), index and search (Lucene), DB drivers (Hive driver), GUI (web interface, RESTful API, JavaScript). Client access: JDBC, ODBC, JS. Alongside: system management, distribution coordination (ZooKeeper)]
  • 17. Beyond Apache Hive Goals: decrease latency • YARN: the “next generation Hadoop”, improves performance in many respects (resource management and allocation, …) • Hadoop distribution specific solution: e.g. Cloudera Impala, MPP SQL Query engine, based on Hadoop
  • 18. Apache Spark • Cluster computing framework with multi-stage in-memory primitives • Open Source, originates from Berkeley • In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, multi-stage in-memory primitives can provide up to 100x performance increase • Commonly deployed on YARN and HDFS (it can also run in standalone mode)
  • 21. Storing data: row stores • Traditional RDBMSs, and often document stores as well, are row oriented • The engine stores and retrieves rows from disk (unless indexes help) • A row is a collection of column cell values stored together • Rows are materialized on disk
  • 22. Row stores: row cells are stored together on disk
      id   scientist   death_by     movie_name
      1    Reinhardt   Maximilian   The Black Hole
      2    Tyrell      Roy Batty    Blade Runner
      3    Hammond     Dinosaur     Jurassic Park
      4    Soong       Lore         Star Trek: TNG
      5    Morbius     His mind     Forbidden Planet
      6    Dyson       Skynet       Terminator 2: Judgment Day
  • 23. Row stores • Not so great for wide rows • If only a small subset of columns is queried, reading the entire row wastes IO • (Indexing strategies can help but I don’t have time to cover them)
  • 24. Row stores Bad case scenario: • select sum(bigint_column) from table • Million rows in table • Average row length is 1 KiB • The select reads one bigint column (8 bytes per row) • Entire row must be read • Reads ~1 GiB of data for ~8 MiB of column data
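The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the slide (one million rows, 1 KiB per row, an 8-byte bigint); the variable names are illustrative.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

rows = 1_000_000          # "million rows in table"
row_bytes = 1 * KIB       # average row length
bigint_bytes = 8          # size of the one column the query needs

row_store_read = rows * row_bytes    # a row store must read every full row
useful_bytes = rows * bigint_bytes   # bytes the query actually needs

print(f"row store reads {row_store_read / GIB:.2f} GiB")   # 0.95 GiB
print(f"column data is  {useful_bytes / MIB:.2f} MiB")     # 7.63 MiB
print(f"read amplification: {row_store_read // useful_bytes}x")  # 128x
```

So the "~1 GiB for ~8 MiB" on the slide is a rounding of roughly 0.95 GiB read for 7.63 MiB of useful data, a factor of 128.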
  • 25. Column stores • Data is organized by columns instead of rows • “Non-material world”: rows are often not materialized in storage and exist only in memory • Each row still has some sort of “row id”
  • 26. Column stores • A row is a collection of column values that are associated with one another • Associated: every row has some type of “row id” • Can still produce row output (assembling a row may be complex though, under the hood)
  • 27. Column stores: stores each COLUMN on disk
      id:     1 | 2 | 3 | 4 | 5 | 6
      title:  Mrs. Doubtfire | The Big Lebowski | The Fly | Steel Magnolias | The Birdcage | Erin Brockovich
      actor:  Robin Williams | Jeff Bridges | Jeff Goldblum | Dolly Parton | Nathan Lane | Julia Roberts
      genre:  Comedy | Comedy | Horror | Drama | Comedy | Drama
      (row id = 1 is the first value of each column, row id = 6 the last)
      Natural order may be unusual
      Each column has a file or segment on disk
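As a rough illustration of the layout on this slide, the sketch below turns the movie table from row form into one list per column and then reassembles a row by position. The position in each list plays the role of the "row id"; `to_columns` and `assemble_row` are hypothetical helper names, and a real column store would of course use files or segments on disk rather than Python lists.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

rows: List[Row] = [
    {"id": 1, "title": "Mrs. Doubtfire", "actor": "Robin Williams", "genre": "Comedy"},
    {"id": 2, "title": "The Big Lebowski", "actor": "Jeff Bridges", "genre": "Comedy"},
    {"id": 3, "title": "The Fly", "actor": "Jeff Goldblum", "genre": "Horror"},
    {"id": 4, "title": "Steel Magnolias", "actor": "Dolly Parton", "genre": "Drama"},
    {"id": 5, "title": "The Birdcage", "actor": "Nathan Lane", "genre": "Comedy"},
    {"id": 6, "title": "Erin Brockovich", "actor": "Julia Roberts", "genre": "Drama"},
]


def to_columns(table: List[Row]) -> Dict[str, List[Any]]:
    """Store the table column by column: one list per column name."""
    columns: Dict[str, List[Any]] = {}
    for row in table:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def assemble_row(columns: Dict[str, List[Any]], position: int) -> Row:
    """Rebuild one row by taking the value at the same position in every column."""
    return {name: values[position] for name, values in columns.items()}


columns = to_columns(rows)

# An aggregate over one column touches only that column's list.
comedies = sum(1 for genre in columns["genre"] if genre == "Comedy")
print(comedies)                    # 3
print(assemble_row(columns, 0))    # the "Mrs. Doubtfire" row
```

Counting comedies reads a single list, while producing a full row has to visit every column, which is the trade-off the following slides describe.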
  • 28. Column stores • Column compression can be way more efficient than row based compression (sometimes 10:1 to 30:1 ratio) • Compression: RLE, Integer packing, dictionaries and lookup, other… • Reduces both storage and IO (thus response time)
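Two of the techniques listed on this slide, run-length encoding (RLE) and dictionary encoding, are simple enough to sketch directly. The functions below are minimal illustrations written for this note, not the implementation of any particular column store; the sample values come from the genre column used earlier in the deck.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def rle_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Run-length encoding: store each run of equal values as (value, run length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs: List[Tuple[str, int]]) -> List[str]:
    """Expand (value, run length) pairs back into the original list."""
    return [value for value, length in runs for _ in range(length)]


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Dictionary encoding: replace each value with a small integer code."""
    codes_by_value: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        # A new value gets the next free code; a known value reuses its code.
        codes.append(codes_by_value.setdefault(value, len(codes_by_value)))
    return list(codes_by_value), codes


genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]

sorted_runs = rle_encode(sorted(genre))
print(sorted_runs)   # [('Comedy', 3), ('Drama', 2), ('Horror', 1)]

dictionary, codes = dictionary_encode(genre)
print(dictionary)    # ['Comedy', 'Horror', 'Drama']
print(codes)         # [0, 0, 1, 2, 0, 2]
```

RLE pays off when equal values sit next to each other, which is why the example sorts the column first; dictionary encoding helps whenever a column has few distinct values, regardless of order.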
  • 29. Column stores Best case scenario: • select sum(bigint_column) from table • Million rows in table • Average row length is 1 KiB • The select reads one bigint column (8 bytes per row) • Only a single column is read from disk • Reads ~8 MiB of column data, even less with compression
  • 30. Column stores Bad case scenario: select * from long_wide_table where order_line_id = 34653875; • Accessing all columns doesn’t save anything, could be even more expensive than row store • Not ideal for tables with few columns
  • 31. Column stores Updating and deleting rows is expensive • Some column stores are append only • Others just strongly discourage writes • Some split storage into row and column areas
  • 32. Column / Row stores • RDBMS provide ACID capabilities • Row stores mainly use tree style indexes • B-tree derivative index structure provides very fast search as long as it fits into memory • Very large datasets end up with unmanageably big indexes • Column stores: bitmap indexing, which is very expensive to update
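To make the bitmap indexing mentioned on this slide concrete, here is a minimal sketch that keeps one bitmap per distinct column value, using a Python integer as the bit set. The helper names (`build_bitmap_index`, `matching_rows`) are invented for this illustration; production systems typically use compressed bitmap formats instead.

```python
from typing import Dict, List


def build_bitmap_index(values: List[str]) -> Dict[str, int]:
    """One bitmap per distinct value; bit i is set when row i holds that value."""
    index: Dict[str, int] = {}
    for position, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_rows(bitmap: int) -> List[int]:
    """List the row positions whose bit is set in the bitmap."""
    return [pos for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]


genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]
index = build_bitmap_index(genre)

# WHERE genre = 'Comedy' OR genre = 'Horror' becomes a single bitwise OR.
comedy_or_horror = index["Comedy"] | index["Horror"]
print(matching_rows(comedy_or_horror))   # [0, 1, 2, 4]
```

Predicates combine with cheap bitwise operations, but changing a row means touching the bitmaps of both the old and the new value, and deleting a row in the middle shifts every later position, which gives some intuition for why updates are costly.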
  • 33. BigQuery history “Dremel: Interactive Analysis of Web-Scale Datasets” – 2010, describes a column store / retrieval system • https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf Presentation with illustration about principles used in Dremel, from Google • http://www.cs.berkeley.edu/~istoica/classes/cs294/11/notes/12-sameer-dremel.pdf
  • 34. BigQuery • A service that enables interactive analysis of massively large datasets • Based on Dremel, a scalable, interactive ad hoc query system for analysis of read-only nested data • Working in conjunction with Google Storage • Has a RESTful web service interface
  • 35. BigQuery • You can issue SQL queries over big data • Interactive web interface • Response time as short as possible • Auto scales under the hood
  • 36. BigQuery SaaS (/ PaaS) Interfacing: • REST API • Web console • Command line tools • Language libraries Insert only
  • 37. Demo! • Wikipedia public dataset • Natalities public dataset • Names (uploaded) • Google Genomics • https://cloud.google.com/genomics/ • https://cloud.google.com/genomics/v1/public-data • https://cloud.google.com/bigquery/web-ui-quickstart • https://cloud.google.com/bigquery/query-reference
  • 38. Future thoughts How to visualize data • Possibly using Google Charts • BigQuery alongside Google Maps Playing with genomics data – requires some bio-informatics knowledge
  • 41. Resources • Slides: http://www.slideshare.net/tothc • Contact: http://www.meetup.com/CCalJUG/ • Csaba Toth: Introduction to Hadoop and MapReduce - http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-reduce • Justin Swanhart: Introduction to column stores - http://www.slideshare.net/MySQLGeek/intro-to-column-stores • Jan Steemann: Column-oriented databases - http://www.slideshare.net/arangodb/introduction-to-column-oriented-databases
  • 42. Resources • https://anonymousbi.wordpress.com/2012/11/02/hadoop-beginners-tutorial-on-ubuntu/ • https://www.capgemini.com/blog/capping-it-off/2012/01/what-is-hadoop • http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI • https://www.cloudera.com/content/cloudera/en/documentation/core/latest/PDF/cloudera-impala.pdf • https://www.keithrozario.com/2012/07/google-bigquery-wikipedia-dataset-malaysia-singapore.html • https://cloud.google.com/bigquery/web-ui-quickstart • https://cloud.google.com/bigquery/query-reference
  • 44. Pricing https://cloud.google.com/bigquery/pricing • Storage: $0.020 per GB/mo. • Queries: $5 per TB processed • Streaming inserts: $0.01 per 200 MiB (1 KiB rows) • Columns are compressed, but the price is based on the uncompressed size
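A back-of-the-envelope calculator shows how the prices on this slide play out. The rates are the ones quoted on the slide and may be out of date; treating 1 TB as 10^12 bytes and 1 GB as 10^9 bytes is an assumption of this sketch, and the 2 TB table with ten equally sized columns is a made-up example.

```python
# Rates as quoted on the slide; check the pricing page for current values.
STORAGE_USD_PER_GB_MONTH = 0.020
QUERY_USD_PER_TB = 5.0

GB = 10 ** 9    # assumption: decimal units
TB = 10 ** 12   # assumption: decimal units


def storage_cost_usd_per_month(uncompressed_bytes: float) -> float:
    """Monthly storage cost, billed on uncompressed size."""
    return uncompressed_bytes / GB * STORAGE_USD_PER_GB_MONTH


def query_cost_usd(bytes_processed: float) -> float:
    """Cost of one query, billed on the bytes of the columns it scans."""
    return bytes_processed / TB * QUERY_USD_PER_TB


# Hypothetical table: 2 TB uncompressed, 10 columns of equal size.
table_bytes = 2 * TB
column_count = 10

one_column = query_cost_usd(table_bytes / column_count)
select_star = query_cost_usd(table_bytes)
storage = storage_cost_usd_per_month(table_bytes)

print(f"query on 1 column: ${one_column:.2f}")    # $1.00
print(f"select *:          ${select_star:.2f}")   # $10.00
print(f"storage per month: ${storage:.2f}")       # $40.00
```

Under these assumptions a single `select *` costs ten times as much as a query that touches one column, which is why restricting the selected columns matters.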
  • 45. Tips • Storage dominates the costs • Plus you need to restrict queries, use as few columns as possible • BigQuery scans the full columns which are involved in the query • Do not select * • LIMIT the result set • Result caching

Editor's Notes

  • #6: SQL: Structured Query Language
  • #33: ACID: Atomicity, Consistency, Isolation, Durability
  • #38: Name schema: name:string,gender:string,count:integer