Google App Engine
A Big Data Laboratory?
J Singh, Early Stage IT
March 20, 2012
© J Singh, 2011
App Engine as a Big Data Laboratory?
• Why bother? Why not use Hadoop?
• Evaluating App Engine as a Big Data Laboratory
– Loading Data
– Analytics Capabilities
– Visualization Capabilities
• Conclusions
Why Bother? Why not Hadoop?
• No install and configuration required
– Focus on the task: Analytics and Visualization
– Use the technology that powers Google Earth and Google Finance
• Works with Google Datastore
– Makes sense if your data is already there
• No import/export of data necessary
• But a purely 'low-level' programming environment
– Write Map and Reduce functions in Python / Java
– No Pig, Hive, …
• Is this story for real? We wanted to find out.
Loading Data into GAE
• What? No native OS environment to work in?
– No OS commands, no file system accessible to the programmer
– Data Prep must be done elsewhere.
• But other options exist
1. Upload a file into Blobstore through an HTTP request
• Max object size 2GB, max get/put in one call: 1MB.
• Process into Datastore entities using BlobstoreInputReader or
BlobstoreZipInputReader classes.
2. Use remote_api to upload CSV files
• It's painful
– Only needs to be done once, we hope
– Or we need to set up a process for staging and feeding the data
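A staging step of the kind described above could be sketched as follows. This is a minimal illustration, not the author's loader: it only shows splitting a payload into pieces that respect the 1MB-per-call limit quoted on the slide, and turning CSV rows into batches of dictionaries that could later become Datastore entities. The helper names (`iter_chunks`, `rows_to_entities`), the batch size, and the sample data are all hypothetical, and no App Engine API is called.

```python
import csv
import io

MAX_CALL_BYTES = 1024 * 1024  # the 1MB per-call limit quoted on the slide


def iter_chunks(data, size=MAX_CALL_BYTES):
    """Yield consecutive byte slices of at most `size` bytes."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def rows_to_entities(csv_text, batch_size=2):
    """Parse CSV text into batches of dicts, one dict per future entity."""
    reader = csv.DictReader(io.StringIO(csv_text))
    batch = []
    for row in reader:
        batch.append(dict(row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


# Made-up sample data, used only to exercise the helpers.
sample = "name,score\nalpha,1\nbeta,2\ngamma,3\n"
batches = list(rows_to_entities(sample))
chunks = list(iter_chunks(sample.encode("utf-8"), size=16))
```

Batching keeps each upload call small and makes a failed call cheap to retry, which matters if the load has to be repeated rather than done once.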
Data Analysis: NumPy and SciPy
• NumPy and SciPy libraries, which use the traditional computing
model (not Map/Reduce), include:
– Array and Matrix manipulation
– Optimization algorithms, e.g., curve fitting, linear regression,
multi-variate regression.
– Multithreading (for embarrassingly parallel problems)
• Replace map(…) with parallel_map(…).
– map is a Python primitive
– parallel_map is a helper supplied separately (for example by a thread-pool recipe); it is not part of core NumPy
– Other scientific algorithms, e.g., Kalman Filtering, Signal
smoothing, Markov Chains.
• NumPy and SciPy depended on Python 2.7
– Enabled in Fall, 2011.
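The "replace map(…) with parallel_map(…)" idea can be sketched with the Python standard library alone. The `parallel_map` below is a hypothetical helper written for this illustration on top of `concurrent.futures`; it is an assumption, not the helper the slide refers to, and it is written for current Python rather than the Python 2.7 runtime mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, workers=4):
    """Apply func to every item using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def square(x):
    return x * x


serial = list(map(square, range(6)))       # built-in map, one item at a time
parallel = parallel_map(square, range(6))  # same call shape, work spread over threads
```

Because the call shape matches `map`, an embarrassingly parallel loop can be switched over with a one-word change. In CPython, threads mainly help when the work is I/O-bound or runs in C code that releases the interpreter lock.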
Data Analysis: MapReduce
• Input Reader
– Several provided by GAE, can write your own
• Map function: Written by Programmer
• Shuffle function: Provided
– Can write your own overrides for partitioning
(sharding) and comparison (use in sort)
• Reduce function: Written by Programmer
– Can be skipped if not needed
• Output Writer
– Several provided by GAE, can write your own
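The five stages listed above can be sketched as a toy, in-memory word count. This is plain Python meant only to show how the stages fit together; it does not use the GAE MapReduce library, and every function name here is an assumption made for the illustration.

```python
from collections import defaultdict


def input_reader(lines):
    """Input reader: hand records to the mapper one at a time."""
    for line in lines:
        yield line


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in the record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Wire the stages together; the returned list plays the output writer."""
    mapped = (pair for record in input_reader(lines) for pair in map_fn(record))
    return [reduce_fn(key, values) for key, values in shuffle(mapped)]


counts = run_job(["big data lab", "big data"])
```

Only `map_fn` and `reduce_fn` carry job-specific logic, which mirrors the split on the slide between what the programmer writes and what the framework provides.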
Data Analysis: Pipeline API
• Based on Python Generator functions
• Allows chaining of map reduce jobs
– Primitives for setting up various types of chains
• MapreducePipeline (prev page) was just one type of pipeline
• Available for Python or Java
– Python side better documented
Split and Merge example
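    # Import path is an assumption; adjust to how the Pipeline library is installed.
    from pipeline import pipeline, common

    class aPipe(pipeline.Pipeline):
        def run(self, e_kind, prop_name, *value_list):
            all_bs = []
            for v in value_list:
                # Split: start one child pipeline (bPipe, defined elsewhere) per value
                stage = yield bPipe(e_kind, prop_name, v)
                all_bs.append(stage)
            # Merge: gather the child results into a single list
            yield common.Append(*all_bs)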
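To show why generator functions suit this style of chaining, here is a toy driver in plain Python. It is a sketch under stated assumptions, not the Pipeline API: `run_stage`, `count_matches`, `append`, and `split_and_merge` are hypothetical names, and each child runs immediately and synchronously, whereas the real library schedules children and hands back placeholders for their results.

```python
def run_stage(stage):
    """Drive a stage: plain values are results; generators yield child stages."""
    if not hasattr(stage, "send"):
        return stage  # a leaf stage that already produced its value
    result = None
    try:
        child = next(stage)
        while True:
            result = run_stage(child)   # run the child to completion
            child = stage.send(result)  # hand its result back to the parent
    except StopIteration:
        return result  # the value of the last child the parent yielded


def count_matches(records, value):
    """Leaf stage: count records equal to value (stands in for bPipe)."""
    return sum(1 for r in records if r == value)


def append(*results):
    """Merge stage: gather results into a list (stands in for common.Append)."""
    return list(results)


def split_and_merge(records, *values):
    """Parent stage: one child per value, then a final merge."""
    all_counts = []
    for v in values:
        stage = yield count_matches(records, v)
        all_counts.append(stage)
    yield append(*all_counts)


totals = run_stage(split_and_merge(["a", "b", "a", "c"], "a", "b"))
```

Each `yield` hands control to the driver, which is what lets a framework decide where and when a child stage actually runs.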
Data Visualization
• App Engine supports multiple web frameworks for serving data
directly from the Datastore into an HTML5 browser:
– Django, Jinja2, CherryPy, …
• Options:
– jQuery Visualize
– Google Visualization API
• Including MotionCharts
– Hans Rosling's Visualization API
– Check out his TED talk
• Conclusion:
– A rich set of facilities for visualization and taking action
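Serving data to a charting library usually comes down to emitting JSON in the shape the library expects. The sketch below builds the cols/rows literal that the Google Visualization API's DataTable accepts. The helper name `to_datatable`, the column names, and the sample values are assumptions made for illustration; a real handler would read the rows from the Datastore and return the JSON from one of the web frameworks listed above.

```python
import json


def to_datatable(columns, rows):
    """Build a dict in the Google Visualization DataTable JSON layout."""
    return {
        "cols": [{"id": name, "label": name, "type": kind} for name, kind in columns],
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


columns = [("year", "string"), ("visits", "number")]  # made-up sample schema
rows = [("2010", 120), ("2011", 180)]                  # made-up sample data
payload = json.dumps(to_datatable(columns, rows))
```

On the browser side, a payload of this shape can be passed to the DataTable constructor and then to a chart object.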
Decision Factors
Usage: Discussion
• Proof of Concept or Demo in GAE
– Need a process for data loading
– But saves on having to do Hadoop setup
– Absence of Pig/Hive may be a limiting factor
– Advantage in visualization
– Better security and isolation than Hadoop
• Production in GAE
– Analyze cost before committing
– Lock-in risk?
• Production elsewhere
– Good semantic match between Datastore and HBase
– Need to do Hadoop setup and operation
Thank you
• J Singh
– President, Early Stage IT
• Technology Services and Strategy for Startups
• DataThinks.org is a new service of Early Stage IT
– “Big Data” analytics solutions

