The Internet-of-things:
Architecting for the
deluge of data
Bryan Cantrill
CTO
bryan@joyent.com
@bcantrill
Big Data circa 1994: Pre-Internet
Source: BusinessWeek, September 5, 1994
Aside: Internet circa 1994
Source: BusinessWeek, October 10, 1994
Big Data circa 2004: Internet exhaust
• Through the 1990s, Moore’s Law + Kryder’s Law grew
faster than transaction rates, and what was
“overwhelming” in 1994 was manageable by 2004
• But large internet concerns (Google, Facebook, Yahoo!)
encountered a new class of problem: analyzing massive
amounts of data emitted as a byproduct of activity
• Data scaled with activity, not transactions — changing
both data sizes and economics
• Data sizes were too large for extant data warehousing
solutions — and were embarrassingly parallel besides
Big Data circa 2004: MapReduce
• MapReduce, pioneered by Google and later emulated
by Hadoop, pointed to a new paradigm where compute
tasks are broken into map and reduce phases
• Serves to explicitly divide the work that can be
parallelized from that which must be run sequentially
• Map phases are farmed out to a storage layer that
attempts to co-locate them with the data being mapped
• Made for commodity scale-out systems; relatively cheap
storage allowed for sloppy but effective solutions (e.g.
storing data in triplicate)
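The division above can be sketched with ordinary shell tools. The word-count example below is invented for illustration: each map step reads only its own chunk (and so could be farmed out and co-located with the data), while the reduce step must see all of the mapped output:

```shell
# Toy map/reduce word count; all file names are invented for this sketch.
mkdir -p /tmp/mr
printf 'apple banana\napple\n' > /tmp/mr/chunk1
printf 'banana apple\n' > /tmp/mr/chunk2

# Map phase: each chunk independently emits <word, 1> pairs.
# The chunks are independent, so these iterations could run in
# parallel on separate machines, next to the data they read.
for chunk in /tmp/mr/chunk1 /tmp/mr/chunk2; do
  tr -s ' ' '\n' < "$chunk" | sed 's/$/ 1/'
done > /tmp/mr/mapped

# Reduce phase: group by word and sum the counts (the serial part).
sort /tmp/mr/mapped |
  awk '{ sum[$1] += $2 } END { for (w in sum) print w, sum[w] }' |
  sort > /tmp/mr/counts
cat /tmp/mr/counts    # prints "apple 3" and "banana 2"
```

Real frameworks add scheduling, fault tolerance, and data locality, but the shape of the computation is the same.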
Big Data circa 2014
• Hadoop has become the de facto big data processing
engine — and HDFS the de facto storage substrate
• But HDFS is designed around availability during/for
computation; it is not designed to be authoritative
• HDFS is used primarily for data that is redundant,
transient, replaceable or otherwise fungible
• Authoritative storage remains either enterprise storage
(on premises) or object storage (in the cloud)
• For analysis of non-fungible data, the pattern is to ingest
data into a Hadoop cluster from authoritative storage
• But a new set of problems is poised to emerge...
Big Data circa 2014: Internet-of-things
• IDC forecasts that the “digital universe” will grow from
130 exabytes in 2005 to 40,000 exabytes in 2020 —
with as much as a third having “analytic value”
• This doesn’t even factor in the (long forecasted) rise of
the internet-of-things/industrial internet...
• Machine-generated data at the edge will effect a step
function in data sizes and processing methodologies
• No one really knows how much data will be generated
by IoT, but the numbers are insane (e.g., HD camera
generates 20 GB/hour; a Ford Energi engine generates
25 GB/hour; a GE jet engine generates 1TB/flight)
How to cope with IoT-generated data?
• IoT presents so much more data that we will
increasingly need data science to make sense of it
• To assure data, we need to retain as much raw data as
possible, storing it once and authoritatively
• Storing data authoritatively has ramifications for the
storage substrate
• To allow for science, we need to place an emphasis on
hypothesis exploration: it must be quick to iterate from
hypothesis to experiment to result to new hypothesis
• Emphasizing hypothesis exploration has ramifications
for the compute abstractions and data movement
The coming ramifications of IoT
• It will no longer be acceptable to discard data: all data
will need to be retained to explore future hypotheses
• It will no longer be acceptable to store three copies: 3X
on storage costs is too acute when data is massive
• It will no longer be acceptable to move data for analysis:
in some cases, not even over the internet!
• It will no longer be acceptable to dictate the abstraction:
we must accommodate anything that can process data
• These shifts are as significant as the shift from
traditional data warehousing to scale-out MapReduce!
IoT: Authoritative storage?
• “Filesystems” that are really just user-level programs
layered on local filesystems lack device-level visibility,
sacrificing reliability and performance
• Even in-kernel, we have seen the corrosiveness of an
abstraction divide in the historic divide between logical
volume management and the filesystem:
• The volume manager understands multiple disks, but
nothing of the higher level semantics of the filesystem
• The filesystem understands the higher semantics of the
data, but has no physical device understanding
• This divide became entrenched over the 1990s, and had
devastating ramifications for reliability and performance
The ZFS revolution
• Starting in 2001, Sun began a revolutionary new
software effort: to unify storage and eliminate the divide
• In this model, filesystems would lose their one-to-one
association with devices: many filesystems would be
multiplexed on many devices
• By starting with a clean sheet of paper, ZFS opened up
vistas of innovation — and by its architecture was able
to solve many otherwise intractable problems
• Sun shipped ZFS in 2005, and used it as the foundation
of its enterprise storage products starting in 2008
• ZFS was open sourced in 2005; it remains the only open
source enterprise-grade filesystem
ZFS advantages
• Copy-on-write design allows on-disk consistency to be
always assured (eliminating the need for a file system check)
• Copy-on-write design allows constant-time snapshots in
unlimited quantity — and writable clones!
• Filesystem architecture allows filesystems to be created
instantly and expanded — or shrunk! — on-the-fly
• Integrated volume management allows for intelligent
device behavior with respect to disk failure and recovery
• Adaptive replacement cache (ARC) allows for optimal
use of DRAM — especially on high DRAM systems
• Support for dedicated log and cache devices allows for
optimal use of flash-based SSDs
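In concrete terms, each of the advantages above maps onto a one-line administrative command. A sketch (pool and dataset names are invented, and the commands require a machine with spare disks and root privileges):

```shell
# Integrated volume management: a mirrored pool built from raw devices
zpool create tank mirror c1t0d0 c2t0d0
# Dedicated log and cache devices for flash-based SSDs
zpool add tank log c3t0d0
zpool add tank cache c4t0d0
# Filesystems are created instantly and resized on the fly
zfs create tank/sensors
zfs set quota=500G tank/sensors
# Constant-time snapshot, and a writable clone of that snapshot
zfs snapshot tank/sensors@2014-05-01
zfs clone tank/sensors@2014-05-01 tank/experiment
```

Note that the snapshot and clone complete in constant time regardless of how much data the filesystem holds, which is what makes cheap experimentation on large datasets plausible.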
ZFS at Joyent
• Joyent was the earliest ZFS adopter: becoming (in
2005) the first production user of ZFS outside of Sun
• ZFS is one of the four foundational technologies of
Joyent’s SmartOS, our illumos derivative
• The other three foundational technologies in SmartOS are
DTrace, Zones and KVM
• Search “fork yeah illumos” for the (uncensored) history of
OpenSolaris, illumos, SmartOS and derivatives
• Joyent has extended ZFS to better support multi-tenant
operation with I/O throttling
ZFS as the basis for IoT?
• ZFS offers commodity hardware economics with
enterprise-grade reliability — and obviates the need for
cross-machine mirroring for durability
• But ZFS is not itself a scale-out distributed system, and
is ill suited to become one
• Conclusion: ZFS is a good building block for the data
explosion from IoT, but not the whole puzzle
IoT: Compute abstraction?
• To facilitate hypothesis exploration, we need to carefully
consider the abstraction for computation
• How is data exploration programmatically expressed?
• How can this be made to be multi-tenant?
• The key enabling technology for multi-tenancy is
virtualization — but where in the stack to virtualize?
Hardware-level virtualization?
• The historical answer — since the 1960s — has been to
virtualize at the level of the hardware:
• A virtual machine is presented upon which each
tenant runs an operating system of their choosing
• There are as many operating systems as tenants
• The historical motivation for hardware virtualization
remains its advantage today: it can run entire legacy
stacks unmodified
• However, hardware virtualization exacts a heavy toll:
operating systems are not designed to share resources
like DRAM, CPU, I/O devices or the network
• Hardware virtualization limits tenancy and inhibits
performance!
Platform-level virtualization?
• Virtualizing at the application platform layer addresses
the tenancy challenges of hardware virtualization…
• ...but at the cost of dictating abstraction to the developer
• With IoT, this is especially problematic: we can expect
much more analog data and much deeper numerical
analysis — and dependencies on native libraries and/or
domain-specific languages
• Virtualizing at the application platform layer poses many
other challenges:
• Security, resource containment, language specificity,
environment-specific engineering costs
Joyent’s solution: OS containers
• Containers virtualize at the level of the OS and hit the sweet spot:
• Single OS (single kernel) allows for efficient use of hardware
resources, and therefore allows load factors to be high
• Disjoint instances are securely compartmentalized by the
operating system
• Gives customers what appears to be a virtual machine
(albeit a very fast one) on which to run higher-level software
• Gives customers PaaS when the abstractions work for them,
IaaS when they need more generality
• OS-level virtualization allows for high levels of tenancy
without dictating abstraction or sacrificing efficiency
• Zones is a bullet-proof implementation of OS-level
virtualization — and is the core abstraction in Joyent’s
SmartOS
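As a rough sketch of what OS-level virtualization looks like operationally, the stock illumos zone tooling boils down to a few commands (the zone name is invented, the commands require root, and SmartOS layers its own provisioning tools atop these primitives):

```shell
# Configure a zone: a named, isolated user-space instance
zonecfg -z tenant1 'create; set zonepath=/zones/tenant1; set autoboot=true'
zoneadm -z tenant1 install   # lay down the zone's filesystem
zoneadm -z tenant1 boot      # "boots" in seconds: no second kernel involved
zlogin tenant1               # a shell inside the tenant's container
zoneadm list -cv             # one kernel, many disjoint instances
```

Because there is no guest kernel, a zone's cost is close to that of a process group, which is what allows load factors to be high.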
Idea: ZFS + Containers?
Manta: ZFS + Containers!
• Building a sophisticated distributed system on top of
ZFS and zones, we have built Manta, an internet-facing
object storage system offering in situ compute
• That is, the description of compute can be brought to
where objects reside instead of having to backhaul
objects to transient compute
• The abstractions made available for computation are
anything that can run on the OS...
• ...and as a reminder, the OS — Unix — was built around
the notion of ad hoc unstructured data processing, and
allows for remarkably terse expressions of computation
Aside: Unix
• When Unix appeared in the early 1970s, it was not just a
new system, but a new way of thinking about systems
• Instead of a sealed monolith, the operating system was
a collection of small, easily understood programs
• First Edition Unix (1971) contained many programs that
we still use today (ls, rm, cat, mv)
• Its very name conveyed this minimalist aesthetic: Unix is
a homophone of “eunuchs” — a castrated Multics
We were a bit oppressed by the big system mentality. Ken
wanted to do something simple. — Dennis Ritchie
Unix: Let there be light
• In 1969, Doug McIlroy had the idea of connecting
different components:
At the same time that Thompson and Ritchie were sketching
out a file system, I was sketching out how to do data
processing on the blackboard by connecting together
cascades of processes
• This was the primordial pipe, but it took three years to
persuade Thompson to adopt it:
And one day I came up with a syntax for the shell that went
along with the piping, and Ken said, “I’m going to do it!”
Unix: ...and there was light
And the next morning we had this
orgy of one-liners. — Doug McIlroy
The Unix philosophy
• The pipe — coupled with the small-system aesthetic —
gave rise to the Unix philosophy, as articulated by Doug
McIlroy:
• Write programs that do one thing and do it well
• Write programs to work together
• Write programs that handle text streams, because
that is a universal interface
• Four decades later, this philosophy remains the single
most important revolution in software systems thinking!
Doug McIlroy v. Don Knuth: FIGHT!
• In 1986, Jon Bentley posed the challenge that became
the Epic Rap Battle of computer science history:
Read a file of text, determine the n most frequently used
words, and print out a sorted list of those words along with
their frequencies.
• Don Knuth’s solution: an elaborate program in WEB, a
Pascal-like literate programming system of his own
invention, using a purpose-built algorithm
• Doug McIlroy’s solution shows the power of the Unix
philosophy:
tr -cs A-Za-z '\n' | tr A-Z a-z | \
sort | uniq -c | sort -rn | sed ${1}q
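McIlroy’s pipeline can be tried on any small input; here it is with the ${1} parameter fixed at 2 and an invented test sentence:

```shell
# An invented input file for trying out the pipeline:
printf 'The cat and the dog and the bird\n' > /tmp/fight.txt

# McIlroy's solution with n = 2: the two most frequent words,
# counts first ("the" appears 3 times, "and" twice)
tr -cs A-Za-z '\n' < /tmp/fight.txt | tr A-Z a-z |
  sort | uniq -c | sort -rn | sed 2q > /tmp/top2.txt
cat /tmp/top2.txt
```

Six small programs, each doing one thing, composed into a complete solution to Bentley’s challenge.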
Big Data: History repeats itself?
• The original Google MapReduce paper (Dean et al.,
OSDI ’04) poses a problem disturbingly similar to
Bentley’s challenge nearly two decades prior:
Count of URL Access Frequency: The map function processes
logs of web page requests and outputs ⟨URL, 1⟩. The
reduce function adds together all values for the same URL
and emits a ⟨URL, total count⟩ pair
• But the solutions do not adhere to the Unix philosophy...
• ...and nor do they make use of the substantial Unix
foundation for data processing
• e.g., Appendix A of the OSDI ’04 paper has a 71 line
word count in C++ — with nary a wc in sight
Manta: Unix for Big Data — and IoT
• Manta allows for an arbitrarily scalable variant of
McIlroy’s solution to Bentley’s challenge:
mfind -t o /bcantrill/public/v7/usr/man | \
mjob create -o -m "tr -cs A-Za-z '\n' | \
tr A-Z a-z | sort | uniq -c" -r \
"awk '{ x[$2] += $1 }
END { for (w in x) { print x[w] " " w } }' | \
sort -rn | sed ${1}q"
• This description is not only terse, it is high-performing: data
is left at rest — with the “map” phase doing heavy
reduction of the data stream
• As such, Manta — like Unix — is not merely syntactic
sugar; it converges compute and data in a new way
Manta: CAP tradeoffs
• Eventual consistency represents the wrong CAP
tradeoffs for most; we prefer consistency over
availability for writes (but still availability for reads)
• Many more details:
http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/
• Celebrity endorsement:
Manta: Other design principles
• Hierarchical storage is an excellent idea (ht: Multics);
Manta implements proper directories, delimited with a
forward slash
• Manta implements a snapshot/link hybrid dubbed a
snaplink; can be used to effect versioning
• Manta has full support for CORS headers
• Manta uses SSH-based HTTP auth for client-side
tooling (IETF draft-cavage-http-signatures-00)
• Manta SDKs exist for node.js, R, Go, Java, Ruby,
Python — and of course, compute jobs may be in any of
these (plus Perl, Clojure, Lisp, Erlang, Forth, Prolog,
Fortran, Haskell, Lua, Mono, COBOL, etc.)
• “npm install manta” for command line interface
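The client-side tooling can be sketched end-to-end; the account name and file names below are invented, and running this requires real Manta credentials (including MANTA_KEY_ID set to your SSH key fingerprint):

```shell
# Install the CLI and point it at a Manta endpoint
npm install -g manta
export MANTA_URL=https://us-east.manta.joyent.com
export MANTA_USER=alice            # hypothetical account

mmkdir -p ~~/stor/sensors          # "~~" abbreviates /$MANTA_USER
mput -f reading.json ~~/stor/sensors/reading.json
# A snaplink: a constant-time, immutable link, usable for versioning
mln ~~/stor/sensors/reading.json ~~/stor/sensors/reading-v1.json
mls ~~/stor/sensors
```

Because objects are directory entries under a forward-slash hierarchy, familiar Unix idioms (ls-like listing, find-like traversal via mfind) carry over directly.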
Manta and IoT
• We believe compute/data convergence to be a
constraint imposed by IoT: stores of record must support
computation as a first-class, in situ operation
• We believe that some (and perhaps many) IoT
workloads will require computing at the edge — internet
transit may be prohibitive for certain applications
• We believe that Unix is a natural way of expressing this
computation — and that OS containers are the right way
to support this securely
• We believe that ZFS is the only sane storage
underpinning for such a system
• Manta will surely not be the only system to represent the
confluence of these — but it is the first
Manta: More information
• Product page:
http://joyent.com/products/manta
• node.js module:
https://github.com/joyent/node-manta
• Manta documentation:
http://apidocs.joyent.com/manta/
• IRC, e-mail, Twitter, etc.:
#manta on freenode, manta@joyent.com, @mcavage,
@dapsays, @yunongx, @joyent

More Related Content

What's hot (20)

PDF
The Peril and Promise of Early Adoption: Arriving 10 Years Early to Containers
bcantrill
 
PDF
Dynamic Languages in Production: Progress and Open Challenges
bcantrill
 
PDF
Oral tradition in software engineering: Passing the craft across generations
bcantrill
 
PDF
Platform as reflection of values: Joyent, node.js, and beyond
bcantrill
 
PDF
Debugging microservices in production
bcantrill
 
PDF
Bringing the Unix Philosophy to Big Data
bcantrill
 
PDF
The Container Revolution: Reflections after the first decade
bcantrill
 
PDF
Zebras all the way down: The engineering challenges of the data path
bcantrill
 
PDF
Docker's Killer Feature: The Remote API
bcantrill
 
PDF
Debugging under fire: Keeping your head when systems have lost their mind
bcantrill
 
PPTX
server to cloud: converting a legacy platform to an open source paas
Todd Fritz
 
PDF
node.js in production: Reflections on three years of riding the unicorn
bcantrill
 
PDF
Manta: a new internet-facing object storage facility that features compute by...
Hakka Labs
 
PDF
The State of Cloud 2016: The whirlwind of creative destruction
bcantrill
 
PDF
Debugging (Docker) containers in production
bcantrill
 
PDF
Crash Course in Open Source Cloud Computing
Mark Hinkle
 
PDF
BayLISA meetup: 8/16/12
bcantrill
 
PDF
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebula Project
 
PPTX
OSCON 2014 - Crash Course in Open Source Cloud Computing
Mark Hinkle
 
PPTX
LinuxFest Northwest: Crash Course in Open Source Cloud Computing
Mark Hinkle
 
The Peril and Promise of Early Adoption: Arriving 10 Years Early to Containers
bcantrill
 
Dynamic Languages in Production: Progress and Open Challenges
bcantrill
 
Oral tradition in software engineering: Passing the craft across generations
bcantrill
 
Platform as reflection of values: Joyent, node.js, and beyond
bcantrill
 
Debugging microservices in production
bcantrill
 
Bringing the Unix Philosophy to Big Data
bcantrill
 
The Container Revolution: Reflections after the first decade
bcantrill
 
Zebras all the way down: The engineering challenges of the data path
bcantrill
 
Docker's Killer Feature: The Remote API
bcantrill
 
Debugging under fire: Keeping your head when systems have lost their mind
bcantrill
 
server to cloud: converting a legacy platform to an open source paas
Todd Fritz
 
node.js in production: Reflections on three years of riding the unicorn
bcantrill
 
Manta: a new internet-facing object storage facility that features compute by...
Hakka Labs
 
The State of Cloud 2016: The whirlwind of creative destruction
bcantrill
 
Debugging (Docker) containers in production
bcantrill
 
Crash Course in Open Source Cloud Computing
Mark Hinkle
 
BayLISA meetup: 8/16/12
bcantrill
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebula Project
 
OSCON 2014 - Crash Course in Open Source Cloud Computing
Mark Hinkle
 
LinuxFest Northwest: Crash Course in Open Source Cloud Computing
Mark Hinkle
 

Similar to The Internet-of-things: Architecting for the deluge of data (20)

PDF
ITI015En-The evolution of databases (I)
Huibert Aalbers
 
PPTX
Big data and hadoop
Mohit Tare
 
KEY
What ya gonna do?
CQD
 
PDF
Chapter 5(2).pdf
MehariKiros3
 
PPT
Cloud computingjun28
Aravindharamanan S
 
PPT
Cloud computingjun28
Dennis Ebenezer
 
PPTX
Journey to the Programmable Data Center
Toby Weiss
 
PPT
Cloud Computing concepts and technologies
ssuser4c9444
 
PDF
Data Lake and the rise of the microservices
Bigstep
 
PDF
Latest trendsincloud computing
Liliana Ignat
 
PPTX
cloudcomputing.pptx
ahmedsamir339466
 
PPTX
Introduction to Cloud computing and Big Data-Hadoop
Nagarjuna D.N
 
PDF
Latest (storage IO) patterns for cloud-native applications
OpenEBS
 
DOCX
Cloud Computing presentation . docx
anandmahto1820
 
PPTX
Cloud architecture, conception and computing PPT
NangVictorin
 
PPTX
Lecture 3.31 3.32.pptx
RATISHKUMAR32
 
PDF
How Open Source is Transforming the Internet. Again.
Steve Hoffman
 
PPTX
The Effectiveness, Efficiency and Legitimacy of Outsourcing Your Data
DataCentred
 
PPT
Sameer Mitter | Introduction to Cloud computing
Sameer Mitter
 
PPT
cloud.ppt
sourabhsinghbhopal
 
ITI015En-The evolution of databases (I)
Huibert Aalbers
 
Big data and hadoop
Mohit Tare
 
What ya gonna do?
CQD
 
Chapter 5(2).pdf
MehariKiros3
 
Cloud computingjun28
Aravindharamanan S
 
Cloud computingjun28
Dennis Ebenezer
 
Journey to the Programmable Data Center
Toby Weiss
 
Cloud Computing concepts and technologies
ssuser4c9444
 
Data Lake and the rise of the microservices
Bigstep
 
Latest trendsincloud computing
Liliana Ignat
 
cloudcomputing.pptx
ahmedsamir339466
 
Introduction to Cloud computing and Big Data-Hadoop
Nagarjuna D.N
 
Latest (storage IO) patterns for cloud-native applications
OpenEBS
 
Cloud Computing presentation . docx
anandmahto1820
 
Cloud architecture, conception and computing PPT
NangVictorin
 
Lecture 3.31 3.32.pptx
RATISHKUMAR32
 
How Open Source is Transforming the Internet. Again.
Steve Hoffman
 
The Effectiveness, Efficiency and Legitimacy of Outsourcing Your Data
DataCentred
 
Sameer Mitter | Introduction to Cloud computing
Sameer Mitter
 
Ad

More from bcantrill (18)

PDF
Predicting the Present
bcantrill
 
PDF
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
 
PDF
Coming of Age: Developing young technologists without robbing them of their y...
bcantrill
 
PDF
I have come to bury the BIOS, not to open it: The need for holistic systems
bcantrill
 
PDF
Towards Holistic Systems
bcantrill
 
PDF
The Coming Firmware Revolution
bcantrill
 
PDF
Hardware/software Co-design: The Coming Golden Age
bcantrill
 
PDF
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
bcantrill
 
PDF
No Moore Left to Give: Enterprise Computing After Moore's Law
bcantrill
 
PDF
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
bcantrill
 
PDF
Visualizing Systems with Statemaps
bcantrill
 
PDF
Platform values, Rust, and the implications for system software
bcantrill
 
PDF
Is it time to rewrite the operating system in Rust?
bcantrill
 
PDF
dtrace.conf(16): DTrace state of the union
bcantrill
 
PDF
The Hurricane's Butterfly: Debugging pathologically performing systems
bcantrill
 
PDF
Papers We Love: ARC after dark
bcantrill
 
PDF
Principles of Technology Leadership
bcantrill
 
PDF
A crime against common sense
bcantrill
 
Predicting the Present
bcantrill
 
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
 
Coming of Age: Developing young technologists without robbing them of their y...
bcantrill
 
I have come to bury the BIOS, not to open it: The need for holistic systems
bcantrill
 
Towards Holistic Systems
bcantrill
 
The Coming Firmware Revolution
bcantrill
 
Hardware/software Co-design: The Coming Golden Age
bcantrill
 
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
bcantrill
 
No Moore Left to Give: Enterprise Computing After Moore's Law
bcantrill
 
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
bcantrill
 
Visualizing Systems with Statemaps
bcantrill
 
Platform values, Rust, and the implications for system software
bcantrill
 
Is it time to rewrite the operating system in Rust?
bcantrill
 
dtrace.conf(16): DTrace state of the union
bcantrill
 
The Hurricane's Butterfly: Debugging pathologically performing systems
bcantrill
 
Papers We Love: ARC after dark
bcantrill
 
Principles of Technology Leadership
bcantrill
 
A crime against common sense
bcantrill
 
Ad

Recently uploaded (20)

PDF
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
PPTX
The Future of AI & Machine Learning.pptx
pritsen4700
 
PPTX
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
PDF
introduction to computer hardware and sofeware
chauhanshraddha2007
 
PPTX
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
PPTX
Farrell_Programming Logic and Design slides_10e_ch02_PowerPoint.pptx
bashnahara11
 
PDF
Per Axbom: The spectacular lies of maps
Nexer Digital
 
PDF
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
PDF
The Future of Artificial Intelligence (AI)
Mukul
 
PDF
TrustArc Webinar - Navigating Data Privacy in LATAM: Laws, Trends, and Compli...
TrustArc
 
PDF
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
PDF
Build with AI and GDG Cloud Bydgoszcz- ADK .pdf
jaroslawgajewski1
 
PDF
Trying to figure out MCP by actually building an app from scratch with open s...
Julien SIMON
 
PDF
Researching The Best Chat SDK Providers in 2025
Ray Fields
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PPTX
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
PDF
Responsible AI and AI Ethics - By Sylvester Ebhonu
Sylvester Ebhonu
 
PPTX
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
PPTX
Agile Chennai 18-19 July 2025 Ideathon | AI Powered Microfinance Literacy Gui...
AgileNetwork
 
PPTX
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
The Future of AI & Machine Learning.pptx
pritsen4700
 
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
introduction to computer hardware and sofeware
chauhanshraddha2007
 
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
Farrell_Programming Logic and Design slides_10e_ch02_PowerPoint.pptx
bashnahara11
 
Per Axbom: The spectacular lies of maps
Nexer Digital
 
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
The Future of Artificial Intelligence (AI)
Mukul
 
TrustArc Webinar - Navigating Data Privacy in LATAM: Laws, Trends, and Compli...
TrustArc
 
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
Build with AI and GDG Cloud Bydgoszcz- ADK .pdf
jaroslawgajewski1
 
Trying to figure out MCP by actually building an app from scratch with open s...
Julien SIMON
 
Researching The Best Chat SDK Providers in 2025
Ray Fields
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
Responsible AI and AI Ethics - By Sylvester Ebhonu
Sylvester Ebhonu
 
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
Agile Chennai 18-19 July 2025 Ideathon | AI Powered Microfinance Literacy Gui...
AgileNetwork
 
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 

The Internet-of-things: Architecting for the deluge of data

  • 1. The Internet-of-things: Architecting for the deluge of data CTO [email protected] Bryan Cantrill @bcantrill
  • 2. Big Data circa 1994: Pre-Internet Source: BusinessWeek, September 5, 1994
  • 3. Aside: Internet circa 1994 Source: BusinessWeek, October 10, 1994
  • 4. Big Data circa 2004: Internet exhaust • Through the 1990s, Moore’s Law + Kryder’s Law grew faster than transaction rates, and what was “overwhelming” in 1994 was manageable by 2004 • But large internet concerns (Google, Facebook, Yahoo!) encountered a new class of problem: analyzing massive amounts of data emitted as a byproduct of activity • Data scaled with activity, not transactions — changing both data sizes and economics • Data sizes were too large for extant data warehousing solutions — and were embarrassingly parallel besides
  • 5. Big Data circa 2004: MapReduce • MapReduce, pioneered by Google and later emulated by Hadoop, pointed to a new paradigm where compute tasks are broken into map and reduce phases • Serves to explicitly divide the work that can be parallelized from that which must be run sequentially • Map phases are farmed out to a storage layer that attempts to co-locate them with the data being mapped • Made for commodity scale-out systems; relatively cheap storage allowed for sloppy but effective solutions (e.g. storing data in triplicate)
  • 6. Big Data circa 2014 • Hadoop has become the de facto big data processing engine — and HDFS the de facto storage substrate • But HDFS is designed around availability during/for computation; it is not designed to be authoritative • HDFS is used primarily for data that is redundant, transient, replaceable or otherwise fungible • Authoritative storage remains either enterprise storage (on premises) or object storage (in the cloud) • For analysis of non-fungible data, pattern is to ingest data into a Hadoop cluster from authoritative storage • But a new set of problems is poised to emerge...
  • 7. Big Data circa 2014: Internet-of-things • IDC forecasts that the “digital universe” will grow from 130 exabytes in 2005 to 40,000 exabytes in 2020 — with as much of a third having “analytic value” • This doesn’t even factor in the (long forecasted) rise of the internet-of-things/industrial internet... • Machine-generated data at the edge will effect a step function in data sizes and processing methodologies • No one really knows how much data will be generated by IoT, but the numbers are insane (e.g., HD camera generates 20 GB/hour; a Ford Energi engine generates 25 GB/hour; a GE jet engine generates 1TB/flight)
  • 8. How to cope with IoT-generated data? • IoT presents so much more data that we will increasingly need data science to make sense of it • To assure data, we need to retain as much raw data as possible, storing it once and authoritatively • Storing data authoritatively has ramifications for the storage substrate • To allow for science, we need to place an emphasis on hypothesis exploration: it must be quick to iterate from hypothesis to experiment to result to new hypothesis • Emphasizing hypothesis exploration has ramifications for the compute abstractions and data movement
  • 9. The coming ramifications of IoT • It will no longer be acceptable to discard data: all data will need to be retained to explore future hypotheses • It will no longer be acceptable to store three copies: 3X on storage costs is too acute when data is massive • It will no longer be acceptable to move data for analysis: in some cases, not even over the internet! • It will no longer be acceptable to dictate the abstraction: we must accommodate anything that can process data • These shifts are as significant as the shift from traditional data warehousing to scale-out MapReduce!
  • 10. IoT: Authoritative storage? • “Filesystems” that are really just user-level programs layered on local filesystems lack device-level visibility, sacrificing reliability and performance • Even in-kernel, we have seen the corrosiveness of an abstraction divide in the historic divide between logical volume management and the filesystem: • The volume manager understands multiple disks, but nothing of the higher level semantics of the filesystem • The filesystem understands the higher semantics of the data, but has no physical device understanding • This divide became entrenched over the 1990s, and had devastating ramifications for reliability and performance
  • 11. The ZFS revolution • Starting in 2001, Sun began a revolutionary new software effort: to unify storage and eliminate the divide • In this model, filesystems would lose their one-to-one association with devices: many filesystems would be multiplexed on many devices • By starting with a clean sheet of paper, ZFS opened up vistas of innovation — and by its architecture was able to solve many otherwise intractable problems • Sun shipped ZFS in 2005, and used it as the foundation of its enterprise storage products starting in 2008 • ZFS was open sourced in 2005; it remains the only open source enterprise-grade filesystem
  • 12. ZFS advantages • Copy-on-write design allows on-disk consistency to be always assured (eliminating file system check) • Copy-on-write design allows constant-time snapshots in unlimited quantity — and writable clones! • Filesystem architecture allows filesystems to be created instantly and expanded — or shrunk! — on-the-fly • Integrated volume management allows for intelligent device behavior with respect to disk failure and recovery • Adaptive replacement cache (ARC) allows for optimal use of DRAM — especially on high DRAM systems • Support for dedicated log and cache devices allows for optimal use of flash-based SSDs
  • 13. ZFS at Joyent • Joyent was the earliest ZFS adopter: becoming (in 2005) the first production user of ZFS outside of Sun • ZFS is one of the four foundational technologies of Joyent’s SmartOS, our illumos derivative • The other three foundational technologies in SmartOS are DTrace, Zones and KVM • Search “fork yeah illumos” for the (uncensored) history of OpenSolaris, illumos, SmartOS and derivatives • Joyent has extended ZFS to provide better support multi-tenant operation with I/O throttling
  • 14. ZFS as the basis for IoT? • ZFS offers commodity hardware economics with enterprise-grade reliability — and obviates the need for cross-machine mirroring for durability • But ZFS is not itself a scale-out distributed system, and is ill suited to become one • Conclusion: ZFS is a good building block for the data explosion from IoT, but not the whole puzzle
  • 15. IoT: Compute abstraction? • To facilitate hypothesis exploration, we need to carefully consider the abstraction for computation • How is data exploration programmatically expressed? • How can this be made to be multi-tenant? • The key enabling technology for multi-tenancy is virtualization — but where in the stack to virtualize?
Hardware-level virtualization?
• The historical answer — since the 1960s — has been to virtualize at the level of the hardware:
• A virtual machine is presented upon which each tenant runs an operating system of their choosing
• There are as many operating systems as tenants
• The historical motivation for hardware virtualization remains its advantage today: it can run entire legacy stacks unmodified
• However, hardware virtualization exacts a heavy toll: operating systems are not designed to share resources like DRAM, CPU, I/O devices or the network
• Hardware virtualization limits tenancy and inhibits performance!
Platform-level virtualization?
• Virtualizing at the application platform layer addresses the tenancy challenges of hardware virtualization…
• ...but at the cost of dictating abstraction to the developer
• With IoT, this is especially problematic: we can expect much more analog data and much deeper numerical analysis — and dependencies on native libraries and/or domain-specific languages
• Virtualizing at the application platform layer poses many other challenges: security, resource containment, language specificity, environment-specific engineering costs
Joyent’s solution: OS containers
• Containers virtualize at the level of the OS, and hit the sweet spot:
• A single OS (single kernel) allows for efficient use of hardware resources, and therefore allows load factors to be high
• Disjoint instances are securely compartmentalized by the operating system
• Containers give customers what appears to be a virtual machine (albeit a very fast one) on which to run higher-level software
• Customers get PaaS when the abstractions work for them, IaaS when they need more generality
• OS-level virtualization allows for high levels of tenancy without dictating abstraction or sacrificing efficiency
• Zones is a bullet-proof implementation of OS-level virtualization — and is the core abstraction in Joyent’s SmartOS
Idea: ZFS + Containers?
Manta: ZFS + Containers!
• By building a sophisticated distributed system on top of ZFS and zones, we have built Manta, an internet-facing object storage system offering in situ compute
• That is, the description of compute can be brought to where objects reside instead of having to backhaul objects to transient compute
• The abstractions made available for computation are anything that can run on the OS...
• ...and as a reminder, the OS — Unix — was built around the notion of ad hoc unstructured data processing, and allows for remarkably terse expressions of computation
Aside: Unix
• When Unix appeared in the early 1970s, it was not just a new system, but a new way of thinking about systems
• Instead of a sealed monolith, the operating system was a collection of small, easily understood programs
• First Edition Unix (1971) contained many programs that we still use today (ls, rm, cat, mv)
• Its very name conveyed this minimalist aesthetic: Unix is a homophone of “eunuchs” — a castrated Multics

“We were a bit oppressed by the big system mentality. Ken wanted to do something simple.” — Dennis Ritchie
Unix: Let there be light
• In 1969, Doug McIlroy had the idea of connecting different components:

“At the same time that Thompson and Ritchie were sketching out a file system, I was sketching out how to do data processing on the blackboard by connecting together cascades of processes”

• This was the primordial pipe, but it took three years to persuade Thompson to adopt it:

“And one day I came up with a syntax for the shell that went along with the piping, and Ken said, ‘I’m going to do it!’”
Unix: ...and there was light

“And the next morning we had this orgy of one-liners.” — Doug McIlroy
The Unix philosophy
• The pipe — coupled with the small-system aesthetic — gave rise to the Unix philosophy, as articulated by Doug McIlroy:
• Write programs that do one thing and do it well
• Write programs to work together
• Write programs that handle text streams, because that is a universal interface
• Four decades later, this philosophy remains the single most important revolution in software systems thinking!
Doug McIlroy v. Don Knuth: FIGHT!
• In 1986, Jon Bentley posed the challenge that became the Epic Rap Battle of computer science history:

“Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.”

• Don Knuth’s solution: an elaborate program in WEB, a Pascal-like literate programming system of his own invention, using a purpose-built algorithm
• Doug McIlroy’s solution shows the power of the Unix philosophy:

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
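McIlroy's pipeline runs verbatim on any Unix system; here is a minimal demonstration on a tiny sample input, with n = 2 (the `sed 2q` prints the top two words):

```shell
printf 'to be or not to be\n' |
  tr -cs A-Za-z '\n' |    # squeeze non-letters into newlines: one word per line
  tr A-Z a-z |            # fold everything to lower case
  sort | uniq -c |        # count occurrences of each word
  sort -rn |              # most frequent first
  sed 2q                  # keep the top 2
```

On this input, “to” and “be” each occur twice, so both surviving lines carry a count of 2.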
Big Data: History repeats itself?
• The original Google MapReduce paper (Dean et al., OSDI ’04) poses a problem disturbingly similar to Bentley’s challenge nearly two decades prior:

“Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair.”

• But the solutions do not adhere to the Unix philosophy...
• ...nor do they make use of the substantial Unix foundation for data processing
• e.g., Appendix A of the OSDI ’04 paper has a 71-line word count in C++ — with nary a wc in sight
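For contrast, the paper's URL-frequency task also collapses to a classic pipeline in the Unix idiom. This is a sketch, assuming a hypothetical common-log-format file named access.log in which the request path is the seventh whitespace-separated field:

```shell
# ⟨URL, 1⟩ "map" and summing "reduce", Unix style:
# extract the request path, then let sort | uniq -c do the counting
awk '{ print $7 }' access.log | sort | uniq -c | sort -rn
```

The output is the access frequency of each URL, most-requested first.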
Manta: Unix for Big Data — and IoT
• Manta allows for an arbitrarily scalable variant of McIlroy’s solution to Bentley’s challenge:

mfind -t o /bcantrill/public/v7/usr/man | mjob create -o -m "tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c" -r "awk '{ x[$2] += $1 } END { for (w in x) { print x[w] " " w } }' | sort -rn | sed ${1}q"

• This description is not only terse, it is high-performing: data is left at rest — with the “map” phase doing heavy reduction of the data stream
• As such, Manta — like Unix — is not merely syntactic sugar; it converges compute and data in a new way
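The reduce phase of the job above merges the per-object word counts emitted by the map phases. Its behavior can be sketched locally by feeding the same awk reducer two hand-written “map outputs” (the counts are hypothetical, for illustration):

```shell
# two simulated map outputs in "count word" (uniq -c) format,
# merged by summing the counts per word, then ordered by frequency
{ printf '   2 to\n   1 be\n'; printf '   1 to\n   3 be\n'; } |
  awk '{ x[$2] += $1 } END { for (w in x) { print x[w] " " w } }' |
  sort -rn
```

The reducer sums be = 1 + 3 = 4 and to = 2 + 1 = 3, so the output is “4 be” followed by “3 to”: exactly the cross-object aggregation that the distributed job performs.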
Manta: CAP tradeoffs
• Eventual consistency represents the wrong CAP tradeoffs for most use cases; we prefer consistency over availability for writes (but still availability for reads)
• Many more details: http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/
• Celebrity endorsement:
Manta: Other design principles
• Hierarchical storage is an excellent idea (ht: Multics); Manta implements proper directories, delimited with a forward slash
• Manta implements a snapshot/link hybrid dubbed a snaplink; it can be used to effect versioning
• Manta has full support for CORS headers
• Manta uses SSH-based HTTP auth for client-side tooling (IETF draft-cavage-http-signatures-00)
• Manta SDKs exist for node.js, R, Go, Java, Ruby and Python — and of course, compute jobs may be in any of these (plus Perl, Clojure, Lisp, Erlang, Forth, Prolog, Fortran, Haskell, Lua, Mono, COBOL, etc.)
• “npm install manta” for the command line interface
Manta and IoT
• We believe compute/data convergence to be a constraint imposed by IoT: stores of record must support computation as a first-class, in situ operation
• We believe that some (and perhaps many) IoT workloads will require computing at the edge — internet transit may be prohibitive for certain applications
• We believe that Unix is a natural way of expressing this computation — and that OS containers are the right way to support this securely
• We believe that ZFS is the only sane storage underpinning for such a system
• Manta will surely not be the only system to represent the confluence of these — but it is the first
Manta: More information
• Product page: http://joyent.com/products/manta
• node.js module: https://github.com/joyent/node-manta
• Manta documentation: http://apidocs.joyent.com/manta/
• IRC, e-mail, Twitter, etc.: #manta on freenode, [email protected], @mcavage, @dapsays, @yunongx, @joyent