The Database Revolution:
   An Historical Perspective

Brandon Byars
   bbyars@thoughtworks.com
John Finlay
   jfinlay@thoughtworks.com
Topics to be Covered
1. Historical Background
  – How Data Management brought us to today
2. Current State
  – RDBMS and recent catalysts for change
3. New-world Data Management problems
  – How NoSQL is being applied
4. NoSQL Issues
  – Hey, nobody’s perfect …
Historical Background
To understand the transition database engines are undergoing,
we need to acknowledge how they originated, and how they evolved.

Let’s go back to the dawn of time…
THE LATE ’60s
Big Box Computing
• 360/85: 4 MB “core”; 1 MIPS; ~$1-5M
• Plus building,
  water cooling
• 29 MB disk x 8,
  $250K
• 4MB drum
• CICS, TSO
Data in Flat Files
• Common file formats: ISAM, later VSAM
• Single-user processing
  – e.g. batch job
• Multi-user processing
  – Generally queued by TP monitor
• Log-based Transaction management
  – Supporting COMMIT/ROLLBACK
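A minimal sketch of the log-based idea (illustrative Python, not any particular TP monitor’s interface): changes are recorded in an undo log before they are applied, so COMMIT discards the log and ROLLBACK replays it in reverse.

# Minimal undo-log sketch: old values are logged before writes are applied,
# so COMMIT makes the changes permanent and ROLLBACK restores the old state.
class Transaction:
    def __init__(self, store):
        self.store = store          # the "data file": a plain dict
        self.log = []               # append-only undo log: (key, old_value)

    def write(self, key, value):
        self.log.append((key, self.store.get(key)))  # log old value first
        self.store[key] = value                      # then apply the change

    def commit(self):
        self.log.clear()            # changes stand

    def rollback(self):
        for key, old in reversed(self.log):          # undo in reverse order
            if old is None:
                self.store.pop(key, None)            # key didn't exist before
            else:
                self.store[key] = old
        self.log.clear()

store = {"balance": 100}
tx = Transaction(store)
tx.write("balance", 50)
tx.rollback()
print(store)  # {'balance': 100}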
Hierarchical Model
[Diagram: an IMS-style hierarchy – a Course record (1101 Calculus) with child segments: Prereqs (1026 Trig, 1024 Algebra), Offerings (110913 Thorvaldsen 2112; 110912 Hayes 127A), an Instructor (22628 Smith), Students (10274699 Barney, 10274484 Finlay, 10228437 Byars), and per-student Marks (Assign 1: 87, Assign 2: 37).]

• 1968: IBM introduces IMS/DB
  – Primary use: BOM (e.g. Apollo)
  – “Programmer as navigator” thru physical pointers
  – What other classes does Byars take?
Inverted List Model
[Diagram: compressed B-tree index blocks (Allen, Atkins, Byggles; Bygum, Chen, Eggers; Finlay, Myers, Rex) over data records (Austin 1625, Benson 1938, Bindle 1493, Byars 1266), with an Address Converter mapping ISNs to block numbers (e.g. ISN 1266 → block 48265).]

   • 1970: Software AG introduces Adabas
   • Similar to modern RDBMS
         – Normalized data with optional MU’s/PE’s
         – Multiple compressed B-tree Indexes per table
         – (Single table) search result sets
   • Data retrieval by record
   • FK references managed, followed in code
Network Model
[Diagram: a CODASYL set – a Parent record chained to its 1st, 2nd … nth Child records through Prior/Next pointers, with Direct pointers and links into a second parent’s chain.]

• 1971: CODASYL navigational model
• 1983: Cullinet IDMS
   – Navigation slightly easier but still in code
   – A little late in the game
Relational Model
• 1970: Codd introduces Relational Calculus
• 1977: IBM spikes System/R
  – Origin of SQL
• 1979: Oracle; 1983, DB2
  – Acceptable performance on “modern” hardware
• Others follow: Informix (1980),
  Sybase/SQL Server (1984), MySQL (1995)
SQL Effect
• Perceived advantages of SQL:
  – “One size fits all”
  – No lock-in
  – Less reliance on code navigation
  – Cheaper cycles can do more
• non-RDBMS vendors forced to support SQL
  – Set results vs. sequential processing
  – “Tuna fish with your cottage cheese”
Today:
    Does a relational model always work?

•   Predefined data structures
•   Expensive / Slow
•   ACID not always required / desired
•   “Standard language” ain’t so standard
•   Not always TPC-A accounting/inventory apps
Example: EAV The RDBMS Way
EntityId        Attribute       Value
10              Name            Brandon Byars
10              Age             34
10              GoodLooking     Yes
10              LikesEAV        No
10              NetWorth        34.57
10              BirthDate       April 29, 1977


• “5th NF” to circumvent rigid data structures
• No data types, no FK’s
• All the RDBMS overhead without the benefits
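To make the pain concrete, here is an illustrative sketch of the pattern in Python with SQLite (the schema and names are invented for the example): every value is stored as text, and reassembling one entity means pivoting rows back into columns by hand, with no type checking or foreign keys along the way.

import sqlite3

# Illustrative EAV schema: one generic table holds every attribute of every
# entity as an untyped string -- the engine can't enforce types or FKs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT)")
rows = [
    (10, "Name", "Brandon Byars"),
    (10, "Age", "34"),
    (10, "NetWorth", "34.57"),
    (10, "BirthDate", "April 29, 1977"),
]
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", rows)

# Reassembling one entity means pivoting rows back into columns by hand.
entity = dict(
    conn.execute("SELECT attribute, value FROM eav WHERE entity_id = ?", (10,))
)
age = int(entity["Age"])  # type conversion is the application's problem
print(entity, age)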
Growth of Processing Power
[Chart: processor speed over time on a logarithmic scale; growth levels off over the last decade.]

                     Used with permission of Ben Klemens
                     “Moore’s law won’t save you”
                     http://modelingwithdata.org
How the World Has Changed
 Every 2 days we create as much information
as we did from the dawn of civilization to 2003
   - Paraphrasing Eric Schmidt, 2011
                        Old World    New World
Processor (MIPS/core)        1         10,000
Processors x Nodes          1x1      8 x 1,000
Nodes / $M                   1            100
Memory/Node (MB)             1        100,000
Disk Data (GB)               1      1,000,000
Disk Storage (KB/$)          1     10,000,000
Users (1000’s)               1        100,000
Support Staff (100’s)        1              1
The NoSQL Argument

ONE SIZE DOESN’T FIT ANYBODY


SINGLE CPU/NODE UNREALISTIC


  IT’S TIME FOR A REWRITE
CAP Theorem

[Diagram: the CAP triangle – a system can sit on only one edge:]
• Consistency + Availability: RDBMS, Neo4J
• Consistency + Partition Tolerance: Bigtable, MongoDB, Redis, quorum systems
• Availability + Partition Tolerance: Dynamo, Cassandra, CouchDB, DNS
• All three at once: unicorns
ACID versus BASE
• ACID: The RDBMS keystone
  – Atomic
  – Consistent
  – Isolated
  – Durable

• BASE: A new alternative
  – Basically Available
  – Soft State
  – Eventually Consistent
Distribution: Types of Failures
•   Memory and network corruption
•   Large clock skew
•   Hung machines
•   Extended and asymmetric network partitions
•   Bugs in other systems used
•   Overflow of file system quotas
•   Planned and unplanned maintenance
•   Disk failure
Catalyst to Change
Problem:
Index the Interwebs
Problem: Index the Interwebs

Inverted Index:
                 Word                 URL
                 nosql                http://nosql.com
                 nosql                http://mapreduce.com
                 nosql                http://hadoop.com

PageRank:
                 URL                  IncomingLinks
                 http://nosql.com     5763
                 http://mapreduce.com 100346
                 http://hadoop.com    87234
MapReduce: Inverted Index
[Diagram: HTML documents flow into the Map phase, which emits (word, URL) pairs; the Reduce phase groups them into (word, list(URL)) – the inverted index:]

                 Word    URL
                 nosql   http://nosql.com
                 nosql   http://mapreduce.com
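A toy single-process version of the flow in Python (the real thing runs distributed over an append-only filesystem; the documents here are invented):

from itertools import groupby

# Map: emit a (word, url) pair for every distinct word in every document.
def map_phase(docs):
    for url, text in docs.items():
        for word in set(text.lower().split()):
            yield (word, url)

# Shuffle + Reduce: sort the pairs, then group them by word
# into (word, list(url)) -- the inverted index.
def reduce_phase(pairs):
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, [url for _, url in group])

docs = {
    "http://nosql.com": "nosql databases",
    "http://mapreduce.com": "nosql at scale",
}
print(dict(reduce_phase(map_phase(docs))))
# {'at': [...], 'nosql': ['http://mapreduce.com', 'http://nosql.com'], ...}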
MapReduce: Incoming Links
[Diagram: HTML documents flow into the Map phase, which emits (targetURL, 1) for every outgoing link; the Reduce phase sums these into (targetURL, count):]

                 URL                    IncomingLinks
                 http://nosql.com       5763
                 http://mapreduce.com   100346
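The incoming-links job is the same shape with a summing reducer – map emits (targetURL, 1) per link, and reduce adds them up. A sketch, with link extraction faked:

from collections import Counter

# Map: emit (target_url, 1) for every outgoing link on every page.
# Reduce: sum the ones per target -- Counter plays the reducer here.
def incoming_links(pages):
    counts = Counter()
    for links in pages.values():
        for target in links:
            counts[target] += 1
    return counts

pages = {
    "http://a.com": ["http://nosql.com", "http://mapreduce.com"],
    "http://b.com": ["http://nosql.com"],
}
print(incoming_links(pages))  # Counter({'http://nosql.com': 2, ...})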
Problem:
Quick Search Results
Bigtable
“A Bigtable is a sparse, distributed, persistent
multidimensional sorted map”
Bigtable
• Sorted map abstraction
• Allows quick random read/write of massive
  amounts of structured data
• Single row transactions
• Unlimited columns per row
Facebook Messages
RowKey                 Message:Offset   Message:Subject
bbyars:hbase:17        34               FB messages
bbyars:nosql:17        56               FB messages
jfinlay:oracle:19      5                Geospatial
jfinlay:oracle:24      87               RAC
jfinlay:postgres:19    32               Geospatial
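The trick is the row-key design: rows are stored in lexicographic order, so searching one user’s inbox for one term is a prefix range scan. A sketch of the behavior with an ordinary sorted list (Bigtable itself is distributed; bisect only mimics the sorted-map abstraction):

import bisect

# Rows kept sorted lexicographically by key, as in Bigtable.
rows = sorted([
    ("bbyars:hbase:17", 34),
    ("bbyars:nosql:17", 56),
    ("jfinlay:oracle:19", 5),
    ("jfinlay:oracle:24", 87),
    ("jfinlay:postgres:19", 32),
])
keys = [k for k, _ in rows]

def prefix_scan(prefix):
    # Range scan: every row whose key starts with prefix, located in O(log n).
    start = bisect.bisect_left(keys, prefix)
    end = bisect.bisect_left(keys, prefix + "\xff")  # just past the prefix
    return rows[start:end]

print(prefix_scan("jfinlay:oracle:"))  # both of jfinlay's 'oracle' messages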
Problem:
e-Commerce at Web Scale
Dynamo
“In particular, applications have received successful
responses (without timing out) for 99.9995% of its
requests and no data loss event has occurred to date.”

•   Incremental scalability
•   Symmetry
•   Decentralization
•   Heterogeneity
Dynamo: Ring Partitioning
[Diagram: nodes A through G placed around a hash ring; Hash(key) lands on the ring and the item is stored on the first node found walking clockwise.]
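A minimal consistent-hashing sketch (replication, virtual nodes, and vector clocks omitted): each node hashes to a position on the ring, and a key belongs to the first node clockwise from its own hash, so adding or removing a node only remaps the keys in one arc.

import bisect
import hashlib

def ring_hash(s):
    # Stable hash mapped onto the ring's key space.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

nodes = ["A", "B", "C", "D", "E", "F", "G"]
ring = sorted((ring_hash(n), n) for n in nodes)
positions = [pos for pos, _ in ring]

def node_for(key):
    # Walk clockwise from Hash(key) to the next node; wrap past the top.
    i = bisect.bisect_right(positions, ring_hash(key)) % len(ring)
    return ring[i][1]

print(node_for("cart:10274699"))  # deterministic owner for this key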
Dynamo: Tuning Knobs

N – nodes that store each item
R – nodes that must participate in a read
W – nodes that must participate in a write
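The knobs trade latency and availability against consistency. With the typical configuration (N, R, W) = (3, 2, 2), every read quorum overlaps every write quorum because R + W > N, so a read touches at least one up-to-date replica. A tiny check of the arithmetic:

def quorum_properties(n, r, w):
    # R + W > N  -> read and write quorums overlap (read-your-writes).
    # W > N / 2  -> two concurrent writes can't both reach a quorum.
    return {"overlapping": r + w > n, "no_write_conflict": w > n / 2}

print(quorum_properties(3, 2, 2))
# {'overlapping': True, 'no_write_conflict': True}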
Big Data: Clones
Map Reduce           Bigtable        Dynamo
[Logos: the 2nd-generation open-source clones of each system.]
Problem:
Social Networking
Graph Databases
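Graph engines make relationship traversal the cheap operation, where an RDBMS would need recursive joins. An illustrative sketch over an invented friends graph: the “who’s most likely to buy you a beer” query is just a friends-of-friends walk.

# A toy graph as adjacency lists; graph databases make this traversal cheap.
friends = {
    "brandon": ["john", "ben"],
    "john": ["brandon", "eric"],
    "ben": ["brandon"],
    "eric": ["john"],
}

def friends_of_friends(person):
    direct = set(friends.get(person, []))
    fof = {f2 for f in direct for f2 in friends.get(f, [])}
    return fof - direct - {person}   # exclude self and direct friends

print(friends_of_friends("brandon"))  # {'eric'}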
Problem:
Flexible Schema
Document Databases
{
    _id:ObjectId("4c4ba5c0672c685e5e8aabf3"),
    name: "Brandon Byars”,
    children: [
      {
         name: "Jackson Byars",
         birthDate: "February 15, 1999"
      },
      {
         name: "Zachary Byars",
         birthdate: "January 15, 2009"
      }]
}
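Querying into the nested structure is native. For example, with MongoDB’s Python driver (a sketch – it assumes a local server, and the database/collection names are invented):

from pymongo import MongoClient

# Assumes a local mongod; the database and collection names are illustrative.
people = MongoClient()["demo"]["people"]
people.insert_one({
    "name": "Brandon Byars",
    "children": [
        {"name": "Jackson Byars", "birthDate": "February 15, 1999"},
        {"name": "Zachary Byars", "birthDate": "January 15, 2009"},
    ],
})

# Dot notation reaches into the embedded array -- no joins, no schema change.
doc = people.find_one({"children.name": "Zachary Byars"})
print(doc["name"])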
Document Databases
Problem:
NoSQL?
Summary
          One size really doesn’t fit all

                PreSQL      RDBMS           NoSQL
Overhead          low         high            low
Data Model       rigid     changeable    rigid/flexible
Code Lock-in     high        medium           high
Distribution      n/a       achievable      easy/hard
Extras            few         many            few
Cost             high      medium-high   low-medium
Future Choices
• RDBMS still the “go-to” solution in most cases
• But, look at new alternatives

Keep Business, not Fashion, in mind
The Database Revolution:
   An Historical Perspective

Brandon Byars
   bbyars@thoughtworks.com
John Finlay
   jfinlay@thoughtworks.com

Editor's Notes

  • #3: TW – Questions welcome
  • #4: Survey – where we’ve been and where we’re headed
  • #5: How did data management evolve to produce our current “universal database” (RDBMS) world? And how did that world influence current NoSQL development efforts? Let’s cover a little history so you’ll understand how we got to the current state, to bring to light what data processing was like before “Relational” became the norm, and to explain that, in some ways, we’ve sorta been there before. So let’s go back to when it all got started. The dawn of time…
  • #6: Hey, this is what it was really like. These were some of my co-workers…
  • #7: Big iron (IBM, Burroughs, Univac) single-mainframe hardware was the ONLY choice. Minimal memory (1–16 MB) meant marginal data retention, so everything needed to be written to disk. 1 MIP meant efficiency was paramount; software was commonly built in Assembler. Compare to my phone: 32 GB plus “external storage” in the form of SD cards/cloud/etc., and a 40 MIP processor in here just for the A/D conversion. SO: How did anyone do data management?
  • #8: The first data management facilities: flat sequential files, with a later improvement of (single-key) indexing. One-user-at-a-time required no synchronization. TP monitors (e.g. CICS) allowed many users “simultaneous” access. The user interface was restricted to 24x80 character screens, with capabilities somewhat equivalent to Internet Explorer… The monitor used logging & 2PC to synchronize across files & guarantee state; thus logging (then as now) was a major limiting factor.
  • #9: In 1968, IMS (from IBM) arrived, bundling data management into a separate engine – specifically, at the time, to manage hierarchical data (e.g. BOM). IMS moved multi-file transaction management from the TP monitor to the data engine. Many new mechanisms for organizing and traversing data were introduced: not just VSAM, but HDAM, HIDAM, HISAM… All were fairly complex to use, because the programmer was the navigator through pointers from record to record – meaning all traversal, logical “JOINs” etc., were explicitly done in code (usually COBOL). Functionality was limited by the data structures. Obviously, some queries were much more expensive & code intensive than others.
  • #10: In 1970, a new style of data management appeared courtesy of Software AG. Flat tables appeared, with multiple indices. Data retrieval and buffering was managed via the data engine itself – no more physical address pointers for the programmer to handle. “Fifth Generation” languages were commonly used. These data structures are identical to relational database structures in a number of ways: multiple B-tree indices means flexible SEARCH capability <describe the B-tree structure and ISN’s>, and data is stored in nth NF relations (denormalized for performance, e.g. aggregates in MU/PE). BUT “join” was still done by the programmer (given this record, go get that child record…) because data retrieval was still a record-by-record request to the engine; foreign key relationships were not constrained by the engine; there was no engine assistance on the “best means of traversal”; and tables were limited to 16M records. Since the data management overhead was minimal (relative to today), this was (and continues to be) an extremely fast engine.
  • #11: The Network data model appeared in 1983, based on work by Charles Bachman, who developed the CODASYL model as extensions to the COBOL language. IMS is actually a restrictive form of this model. The ability to traverse data in many dimensions removed many of the limitations of the hierarchical model. Oddly, this was a full return to the programmer as navigator, albeit with supportive tools/language. It appeared a little late in the game: today it’s almost nonexistent.
  • #12: In 1970, Codd invented Relational Calculus, a mathematically sound means of describing what is wanted, not how to get it. System/R and all following RDBMS engines follow only a subset of Codd’s original “12 Rules”. What is this Relational Model? Take the Inverted List model and add: data statistics generated and used by the engine to decide itself what the “best” traversal method is; retrieval of data as a single “set”; and enforcement of constraints (values, FK’s, etc.) on the data. Of course, this increased overhead dramatically – up to 95% of CPU demand is logging, locking/latching, 2-3PC, enforcing constraints, etc. By about 1988, along with processors that achieved approx 10 MIPS, performance was acceptable; but people knew it would only get better, given Moore’s Law, and were willing to accept the cost for the perceived advantages. So the marketplace for RDBMS exploded, and all other engines became passé. SQL SERVER! There’s our tie-in to this conference.
  • #13: Those “perceived advantages”: One size – data is normalized, you can do anything with it; traversal is no longer a programmer issue, meaning, again, the programmer just specifies what is wanted, not how to get it, and the data structures could be altered with no change in how that data was accessed. Lock-in – hey, SQL is a universal language! No longer forced to stay with a single vendor! Prices will come down! Less reliance on code navigation – once you’ve understood SQL… Cheaper cycles – CPUs becoming powerful enough to take ever more responsibility away from the developer. Tuna fish – quote from Peter Pagé, CTO of Software AG after 3 years of fighting: “There is a massive impedance mismatch between set logic and the sequential processing that is the norm for almost all languages. But OK, if you want <>, we’ll give it to you.”
  • #14: So for the last 20 years, data processing has been largely handled via relational databases. CLICK But data structures are still fairly rigid: columns are strictly predefined, and it can’t easily handle BIG data/documents/key-value/entity-attribute-value columns. CLICK But enterprise-level engines are still very expensive – in absolute $ cost, speed and overhead. CLICK But absolute data accessibility, consistency, isolation, etc. is often unnecessary. Even the management of “optimal traversal” is not required most of the time, since most queries are now compiled once as stored procedures, obviating the original premise of users directly and dynamically using the database engine in an ad hoc fashion. CLICK But SQL has a different flavor on every engine, so there is still a high degree of engine lock-in. CLICK But we’re not always doing TPC-A type transactions (a standard benchmark (now defunct) that WAS used to measure db performance).
  • #15: An entity (a “table”) is defined purely by a dictionary’s definition of what columns are contained therein. Columns are merely named attributes with defined value types, so a new column can be added in no time. Conceivably, beyond the simple dictionary definitions, all data can be maintained in one “table”. CLICK The so-called “5th Normal Form” data architecture was designed to circumvent the RDBMS limitation on dynamic metadata definitions, allowing new entity attributes to be added on-the-fly. It’s the architecture used by, e.g., NetCracker and OpenMRS. Theoretically, this is an interesting idea. In reality, CLICK no real data types (all data stored as characters), no foreign keys, no constraint management, and no truly effective indexing CLICK means we are still incurring ALL the RDBMS overhead without any of the benefits. Now of course, this design works, and is “future-proof” assuming Moore’s Law continues to hold. BUT…
  • #16: Here’s a chart of CPU processing power over time. There are some interesting things to note. The scale is logarithmic, meaning incredible growth in processing power has occurred. Notice how we’ve pretty well leveled out over the last 10 years? Sneaky tricks to make it look like processors are continuing the growth trend have given way to improvements in memory bandwidth, power consumption, etc. And all current database engine architectures were designed when 1 MIP CPUs were the norm and the engine ran on one node. Now processor speed isn’t the whole story; the data processing world has changed over time in numerous ways…
  • #17: Here’s a chart (normalized to 1 for approximately 1970) listing some interesting growth factors, with the results of the previous slide shown as the first entry. These are rough sizes … what a company with “large data processing needs” would see. I tried not to be picky … an order of magnitude comparison was good enough for these purposes. Drastic reductions in cost; unbelievable improvements in capacity and power. Are we taking advantage of this? With RDBMS, not universally. Consider how many physical nodes we can set up for the same cost as before: 3 orders of magnitude growth. Now imagine spreading an RDBMS over 1,000 nodes. It would be an abomination. Or, for example, let’s look at “Disk Data”, or how much data was/is retained on disk in a large data processing shop. I worked for the Los Angeles County Justice Department from ’87 to about ’91. The data center was like an airplane hangar; disk storage boxes – 3330’s and 3350’s – were arrayed in a vast area like the closing scene of Indiana Jones. People’s jaws would drop when they heard LA County had 1 TB of disk storage, and needed more. How could anyone need that much storage? Like, 100K per county resident! LA County’s problem was not that they needed more storage; their problem was that they couldn’t get more electricity to run the storage. They had maxed out the county’s electrical infrastructure. In contrast: about a month ago, I bought 4 TB of storage, 2 drives, for a dictionary-size RAID-1 NAS in my home, for $250. I told a friend of mine about it. He said, in all seriousness, “Oooh, that won’t last very long.” How could an RDBMS provide acceptable response times when managing a petabyte? Or (soon) an exabyte? Imagine searching that. Without a decent index. Which is what Data Mining is all about.
  • #18: So some of the problems with relational databases are obvious. CLICK The mass migration to RDBMS is like one of those “In Soviet Russia” jokes: In Database Management, RDBMS runs you! CLICK Today, medium size datacenters house dozens if not hundreds of servers, whereas a relational database generally lives on one server only. Data volumes are rapidly outstripping a single engine’s ability to navigate effectively. Enterprise database engines still cost huge amounts. Procuring open source software is becoming the norm. CLICK What happens when we take old and/or new data models and apply them to modern h/w? Let’s start with a couple of useful concepts...
  • #19: Eric Brewer’s 2000 ACM keynote speech introduced the CAP theorem. He argued that these three concepts – C, A, and P – in various combinations, were what was possible for a data engine to provide. Consistency = strong, “read your writes” consistency – all clients see the same view (sometimes you have to say no). Availability = if you can talk to a node, you can get an answer. Partition Tolerance = data on multiple nodes: anything less than total network failure still functions. His theory goes on to say that no data engine can satisfy all three requirements simultaneously. See the unicorns? Example: CA requires all nodes to be in constant contact with each other. An RDBMS is generally focused on a single node; partitioning-without-pain is out of the question, because a network failure can lose a shard, which can kill the whole. To achieve combinations other than CA, to take advantage of modern hardware, and to handle problems that don’t fit nicely in the relational world, we need to alter some of our “requirements”, specifically those dealing with ACID transactions.
  • #20: These acronyms provide an approximate description of the difference between RDBMS and NoSQL engines. ACID represents the “rules” we’ve come to know and live by wrt relational databases. They present a pessimistic approach, consistent at the end of every operation; it looks like a serial rather than parallel process. Atomic: all-or-nothing transactions. Consistent: data always agrees. Isolated: no inter-transaction interference. Durable: no lost data. This is what demands the 95% overhead, and mostly forces the engine to remain in a single server. There’s an alternative model that opens the doors for other types of data processing: CLICK BASE is an artificial acronym (to counteract ACID): some parts are always basically available (even though not all parts may be), providing soft-state services (i.e. the possibility of data inconsistencies or versioned data), with eventual consistency guaranteed. It’s an optimistic approach, accepting the fact that consistency is always in a state of flux. Basically Available, Soft State, Eventually Consistent – sounds like a poorly worded personal entry on a dating website. And with that, I’ll hand the presentation over to my esteemed colleague, BB. Thank you.
  • #21: Much of nosql attacks Brewer’s P in CAP; but look out!
  • #24: Data warehouse problem, not latency sensitive
  • #25: Note file icon, distributed FS. Append-only FS.
  • #26: Google’s golden hammer. FB uses Hadoop for all data warehouse ops (15 TB/day). Yahoo: 82 PB of data, 40k machines, large clusters of 4000 machines.
  • #27: * Low latency search of entire web, look-ahead
  • #28: Column families – versioning, compression, bloom filter policies. Reversed DNS name, two column families, different timestamped values. Rows stored in lexicographic order, partitioned automatically – allows efficient range queries with smart key selection.
  • #29: * Google Earth, Google Analytics. * Facebook Messages, Yahoo web crawl cache, StumbleUpon, Twitter (analytics takes most storage, not tweets).
  • #30: * Inverted index – can only search your inbox; userId first. Works for type-ahead, partial word searches. Looking for multiple words – joined at app layer. * Google Analytics -> key (website name, time) -> allows efficient chronological queries, augmented by a table with predefined summaries populated by M/R. * Google Earth -> spatial key guarantees contiguity between rows, preprocessing via M/R, serves low latency. Key design – range queries + auto sharding.
  • #31: * Always let customers add items to shopping cart
  • #32: Scaling with minimal impact to operators and system. No distinguished nodes (like in Hadoop); P2P techniques. Not just different hardware – work distribution must be proportional to server capability. Allows updating sections of infrastructure at a time.
  • #33: Consistent hashing – the output range of the hash function is treated as a ring (the largest hash value wraps around). Each node in the system is assigned a random value that represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the key to find its position on the ring, then walking clockwise. Vector clocks allow identifying inconsistencies at READ time – developers deal with it. Conflict resolution can be done by the app (“merge”) or the data store, which has fewer options (e.g. “last write wins”).
  • #34: N = nodes to store data. R = nodes that must participate in a read (increase for consistency, decrease for latency). W = nodes that must participate in a write (decrease for availability, increase for consistency). Typical configuration = (3, 2, 2). Can optimize (e.g. product catalog).
  • #35: * Facebook doesn’t have a data warehouse – they use Hadoop and HBase for all analytics.
  • #37: CODASYL network model of the ’70s. An RDBMS can tell you the avg salary of everyone; a graph can tell you who’s most likely to buy you a beer.
  • #38: Also: PLM, Fraud detection, intelligence activities, genomics
  • #40: Hierarchical model (IMS). JavaScript API. Couchbase = mobile support.
  • #43: http://browsertoolkit.com/fault-tolerance.png
  • #45: M/R problems: * Debugging
  • #46: So: in some ways we’ve returned to the pre-RDBMS era. NoSQL brings us full-circle, back to engines that fit today’s hardware configurations, apply to particular problem domains, and demand high development lock-in.
  • #47: Don’t be fooled into thinking RDBMS is dead.
  • #48: TW – Questions welcome