P. Taylor Goetz
Development Lead, Health Market Science
tgoetz@healthmarketscience.com
@ptgoetz
Agenda
     Cassandra @ HMS
     Storm Overview
     Storm-Cassandra
     Examples / Demo
     Future: Trident
Our Products




   Master Data Management
     Good, bad doctors?
   Prescriber eligibility and remediation.
Cassandra to the Rescue
   [Diagram labels: 1000’s of Feeds; C*; Masterfile; Δt]
   Big Data for us == Variety of Data
But…
 Search unstructured data
 Real-time Analytics / Reporting
 Transactional Processing
     Changes reflected immediately.
     Wide-row Indexes
What might that look like?
   [Diagram labels: C*; wide-row index; RDBMS; "I’m happy"]
   Provide for Polyglot Persistence
What we did wrong…




 Could not react to transactional changes
 Needed extra logic to track what changed
 Took too long
C* at HMS
   Load, Standardize, Match, Consolidate
     Write results to C*


   Track Changes over Time
     Practitioner Data
     Feed Quality
C* at HMS
   Best Practices
     Prefer Write over Read (esp. avoid Read before Write)
     Avoid Queries/Scans
      ○ Fetch by key whenever possible
      ○ Put Comparators to work
      ○ Pre-compute whenever possible
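
The three practices above (fetch by key, let the comparator sort, pre-compute at write time) can be illustrated with a small in-memory model of a wide-row index. This is only a sketch: `WideRowIndex`, the `by_last_name` row and the column naming are hypothetical, and a `TreeMap` stands in for a column family whose comparator keeps column names sorted. No Cassandra client is involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a wide-row index (not Cassandra client code).
class WideRowIndex {
    // row key -> columns kept sorted by column name, as a comparator would
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    // Pre-compute the index entry at write time: one column per indexed item.
    void add(String rowKey, String columnName, String value) {
        TreeMap<String, String> cols = rows.get(rowKey);
        if (cols == null) {
            cols = new TreeMap<>();
            rows.put(rowKey, cols);
        }
        cols.put(columnName, value);
    }

    // Fetch one row by key, then slice a contiguous column range by prefix.
    List<String> slice(String rowKey, String prefix) {
        TreeMap<String, String> cols = rows.get(rowKey);
        if (cols == null) {
            return new ArrayList<>();
        }
        // '\uffff' sorts after every other char, so it bounds the prefix range.
        return new ArrayList<>(
            cols.subMap(prefix, true, prefix + '\uffff', false).values());
    }

    public static void main(String[] args) {
        WideRowIndex index = new WideRowIndex();
        index.add("by_last_name", "smith:dr-1", "dr-1");
        index.add("by_last_name", "smythe:dr-3", "dr-3");
        index.add("by_last_name", "smith:dr-2", "dr-2");
        System.out.println(index.slice("by_last_name", "smith:"));
    }
}
```

The lookup touches a single row and reads a sorted range of columns, so no scan or secondary query is needed at read time.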
How?
   Treat All Data as Immutable
     Updates are inserts with new version/timestamp


   Data Model
     Heavy use of composites
     Timestamps/Versions in Keys


   Treat Feeds as Real-Time Streams
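
As a rough illustration of "updates are inserts" with versions in the key, the sketch below models a composite key of (entity id, timestamp) in a sorted map. `VersionedStore`, the key encoding and the sample ids are assumptions made for the example; it assumes ids contain no `:` character and timestamps are non-negative.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a column family keyed by a
// composite of (entity id, version timestamp). Not Cassandra client code.
class VersionedStore {
    // Composite key encoded as "id:zero-padded-timestamp" so the map's
    // String ordering sorts the versions of one id chronologically.
    private final TreeMap<String, String> rows = new TreeMap<>();

    private static String key(String id, long timestamp) {
        return id + ":" + String.format("%019d", timestamp);
    }

    // An "update" never overwrites: it inserts a new version.
    void insert(String id, long timestamp, String value) {
        rows.put(key(id, timestamp), value);
    }

    // Latest version = greatest key at or below (id, Long.MAX_VALUE).
    String latest(String id) {
        Map.Entry<String, String> e = rows.floorEntry(key(id, Long.MAX_VALUE));
        return (e != null && e.getKey().startsWith(id + ":")) ? e.getValue() : null;
    }

    // Full history of one entity, oldest first, read as one key range.
    Collection<String> history(String id) {
        return rows.subMap(key(id, 0L), true, key(id, Long.MAX_VALUE), true).values();
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.insert("dr-1", 100L, "license: active");
        store.insert("dr-1", 200L, "license: suspended"); // an "update"
        System.out.println(store.latest("dr-1"));
        System.out.println(store.history("dr-1"));
    }
}
```

Because nothing is overwritten, change tracking over time falls out of the data model: the history is simply the key range for one id.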
What Storm is to us…
   [Diagram labels: CRUD Op; ETL; Dimensional Counts; Enrichment; SoR; RDBMS; Fuzzy Index]
   A High Throughput Data Processing Pipeline
Cassandra and Storm at Health Market Science
Storm Overview
 Open-Sourced by Twitter in 2011
 Distributed Realtime Computation System
 Fault Tolerant
 Highly Scalable
 Guaranteed Processing
 Operates on one or more streams of data
Anatomy of a Storm Cluster

   Nimbus
     Master Node
   Zookeeper
     Cluster Coordination
   Supervisors
     Worker Nodes
Storm Primitives
   Streams
     Unbounded sequence of tuples
   Spouts
     Stream Sources
   Bolts
     Unit of Computation
   Topologies
     Combination of n Spouts and n Bolts
     Defines the overall “Computation”
Storm Spouts
   Represents a source (stream) of data
     Queues (JMS, Kafka, Kestrel, etc.)
     Twitter Firehose
     Sensor Data
   Emits “Tuples” (Events) based on source
     Primary Storm data structure
     Set of Key-Value pairs
Storm Bolts
 Receive Tuples from Spouts or other Bolts
 Operate on, or React to Data
     Functions/Filters/Joins/Aggregations
     Database writes/lookups
   Optionally emit additional Tuples
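
To make the spout/bolt roles concrete, here is a single-threaded toy model in plain Java. It does not use Storm: `MiniBolt`, `SplitBolt`, `CountBolt` and `MiniPipeline` are hypothetical names, tuples are modelled as maps of key-value pairs, and a simple loop plays the part of the spout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified interface; Storm's real bolt API differs.
interface MiniBolt {
    // Receive one tuple; return zero or more tuples to emit downstream.
    List<Map<String, Object>> execute(Map<String, Object> tuple);
}

// A "function" bolt: one sentence tuple in, one word tuple out per word.
class SplitBolt implements MiniBolt {
    public List<Map<String, Object>> execute(Map<String, Object> tuple) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String w : ((String) tuple.get("sentence")).split("\\s+")) {
            Map<String, Object> t = new HashMap<>();
            t.put("word", w);
            out.add(t);
        }
        return out;
    }
}

// An "aggregation" bolt: keeps running counts and emits nothing.
class CountBolt implements MiniBolt {
    final Map<String, Integer> counts = new HashMap<>();

    public List<Map<String, Object>> execute(Map<String, Object> tuple) {
        counts.merge((String) tuple.get("word"), 1, Integer::sum);
        return new ArrayList<>();
    }
}

class MiniPipeline {
    static Map<String, Integer> run(List<String> sentences) {
        SplitBolt split = new SplitBolt();
        CountBolt count = new CountBolt();
        for (String s : sentences) { // this loop plays the spout
            Map<String, Object> tuple = new HashMap<>();
            tuple.put("sentence", s);
            for (Map<String, Object> word : split.execute(tuple)) {
                count.execute(word);
            }
        }
        return count.counts;
    }

    public static void main(String[] args) {
        List<String> feed = new ArrayList<>();
        feed.add("the cow jumped over the moon");
        System.out.println(run(feed));
    }
}
```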
Storm Topologies
 Data flow between spouts and bolts
 Routing of Tuples between spouts/bolts
     Stream “Groupings”
 Parallelism of Components
 Long-Lived
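
One common stream grouping routes tuples by the value of chosen fields, so that equal values always reach the same task. The sketch below shows that idea with a hash modulo the task count. `FieldsGroupingSketch` and `chooseTask` are illustrative names, and the exact hashing Storm uses internally is not taken from the slides.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: routes a tuple to a task by hashing its grouping fields.
class FieldsGroupingSketch {
    // Equal grouping values always map to the same task index in [0, numTasks).
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        for (String word : Arrays.asList("storm", "cassandra", "storm")) {
            int task = chooseTask(Arrays.<Object>asList(word), 4);
            System.out.println(word + " -> task " + task);
        }
    }
}
```

This is why a counting bolt can keep per-word state locally: every tuple for a given word arrives at the same task, whatever the parallelism.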
Storm and Cassandra
   Use Cases:
     Write Storm Tuple data to C*
      ○ Computation Results
      ○ Pre-computed indices


     Read data from C* and emit Storm Tuples
      ○ Dynamic Lookups




                      http://github.com/hmsonline/storm-cassandra
Storm Cassandra Bolt Types
   [Diagram labels: CassandraBolt; CassandraLookupBolt; C*]
   CassandraBolt
     Writes data to Cassandra
     Available in Batching and Non-Batching
   CassandraLookupBolt
     Reads data from Cassandra
Storm-Cassandra Project
   Provides generic Bolts for writing/reading
    Storm Tuples to/from C*


   [Diagram labels: Tuple; Tuple Mapper; Rows; Tuples; Columns Mapper; Columns; C*]
Storm-Cassandra Project
   TupleMapper Interface
     Tells the CassandraBolt how to write a tuple to an
     arbitrary data model


   Given a Storm Tuple:
     Map to Column Family
     Map to Row Key
     Map to Columns
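
The three mapping steps above might look roughly like the following. This is a sketch under assumptions: `SketchTupleMapper` and `KeyFieldMapper` are hypothetical, tuples are modelled as plain maps, and the real storm-cassandra `TupleMapper` interface may use different method names and types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shape of a tuple-to-row mapper; the real storm-cassandra
// TupleMapper interface may use different names and types.
interface SketchTupleMapper {
    String mapToColumnFamily(Map<String, Object> tuple);
    String mapToRowKey(Map<String, Object> tuple);
    Map<String, String> mapToColumns(Map<String, Object> tuple);
}

// Uses one tuple field as the row key and writes every other field as a column.
class KeyFieldMapper implements SketchTupleMapper {
    private final String columnFamily;
    private final String rowKeyField;

    KeyFieldMapper(String columnFamily, String rowKeyField) {
        this.columnFamily = columnFamily;
        this.rowKeyField = rowKeyField;
    }

    public String mapToColumnFamily(Map<String, Object> tuple) {
        return columnFamily;
    }

    public String mapToRowKey(Map<String, Object> tuple) {
        return String.valueOf(tuple.get(rowKeyField));
    }

    public Map<String, String> mapToColumns(Map<String, Object> tuple) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : tuple.entrySet()) {
            if (!e.getKey().equals(rowKeyField)) {
                columns.put(e.getKey(), String.valueOf(e.getValue()));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("word", "storm");
        tuple.put("count", 3);
        KeyFieldMapper mapper = new KeyFieldMapper("word_counts", "word");
        System.out.println(mapper.mapToColumnFamily(tuple) + " / "
            + mapper.mapToRowKey(tuple) + " / " + mapper.mapToColumns(tuple));
    }
}
```

Keeping the mapping behind an interface is what lets one generic bolt write to an arbitrary data model.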


Storm-Cassandra Project
   ColumnsMapper Interface
     Tells the CassandraLookupBolt how to transform a
     C* row into a Storm Tuple


   Given a C* Row Key and list of Columns:
     Return a list of Storm Tuples
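
The reverse direction can be sketched the same way. `SketchColumnsMapper` and `OneTuplePerColumnMapper` are hypothetical names, and a tuple is modelled as a plain list of values; the real `ColumnsMapper` interface may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical shape of a row-to-tuples mapper; the real storm-cassandra
// ColumnsMapper interface may use different names and types.
interface SketchColumnsMapper {
    List<List<Object>> mapToValues(String rowKey, Map<String, String> columns);
}

// Emits one (rowKey, columnName, columnValue) tuple per column in the row.
class OneTuplePerColumnMapper implements SketchColumnsMapper {
    public List<List<Object>> mapToValues(String rowKey, Map<String, String> columns) {
        List<List<Object>> tuples = new ArrayList<>();
        for (Map.Entry<String, String> c : columns.entrySet()) {
            tuples.add(Arrays.<Object>asList(rowKey, c.getKey(), c.getValue()));
        }
        return tuples;
    }

    public static void main(String[] args) {
        Map<String, String> row = new java.util.LinkedHashMap<>();
        row.put("specialty", "cardiology");
        row.put("state", "PA");
        System.out.println(new OneTuplePerColumnMapper().mapToValues("dr-1", row));
    }
}
```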




Storm-Cassandra Project
   Current State:
     Version 0.4.0
     Uses Astyanax Client
     Several out-of-the-box *Mapper Implementations:
      ○ Basic Key-Value Columns
      ○ Value-less Columns
      ○ Counter Columns
      ○ Lookup by row key
      ○ Lookup by range query
     Composite Key/Column Support
     Trident support

Storm-Cassandra Project
   Future Plans:
     Switch to CQL
     Enhanced Trident Support




Word Count Demo




          http://github.com/hmsonline/storm-cassandra
DRPC
Reach Demo
Trident
   Provides a higher-level abstraction for stream
    processing
     Constructs for state management and Batching
 Adds additional primitives that abstract away
  common topological patterns
 Deprecates transactional topologies
 Distributed with Storm
Sample Trident Operations
   Partition Local
     Functions      ( execute(x) → x + y )
     Filters        ( isKeep(x) → 0,x )
     PartitionAggregate
      ○ Combiner    ( pairwise combining )
      ○ Reducer     ( iterative accumulation )
      ○ Aggregator ( byoa)
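
The difference between "pairwise combining" and "iterative accumulation" can be shown with two small interfaces. These are simplified, hypothetical versions written for this example (tuples are plain strings); they are not Trident's actual aggregator interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical versions of the two aggregator styles.
interface SketchCombiner<T> {
    T init(String tuple);  // turn one tuple into a partial result
    T combine(T a, T b);   // merge two partial results (pairwise)
    T zero();              // result for an empty partition
}

interface SketchReducer<T> {
    T init();                          // starting accumulator
    T reduce(T current, String tuple); // fold one tuple into the accumulator
}

class AggregatorSketch {
    static <T> T runCombiner(SketchCombiner<T> c, List<String> tuples) {
        T acc = c.zero();
        for (String t : tuples) {
            acc = c.combine(acc, c.init(t));
        }
        return acc;
    }

    static <T> T runReducer(SketchReducer<T> r, List<String> tuples) {
        T acc = r.init();
        for (String t : tuples) {
            acc = r.reduce(acc, t);
        }
        return acc;
    }

    // Counts tuples by combining partial counts.
    static final SketchCombiner<Long> COUNT = new SketchCombiner<Long>() {
        public Long init(String tuple) { return 1L; }
        public Long combine(Long a, Long b) { return a + b; }
        public Long zero() { return 0L; }
    };

    // Sums tuple lengths by folding each tuple into a running total.
    static final SketchReducer<Integer> TOTAL_LENGTH = new SketchReducer<Integer>() {
        public Integer init() { return 0; }
        public Integer reduce(Integer current, String tuple) {
            return current + tuple.length();
        }
    };

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("storm");
        words.add("trident");
        System.out.println(runCombiner(COUNT, words));
        System.out.println(runReducer(TOTAL_LENGTH, words));
    }
}
```

Because a combiner's partial results can be merged in any grouping, partial aggregation can happen within each partition before results are brought together; a reducer has to see the tuples one at a time.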
A sample topology
// Word count: split sentences into words, group by word,
// and keep a running count per word in external state.
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"),
              new Split(),
              new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(
            MemcachedState.opaque(serverLocations),
            new Count(),
            new Fields("count"))
        .parallelismHint(6);

                     https://github.com/nathanmarz/storm/wiki/Trident-state
Trident State
Sequenced writes by batch/transaction id.
   Spouts
     Transactional
      ○ Batch contents never change
     Opaque
      ○ Batch contents can change
   State
     Transactional
      ○ Store tx_id with counts to maintain sequencing of writes.
     Opaque
      ○ Store previous value in order to overwrite the current value
        when contents of a batch change.
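
The two state flavors above can be sketched as update rules for a stored count. This is a simplified model, not Trident's implementation: the class and field names are hypothetical, and it assumes batches are applied in txid order with `-1` meaning "no batch applied yet".

```java
// Value stored per key by a transactional state (hypothetical field names).
class TransactionalValue {
    long txid = -1; // id of the last batch applied
    long count = 0;
}

// Value stored per key by an opaque state (hypothetical field names).
class OpaqueValue {
    long txid = -1; // id of the last batch applied
    long prev = 0;  // value before that batch was applied
    long curr = 0;  // value after that batch was applied
}

class TridentStateSketch {
    // Transactional: a replayed batch has identical contents, so if its
    // txid is already stored the update is skipped.
    static void applyTransactional(TransactionalValue v, long txid, long batchCount) {
        if (v.txid == txid) {
            return;
        }
        v.count += batchCount;
        v.txid = txid;
    }

    // Opaque: a replayed batch may have different contents, so a replay
    // is recomputed from prev and overwrites the earlier attempt.
    static void applyOpaque(OpaqueValue v, long txid, long batchCount) {
        if (v.txid == txid) {
            v.curr = v.prev + batchCount;
        } else {
            v.prev = v.curr;
            v.curr = v.curr + batchCount;
            v.txid = txid;
        }
    }

    public static void main(String[] args) {
        OpaqueValue v = new OpaqueValue();
        applyOpaque(v, 1, 2); // batch 1 contributes 2
        applyOpaque(v, 2, 3); // batch 2, first attempt, contributes 3
        applyOpaque(v, 2, 1); // batch 2 replayed with different contents
        System.out.println(v.prev + " -> " + v.curr);
    }
}
```

The extra `prev` field is the cost of tolerating spouts whose replayed batches can change.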
Shameless Shoutouts
   HMS (https://github.com/hmsonline/)
     storm-cassandra
     storm-elastic-search
     storm-jdbi (coming soon)

   ptgoetz (https://github.com/ptgoetz)
     storm-jms
     storm-signals
P. Taylor Goetz
Development Lead, Health Market Science
tgoetz@healthmarketscience.com
@ptgoetz
