This document discusses using MapReduce with Cassandra. It describes how writing to Cassandra from MapReduce has always been possible, while reading was enabled starting with Cassandra 0.6.x. Using MapReduce with Cassandra provides analytics capabilities and avoids single points of failure compared to MapReduce with HBase. The document covers setup and configuration considerations like locality, and provides examples of a separate cluster approach and hybrid cluster approach. It also outlines future work like improving output to Cassandra and adding Hive support.
Apache Cassandra operations have a reputation for being simple on single-datacenter deployments and/or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters with high volume and/or high throughput: basic Apache Cassandra operations such as repairs, compactions, or hints delivery can have dramatic consequences even on a healthy high-latency multi-datacenter cluster.
In this presentation, Julien will first go through Apache Cassandra multi-datacenter concepts, then show multi-datacenter operations essentials in detail: bootstrapping new nodes and/or datacenters, repair strategy, Java GC tuning, OS tuning, Apache Cassandra configuration, and monitoring.
Based on his three years' experience managing a multi-datacenter cluster on Apache Cassandra 2.0, 2.1, 2.2, and 3.0, Julien will give you tips on how to anticipate and prevent or mitigate issues related to basic Apache Cassandra operations with a multi-datacenter cluster.
About the Speaker
Julien Anguenot VP Software Engineering, iland Internet Solutions, Corp
Julien currently serves as iland's Vice President of Software Engineering. Prior to joining iland, Mr. Anguenot held tech leadership positions at several open source content management vendors and tech startups in Europe and in the U.S. Julien is a longtime open source software advocate, contributor, and speaker: a Zope, ZODB, and Nuxeo contributor and a member of the Zope and OpenStack foundations, he has spoken at ApacheCon, Cassandra Summit, OpenStack Summit, The WWW Conference, and EuroPython.
Apache Cassandra operations have a reputation for being quite simple on single-datacenter and/or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters: basic operations such as repair, compaction, or hints delivery can have dramatic consequences even on a healthy cluster.
In this presentation, Julien will go through Cassandra operations in detail: bootstrapping new nodes and/or datacenters, repair strategies, compaction strategies, GC tuning, OS tuning, removal of large batches of data, and Apache Cassandra upgrade strategy.
Julien will give you tips and techniques on how to anticipate issues inherent to a multi-datacenter cluster: how and what to monitor, hardware and network considerations, as well as data-model and application-level design mistakes and anti-patterns that can affect your multi-datacenter cluster's performance.
To date, Hadoop usage has focused primarily on offline analysis--making sense of web logs, parsing through loads of unstructured data in HDFS, etc. But what if you want to run map/reduce against your live data set without affecting online performance? Combining Hadoop with Cassandra's multi-datacenter replication capabilities makes this possible. If you're interested in getting value from your data without the hassle and latency of first moving it into Hadoop, this talk is for you. I'll show you how to connect all the parts, enabling you to write map/reduce jobs or run Pig queries against your live data. As a bonus I'll cover writing map/reduce in Scala, which is particularly well-suited for the task.
Introduction to Cassandra: Replication and Consistency | Benjamin Black
A short introduction to replication and consistency in the Cassandra distributed database. Delivered April 28th, 2010 at the Seattle Scalability Meetup.
This document introduces Apache Cassandra, a distributed column-oriented NoSQL database. It discusses Cassandra's architecture, data model, query language (CQL), and how to install and run Cassandra. Key points covered include Cassandra's linear scalability, high availability and fault tolerance. The document also demonstrates how to use the nodetool utility and provides guidance on backing up and restoring Cassandra data.
Co-Founder and CTO of Instaclustr, Ben Bromhead's presentation at the Cassandra Summit 2016, in San Jose.
This presentation will show how to create truly elastic Cassandra deployments on AWS, allowing you to scale and shrink your large Cassandra deployments multiple times a day. Leveraging a combination of EBS-backed disks, JBOD, token pinning, and our previous work on bootstrapping from backups, you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
Apache Cassandra is a free, distributed, open source, and highly scalable NoSQL database that is designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and tunable consistency. Cassandra's architecture allows it to spread data across a cluster of servers and replicate across multiple data centers for fault tolerance. It is used by many large companies for applications that require high performance, scalability, and availability.
This document summarizes challenges with large partitions in Cassandra and potential solutions. When a large partition is read, the key cache can cause garbage collection pressure as it stores the partition's index on the Java heap. Currently, the index is stored off-heap only if the partition exceeds a configurable size, otherwise it is kept on-heap. Fully migrating the key cache off-heap is another potential solution but incurs serialization costs.
This document provides an agenda and introduction for a presentation on Apache Cassandra and DataStax Enterprise. The presentation covers an introduction to Cassandra and NoSQL, the CAP theorem, Apache Cassandra features and architecture including replication, consistency levels and failure handling. It also discusses the Cassandra Query Language, data modeling for time series data, and new features in DataStax Enterprise like Spark integration and secondary indexes on collections. The presentation concludes with recommendations for getting started with Cassandra in production environments.
Detail behind the Apache Cassandra 2.0 release and what is new in it, including Lightweight Transactions (compare and swap), eager retries, improved compaction, triggers (experimental), CQL cursors, and more!
Whether running load tests or migrating historic data, loading data directly into Cassandra can be very useful to bypass the system’s write path.
In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load this data rapidly into your cluster using sstableloader. We'll also review different use cases for when you should and shouldn't use this method.
C* for Deep Learning (Andrew Jefferson, Tractable) | Cassandra Summit 2016 | DataStax
A deep learning startup has a requirement for a robust and scalable data architecture. Training a Deep Neural Network requires 10s-100s of millions of examples consisting of data and metadata. In addition to training it is necessary to support test/validation, data exploration and more traditional data science analytics workloads. As a startup we have minimal resources and an engineering team of 1.
Cassandra, Spark and Kafka running on Mesos in AWS is a scalable architecture that is fast and easy to set up and maintain to deliver a data architecture for Deep Learning.
About the Speaker
Andrew Jefferson VP Engineering, Tractable
A software engineer specialising in realtime data systems. I've worked at companies from Startups to Apple on applications ranging from Ticketing to Genetics. Currently building data systems for training and exploiting Deep Neural Networks.
Cassandra is the dominant data store used at Netflix, and its health is critical to many of its services. In this talk we will share details of the recent redesign of our health monitoring system and how we leveraged a reactive stream processing system to give us a real-time view of our entire fleet while dramatically improving accuracy and reducing false alarms in our alerting.
About the Speaker
Jason Cacciatore Senior Software Engineer, Netflix
Jason Cacciatore is a Senior Software Engineer at Netflix, where he's been working for the past several years. He's interested in stateful distributed systems and has a diverse background in technology. In his spare time he enjoys spending time with his wife and two sons, reading non-fiction, and watching Netflix documentaries.
Cassandra concepts, patterns and anti-patterns | Dave Gardner
The document discusses Cassandra concepts, patterns, and anti-patterns. It begins with an agenda that covers choosing NoSQL, Cassandra concepts based on Dynamo and Bigtable, and patterns and anti-patterns of use. It then delves into Cassandra concepts such as consistent hashing, vector clocks, gossip protocol, hinted handoff, read repair, and consistency levels. It also discusses Bigtable concepts like sparse column-based data model, SSTables, commit log, and memtables. Finally, it outlines several patterns and anti-patterns of Cassandra use.
Cassandra is a distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single points of failure and linear scalability as nodes are added. Cassandra uses a peer-to-peer distributed architecture and tunable consistency levels to achieve high performance and availability without requiring strong consistency. It is based on Amazon's Dynamo and Google's Bigtable papers and provides a combination of their features.
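The consistent-hashing idea mentioned in these summaries can be sketched in a few lines of Python. This is a simplified illustration only, not Cassandra's actual implementation: the real partitioner is Murmur3 over a 128-bit token range with vnodes, whereas here MD5 and one token per node stand in for it.

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in for Cassandra's Murmur3 partitioner (illustrative only)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # One token per node for simplicity; real clusters use many vnodes
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        # Walk clockwise from the key's position, collecting RF distinct nodes
        i = bisect.bisect(self.tokens, (token(key),))
        result = []
        while len(result) < self.rf:
            node = self.tokens[i % len(self.tokens)][1]
            if node not in result:
                result.append(node)
            i += 1
        return result

ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
```

Because placement depends only on the key's token and the sorted token list, every coordinator computes the same replica set without coordination, which is what lets the peer-to-peer architecture avoid a single point of failure.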
Understanding Data Consistency in Apache Cassandra | DataStax
This document provides an overview of data consistency in Apache Cassandra. It discusses how Cassandra writes data to commit logs and memtables before flushing to SSTables. It also reviews the CAP theorem and how Cassandra offers tunable consistency levels for both reads and writes. Strategies for choosing consistency levels for writes, such as ANY, ONE, QUORUM, and ALL are presented. The document also covers read repair and hinted handoffs in Cassandra. Examples of CQL queries with different consistency levels are given and information on where to download Cassandra is provided at the end.
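The arithmetic behind those tunable consistency levels is compact: a QUORUM operation must be acknowledged by a strict majority of replicas, floor(RF/2) + 1, and a read is guaranteed to see the latest write when the read and write replica counts overlap. A minimal sketch (the function names are mine, not a driver API):

```python
def quorum(replication_factor: int) -> int:
    # A QUORUM read or write needs a strict majority of replicas
    return replication_factor // 2 + 1

def is_strongly_consistent(write_replicas: int, read_replicas: int, rf: int) -> bool:
    # Reads see the latest write when the two replica sets must overlap
    return write_replicas + read_replicas > rf

rf = 3
q = quorum(rf)  # 2 of 3 replicas
strong = is_strongly_consistent(q, q, rf)  # QUORUM writes + QUORUM reads overlap
```

This is why QUORUM/QUORUM is the usual recipe for strong consistency at RF=3, while ONE/ONE trades that guarantee for lower latency and higher availability.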
We run multiple DataStax Enterprise clusters in Azure, each holding 300 TB+ of data, to deeply understand Office 365 users. In this talk, we will deep dive into some of the key challenges we faced and takeaways from running these clusters reliably over a year. To name a few: process crashes, ephemeral SSDs contributing to data loss, slow streaming between nodes, mutation drops, compaction strategy choices, schema updates when nodes are down, and backup/restore. We will briefly talk about our contributions back to Cassandra, and our path forward using network-attached disks offered via Azure premium storage.
About the Speaker
Anubhav Kale Sr. Software Engineer, Microsoft
Anubhav is a senior software engineer at Microsoft. His team is responsible for building a big data platform using Cassandra, Spark, and Azure to generate per-user insights about Office 365 users.
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui, ...) | DataStax
EmoDB is an open source RESTful data store built on top of Cassandra that stores JSON documents and, most notably, offers a databus that allows subscribers to watch for changes to those documents in real time. It features massive non-blocking global writes, asynchronous cross-data-center communication, and schema-less JSON content.
For non-blocking global writes, we created a "JSON delta" specification that defines incremental updates to any JSON document. Each row in Cassandra is thus a sequence of deltas that serves as a Conflict-free Replicated Data Type (CRDT) for EmoDB's system of record. We introduce the concept of "distributed compactions" to frequently compact these deltas for efficient reads.
Finally, the databus forms a crucial piece of our data infrastructure and offers a change queue to real time streaming applications.
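A toy version of the delta-and-compaction idea described above can be sketched as follows. The delta format here is hypothetical (a flat dict merge, with None deleting a field), not EmoDB's actual JSON delta specification: each write appends a partial update, a read folds the deltas in order, and a "compaction" collapses the chain into a single snapshot delta without changing the resolved document.

```python
def apply_delta(doc: dict, delta: dict) -> dict:
    # Merge a partial update into the document; None deletes a field
    out = dict(doc)
    for key, value in delta.items():
        if value is None:
            out.pop(key, None)
        else:
            out[key] = value
    return out

def resolve(deltas) -> dict:
    # A row is a sequence of deltas; the current document is their fold
    doc = {}
    for delta in deltas:
        doc = apply_delta(doc, delta)
    return doc

def compact(deltas):
    # Replace the delta chain with one snapshot delta, keeping reads cheap
    return [resolve(deltas)]

row = [{"name": "ada"}, {"rating": 5}, {"rating": 4, "verified": True}]
assert resolve(row) == resolve(compact(row))
```

Because the fold is deterministic, any replica that eventually receives the same delta sequence resolves to the same document, which is the CRDT property that makes the non-blocking global writes safe.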
About the Speaker
Fahd Siddiqui Lead Software Engineer, Bazaarvoice
Fahd Siddiqui is a Lead Software Engineer at Bazaarvoice on the data infrastructure team. His interests include highly scalable and distributed data systems. He holds a Master's degree in Computer Engineering from the University of Texas at Austin, and frequently speaks at the Austin C* User Group. About Bazaarvoice: Bazaarvoice is a network that connects brands and retailers to the authentic voices of people where they shop. More at www.bazaarvoice.com
This presentation will investigate how using micro-batching for submitting writes to Cassandra can improve throughput and reduce client application CPU load.
Micro-batching combines writes for the same partition key into a single network request and ensures they hit the "fast path" for writes on a Cassandra node.
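The grouping step described above can be sketched in Python. This is an illustration of the idea only, assuming plain tuples for pending writes; a real client would wrap each group in an unlogged batch statement via its driver before sending it to the coordinator.

```python
from collections import defaultdict

def micro_batch(writes, max_batch=10):
    # Group pending writes by partition key so each batch touches one partition
    by_partition = defaultdict(list)
    for partition_key, statement in writes:
        by_partition[partition_key].append(statement)

    # Emit (partition_key, [statements]) batches, capped at max_batch each
    batches = []
    for pk, stmts in by_partition.items():
        for i in range(0, len(stmts), max_batch):
            batches.append((pk, stmts[i:i + max_batch]))
    return batches

writes = [("user:1", "w1"), ("user:2", "w2"), ("user:1", "w3")]
batches = micro_batch(writes)
```

Keeping each batch to a single partition is the point: the coordinator can apply it as one mutation on the owning replicas instead of fanning out across partitions, which is what the abstract calls the write "fast path".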
About the Speaker
Adam Zegelin Technical Co-founder, Instaclustr
As Instaclustr's founding software engineer, Adam provides the foundational knowledge of our capability and engineering environment. He delivers business-focused value to our code base and overall capability architecture. Adam also focuses on Instaclustr's contribution to the broader open source community on which our products and services rely, including Apache Cassandra, Apache Spark, and other technologies such as CoreOS and Docker.
Cassandra Troubleshooting for 2.1 and later | J.B. Langston
Troubleshooting Cassandra 2.1: A Guided Tour of nodetool and system.log. From Cassandra Summit 2015. Download and check out the presenter notes for tips!
I’ll give a general lay of the land for troubleshooting Cassandra. Then I’ll take you on a deep dive through nodetool and system.log and give you a guided tour of the useful information they provide for troubleshooting. I’ll devote special attention to monitoring the various processes that Cassandra uses to do its work and how to effectively search for information about specific error messages online.
Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013) | Cyrille Le Clerc
Fast feedback from monitoring is key to Continuous Delivery. JMX is the right Java API for it, but it has unfortunately remained underused and underappreciated because it was difficult to connect to monitoring and graphing systems.
Throw the poor solutions based on log files and weakly secured web interfaces into the sin bin! A new generation of open source tooling makes it easy to graph Java application metrics and integrate them with traditional monitoring systems like Nagios.
Following the logic of DevOps, we will look together at how best to integrate the monitoring dimension into a project: from design to development, to QA, and finally to production, both on traditional deployments and in the Cloud.
Come and discover how the JmxTrans-Graphite combination can make your life easier.
The document provides an introduction to Cassandra presented by Nick Bailey. It discusses key Cassandra concepts like cluster architecture, data modeling using CQL, and best practices. Examples are provided to illustrate how to model time-series data and denormalize schemas to support different queries. Tools for testing Cassandra implementations like CCM and client drivers are also mentioned.
Anti-entropy repairs are known to be a peculiar maintenance operation on Cassandra clusters. They are problematic mostly because they can negatively impact the cluster's performance, and because managing repairs carefully enough to prevent that impact is difficult.
Based on the long-term pain we have been experiencing with managing repairs of nearly 100 Cassandra clusters, and being unable to find a solution that would meet our needs, we went ahead and developed an open-source tool, named Cassandra Reaper [1], for easy management of Cassandra repairs.
Cassandra Reaper is a tool that automates the management of anti-entropy repairs of Cassandra clusters in a rather smart, efficient and careful manner while requiring minimal Cassandra expertise.
I will cover some basics of Cassandra's eventual consistency mechanisms, then focus on the features of Cassandra Reaper and our six months of experience having the tool manage the repairs of our production clusters.
I don't think it's hyperbole when I say that Facebook, Instagram, Twitter & Netflix now define the dimensions of our social & entertainment universe. But what kind of technology engines purr under the hoods of these social media machines?
Here is a tech student's perspective on making the paradigm shift to "Big Data" using innovative models: alphabet blocks, nesting dolls, & LEGOs!
Get info on:
- What is Cassandra (C*)?
- Installing C* Community Version on Amazon Web Services EC2
- Data Modelling & Database Design in C* using CQL3
- Industry Use Cases
VisualOps is a Platform as a Service (PaaS) that aims to simplify operations tasks like deploying code, configuring infrastructure, scaling, and handling failures. It provides an IDE interface for visually designing infrastructure as code using reusable components. The designs are rendered into recipes that agents on instances use to automate configuration via SaltStack. VisualOps handles dependencies and ensures environments always match their designs. It improves on other PaaS options by allowing more customization of architectures, integration of resources, and deploying associated environments together.
This document summarizes a presentation about performance management tools. It discusses the different viewpoints of management, engineering, QA testing, and operations when it comes to performance monitoring. It also provides overviews of various free network monitoring, system monitoring, load generation, and data collection tools that can be used for capacity planning and performance management, including SNMP tools like Nagios, OpenNMS, Zenoss, and ntop as well as the RRDtool, SE Toolkit, and Cacti.
AWS: Architecting for resilience & cost at scaleJos Boumans
As anyone using AWS will be able to tell you, there are good parts, and there are bad ones. If you come from a datacenter background, you are most definitely not in Kansas anymore, and we had our share of learning experiences as a result.
This is the story of all the pitfalls we encountered, and how, through architecture, convention and common sense, we managed to build an infrastructure that is "Always Up" from the end user perspective and incredibly economical to build, scale and operate.
The talk will focus on leveraging the strong/economical points of AWS, while avoiding the weak/expensive ones. I'll give a break down of the pain points, how we managed them and how we avoided painting ourselves in a corner accidentally.
For many companies starting today, success is defined by large traffic or user numbers; if you are one of those companies, these lessons will very likely save you significant operational headaches.
Devoxx UK: Reliability & Scale in AWS while letting you sleep through the night Jos Boumans
Updated version of Reliability & Scale in AWS while letting you sleep through the night
===============================================================
More and more startups/companies are deploying their infrastructure directly and exclusively in EC2 or similar cloud provider. With that comes a whole new set of challenges and paradigms around scalability, reliability and availability.
This talk will focus on how to leverage all the infrastructure parts of AWS, augment them with great (affordable) third party services and solid Open Source Software to create an operations environment that will scale with you, be as reliable as it can be, providing you and your peers with all the data you need to make good decisions to support (rapid) changes while letting you sleep through the night. And all that using a tiny operations team.
It may make you coffee in the morning too.
Krux operates a large infrastructure serving thousands of user requests per second. They use Puppet and tools like Cloudkick, Foreman, Boto, and Vagrant to manage their infrastructure in an automated and scalable way. Their Puppet configuration is split into modules, environments, and datacenters. They launch AWS nodes programmatically and configure them with Puppet. Cloudkick is used for monitoring and parallel SSH. Boto allows full Python API access to AWS. Vagrant allows consistently provisioning development machines locally. Automation and external configuration enable their small operations team to manage a large, dynamic infrastructure.
The document discusses Cassandra compaction, which merges SSTables to remove duplicate or overwritten data and dropped deleted data. It describes different types of compactions like minor, major, and single SSTable compactions. Compaction strategies like SizeTieredCompactionStrategy and LeveledCompactionStrategy are covered, which determine what SSTables to compact. The code walkthrough explains classes involved in compaction like CompactionManager, CompactionTask, CompactionController, and how iterators are used to merge cells and partitions during compaction.
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr... DataStax Academy
Video: http://youtu.be/B-bTPSwhsDY
Abstract
Patrick McFadin (@PatrickMcFadin), Chief Evangelist for Apache Cassandra at DataStax, will be presenting an introduction to Cassandra as a key player in database technologies. Both large and small companies alike chose Apache Cassandra as their database solution and Patrick will be presenting on why they made that choice.
Patrick will also be discussing Cassandra's architecture, including: data modeling, time-series storage and replication strategies, providing a holistic overview of how Cassandra works and the best way to get started.
About Patrick McFadin
Prior to working for DataStax, Patrick was the Chief Architect at Hobsons, an education services company. His responsibilities included ensuring product availability and scaling for all higher education products. Prior to this position, he was the Director of Engineering at Hobsons, which he came to after they acquired his company, Link-11 Systems, a software services company. While at Link-11 Systems, he built the first widely popular CRM system for universities, Connect. He obtained a BS in Computer Engineering from Cal Poly, San Luis Obispo and holds the distinction of being the only recipient of a medal (as anyone can find out) for hacking while serving in the US Navy.
Cassandra Summit 2014: Cassandra at Instagram 2014 DataStax Academy
Presenter: Rick Branson, Infrastructure Engineer at Instagram
As Instagram has scaled to over 200 million users, so has our use of Cassandra. We've built new features and rebuilt old on Cassandra, and it's become an extremely mission-critical foundation of our production infrastructure. Rick will deliver a refresh of our use cases and go deep on the technical challenges we faced during our expansion.
Cassandra Day Denver 2014: Introduction to Apache Cassandra DataStax Academy
Speaker: Jon Haddad, Technical Evangelist for Apache Cassandra at DataStax
This is a crash course introduction to Cassandra. You'll step away understanding how it's possible to utilize this distributed database to achieve high availability across multiple data centers, scale out as your needs grow, and not be woken up at 3am just because a server failed. We'll cover the basics of data modeling with CQL, and understand how that data is stored on disk. We'll wrap things up by setting up Cassandra locally, so bring your laptops!
The document provides an overview of Apache Cassandra's architecture and design. It was created to address the needs of building reliable, high-performing, and always-available distributed databases. Cassandra is based on Dynamo and BigTable and uses a distributed hashing technique to partition and replicate data across nodes. It supports configurable replication across multiple data centers for high availability. Writes are sent to the local node and replicated to other nodes based on consistency level, while reads can be served from any replica.
Compaction in Apache Cassandra is the process of merging SSTables to reclaim disk space used by deleted or overwritten data. It occurs automatically in the background after memtables are flushed to disk or manually via nodetool. There are minor, major, and single-SSTable compactions. The compaction strategy, such as size-tiered, leveled, or date-tiered, determines how SSTables are merged.
Cacti is an open source software that uses RRDTool to graph and store time-series data from data sources like SNMP. It stores data in a MySQL database and uses PHP to provide a frontend interface for creating graphs, templates, and managing users. Cacti supports unlimited graph items, auto-padding, custom data gathering scripts, and SNMP to monitor network traffic and system metrics over time through graphs. It also provides features like data source templates, host templates, and user management to scale monitoring of large networks.
Cassandra Day SV 2014: Beyond Read-Modify-Write with Apache Cassandra DataStax Academy
This document discusses strategies for updating data in Apache Cassandra beyond using read-modify-write operations. It describes how eventual consistency allows safe updates without locking by propagating changes asynchronously. It also covers Cassandra features like collections, lightweight transactions, and content-addressable storage that provide flexible data models for modern web-scale applications while avoiding the need for read-modify-write in many cases.
To understand whether Cassandra is a good fit for a given use case, it is necessary to have a rudimentary understanding of the read and write mechanisms employed inside the database itself.
This presentation will lay out the different parts of the read and write paths in Cassandra, and will show how they all fit together. We will also briefly discuss the new changes introduced in the first 3.0 alpha1 Cassandra release.
Cassandra & Python - Springfield MO User Group Adam Hutson
Adam Hutson gave an overview of Cassandra and how to use it with Python. Key points include:
- Cassandra is a distributed database with no single point of failure and linear scalability. It favors availability over consistency.
- The Python driver allows connecting to Cassandra clusters and executing queries using prepared statements, batches, and custom consistency levels.
- Best practices include reusing a single session object, specifying keyspaces, authorizing connections, and shutting down clusters to avoid resource leaks.
This document provides an overview of Cassandra, a decentralized, distributed database management system. It discusses why the author's company chose Cassandra over other options like HBase and MySQL for their real-time data needs. The document then covers Cassandra's data model, architecture, data partitioning, replication, and other key aspects like writes, reads, deletes, and compaction. It also notes some limitations of Cassandra and provides additional resource links.
This document provides an overview of the Cassandra NoSQL database. It begins with definitions of Cassandra and discusses its history and origins from projects like Bigtable and Dynamo. The document outlines Cassandra's architecture including its peer-to-peer distributed design, data partitioning, replication, and use of gossip protocols for cluster management. It provides examples of key features like tunable consistency levels and flexible schema design. Finally, it discusses companies that use Cassandra like Facebook and provides performance comparisons with MySQL.
Breakthrough OLAP performance with Cassandra and Spark Evan Chan
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
The document discusses Cassandra's data model and how it replaces HDFS services. It describes:
1) Two column families - "inode" and "sblocks" - that replace the HDFS NameNode and DataNode services respectively, with "inode" storing metadata and "sblocks" storing file blocks.
2) CFS reads involve reading the "inode" info to find the block and subblock, then directly accessing the data from the Cassandra SSTable file on the node where it is stored.
3) Keyspaces are containers for column families in Cassandra, and the NetworkTopologyStrategy places replicas across data centers to enable local reads and survive failures.
Design Patterns for Distributed Non-Relational Databases guestdfd1ec
The document discusses design patterns for distributed non-relational databases, including consistent hashing for key placement, eventual consistency models, vector clocks for determining history, log-structured merge trees for storage layout, and gossip protocols for cluster management without a single point of failure. It raises questions to ask presenters about scalability, reliability, performance, consistency models, cluster management, data models, and real-life considerations for using such systems.
This document discusses Cassandra and Hadoop. It describes how Netflix used Cassandra to store user and usage data across multiple data centers and Amazon Web Services regions. Cassandra provided fast writes and reads at scale. The document also discusses how Cassandra can be used as the data store for Hadoop, providing analytics on logs and metrics data. Cassandra offers operational simplicity and high availability through its peer-to-peer and tunable consistency models.
Apache Cassandra is a highly scalable, distributed database designed to handle large amounts of data across many servers with no single point of failure. It uses a peer-to-peer distributed system where data is replicated across multiple nodes for availability even if some nodes fail. Cassandra uses a column-oriented data model with dynamic schemas and supports fast writes and linear scalability.
Cassandra is a highly scalable, distributed, and fault-tolerant NoSQL database. It partitions data across nodes through consistent hashing of row keys, and replicates data for fault tolerance based on a replication factor. Cassandra provides tunable consistency levels for reads and writes. It uses a gossip protocol for node discovery and a commit log with memtables and SSTables for write durability and reads.
Cassandra is a highly scalable, distributed, and fault-tolerant NoSQL database. It partitions data across nodes through consistent hashing of row keys, and replicates data for fault tolerance based on a replication factor. Cassandra provides tunable consistency levels for reads and writes. It uses a gossip protocol for node discovery and a commit log for write durability.
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque... Data Con LA
Scylla is a new, open-source NoSQL data store with a novel design optimized for modern hardware, capable of 1.8 million requests per second per node, while providing Apache Cassandra compatibility and scaling properties. While conventional NoSQL databases suffer from latency hiccups, expensive locking, and low throughput due to low processor utilization, the Scylla design is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC. The result is a NoSQL database that delivers an order of magnitude more performance, with less performance tuning needed from the administrator.
With extra performance to work with, NoSQL projects can have more flexibility to focus on other concerns, such as functionality and time to market. Come for the tech details on what Scylla does under the hood, and leave with some ideas on how to do more with NoSQL, faster.
Speaker bio
Don Marti is technical marketing manager for ScyllaDB. He has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen. Don is a strategic advisor for Mozilla, and has previously served as president and vice president of the Silicon Valley Linux Users Group and on the program committees for Uselinux, Codecon, and LinuxWorld Conference and Expo.
This document provides an overview of Cassandra, a NoSQL database. It discusses that Cassandra is an open source, distributed database designed to handle large amounts of structured data across nodes. The document outlines Cassandra's architecture, which involves distributing data across peer nodes so that there is no single point of failure. It also discusses Cassandra's data model, including keyspaces, column families, and the use of the Cassandra Query Language to define schemas, insert and query data. In closing, the document notes that Cassandra is well-suited for applications that require scaling to handle large, variable workloads across data centers with high performance and availability.
Basics of Distributed Systems - Distributed Storage Nilesh Salpe
The document discusses distributed systems. It defines a distributed system as a collection of computers that appear as one computer to users. Key characteristics are that the computers operate concurrently but fail independently and do not share a global clock. Examples given are Amazon.com and Cassandra database. The document then discusses various aspects of distributed systems including distributed storage, computation, synchronization, consensus, messaging, load balancing and serialization.
This document discusses Apache Cassandra, a distributed database management system designed to handle large amounts of data across many commodity servers. It summarizes Cassandra's origins from Amazon Dynamo and Google Bigtable, describes its data model and client APIs. The document also provides examples of using Cassandra and discusses considerations around operations and performance.
SFJava, SFNoSQL, SFMySQL, Marakana & Microsoft come together for a presentation evening of three NoSQL technologies - Apache Cassandra, Mongodb, Hadoop.
This talk lays out a few talking points for Apache Cassandra.
This document provides an overview of Apache Cassandra including its history, architecture, data modeling concepts, and how to install and use it with Python. Key points include that Cassandra is a distributed, scalable NoSQL database designed without single points of failure. It discusses Cassandra's architecture including nodes, datacenters, clusters, commit logs, memtables, and SSTables. Data modeling concepts explained are keyspaces, column families, and designing for even data distribution and minimizing reads. The document also provides examples of creating a keyspace, reading data using Python driver, and demoing data clustering.
Apache Cassandra is a scalable distributed hash map that stores data across multiple commodity servers. It provides high availability with no single point of failure and scales horizontally as more servers are added. Cassandra uses an eventually consistent model and tunable consistency levels. Data is organized into keyspaces containing column families with rows and columns.
This document provides an overview of Cassandra, including its data model, APIs, architecture, partitioning, replication, consistency, failure handling, and local persistence. Cassandra is a distributed database modeled after Amazon's Dynamo and Google's Bigtable. It uses a gossip-based protocol for cluster management and provides tunable consistency levels.
This document discusses Cassandra and how it is used for various use cases including storing user and device data at large internet companies. Cassandra provides simple operational models and high availability across multiple data centers and regions. It also integrates with Hadoop for analytics workloads where data is stored in Cassandra and processed by Hadoop tools. The community around Cassandra continues to enhance it with new features and more robust support.
1. Cassandra 101 for System Administrators
[email protected]
twitter.com/NathanMilford
http://blog.milford.io
2. What is Cassandra?
• It is a distributed, columnar database.
• Originally created at Facebook in 2008, is now a top level
Apache Project.
• Combines the best features of Amazon's Dynamo
(replication, mostly) and Google's Big Table (data model).
6. @
Rocking ~30 Billion impressions a month, like a bawse.
Used for semi-persistent storage of recommendations.
• 14 nodes in two data centers.
• Dell R610, 8 cores, 32G of RAM, 6 x 10K SAS drives.
• Using 0.8 currently, just upgraded. In production since 0.4.
• We use Hector.
• ~70-80G per node, ~550G dataset unreplicated.
• RP + OldNTS @ RF2. PropFileSnitch, RW @ CL.ONE.
Excited for NTS
Excited for TTLs!
7. How We Use Cassandra

+-------------+
|   Tomcat    |-----------+    Tomcat serves recs from Memcached.
+------+------+           |
       ^                  |    Flume ships logs to the Hadoop DWH.
+------+------+         logs   A bunch of algos run against the log data.
|  Memcached  |         via    Results are crammed into Cassandra.
+------+------+        Flume   Keyspace per algo.
       ^                  |
+------+------+           |    CacheWarmer sources recs from Cassandra
| CacheWarmer |           |    (and other sources) and dumps them in
+------+------+           |    Memcached.
       ^                  |
+------+------+           |
|  Cassandra  |           |
+------+------+           |
       ^                  |
+------+------+           |
| Hadoop/Hive |<----------+
+-------------+

* Simplified Workflow
8. Before I get too technical, relax.
The following slides may sound complex at first, but at the
end of the day, to get your feet wet all you need to do is:
yum install / apt-get install
Define a Seed.
Define Strategies.
service cassandra start
Go get a beer.
In my experience, once the cluster has been set up there is not much else to do other than occasional tuning as you learn how your data behaves.
9. Why Cassandra?
• Minimal Administration.
• No Single Point of Failure.
• Scales Horizontally.
• Writes are durable.
• Consistency is tunable as needed on reads and writes.
• Schema is flexible, can be updated live.
• Handles failure gracefully, Cassandra is crash-only.
• Replication is easy, Rack and Datacenter aware.
10. Data Model
Keyspace = {
Column Family: {
Row Key: {
Column Name: "Column Value"
Column Name: "Column Value"
}
}
}
A Keyspace is a container for Column Families. Analogous to a database in MySQL.
A Column Family is a container for a group of columns. Analogous to a table in MySQL.
A Column is the basic unit: key, value and timestamp.
A Row is a smattering of column data.
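The nesting above can be sketched as plain Python dictionaries, with each column carrying a (value, timestamp) pair. The keyspace, row and column names here are made up for illustration:

```python
# Illustrative sketch of Cassandra's pre-CQL data model as nested dicts.
# Keyspace -> Column Family -> Row Key -> Column Name -> (value, timestamp)
keyspace = {
    "recommendations": {                        # Column Family (like a table)
        "user:42": {                            # Row Key
            "rec_1": ("item_977", 1300000001),  # Column: name, value, timestamp
            "rec_2": ("item_311", 1300000002),
        }
    }
}

def get_column(ks, cf, row, col):
    """Fetch a single column's value, ignoring the timestamp."""
    return ks[cf][row][col][0]

print(get_column(keyspace, "recommendations", "user:42", "rec_1"))
```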
11. Gossip
In config, define seed(s).
Used for intra-cluster
communication.
Cluster self-assembles.
Works with failure detection.
Routes client requests.
12. Pluggable Partitioning
RandomPartitioner (RP)
Orders by MD5 of the key.
Most common.
Distributes relatively evenly.
There are others, but you probably will not use them.
13. Distributed Hash Table: The Ring
For Random Partitioner:
• The ring is made up of a range from 0 to 2**127.
• Token is MD5(Key).
• Each node is given a slice of the ring:
  o An initial token is defined; the node owns that token up to the next node's initial token.
Rock your tokens here:
http://blog.milford.io/cassandra-token-calculator/
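A rough sketch of the token math, assuming the RandomPartitioner convention above (MD5 of the key, ring from 0 to 2**127) and evenly spaced initial tokens like the linked calculator produces:

```python
import hashlib

RING_SIZE = 2 ** 127  # RandomPartitioner token space: 0 .. 2**127

def token_for_key(key: bytes) -> int:
    """RandomPartitioner-style token: MD5 of the key, folded into the ring."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING_SIZE

def initial_tokens(node_count: int):
    """Evenly spaced initial tokens, one per node."""
    return [i * RING_SIZE // node_count for i in range(node_count)]

print(initial_tokens(4))                       # four evenly spaced tokens
print(0 <= token_for_key(b"user:42") < RING_SIZE)
```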
14. Pluggable Topology Discovery
Cassandra needs to know about your network to direct
replica placement. Snitches inform Cassandra about it.
SimpleSnitch
Default, good for 1 data center.
RackInferringSnitch
Infers location from the IP's octets.
10.D.R.N (Data center, Rack, Node)
PropertyFileSnitch
cassandra-topology.properties
IP=DC:RACK (arbitrary values)
10.10.10.1=NY1:R1
EC2Snitch
Discovers AWS AZ and Regions.
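The RackInferringSnitch rule above fits in a few lines; the "DC"/"RAC" name prefixes here are illustrative, not Cassandra's exact output:

```python
def rack_inferring_snitch(ip: str):
    """Infer (datacenter, rack) from an IPv4 address the way
    RackInferringSnitch does: in 10.D.R.N, the 2nd octet is the
    data center and the 3rd octet is the rack."""
    octets = ip.split(".")
    return f"DC{octets[1]}", f"RAC{octets[2]}"

print(rack_inferring_snitch("10.20.30.40"))  # ('DC20', 'RAC30')
```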
15. Pluggable Replica Placement
SimpleStrategy
Places replicas in the adjacent nodes on the ring.
NetworkTopologyStrategy
Used with property file snitch.
Explicitly pick how replicas are placed.
strategy_options = [{NY1:2, LA1:2}];
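A toy sketch of SimpleStrategy's placement, following the slide's convention that a node owns its initial token up to the next node's token; the ring layout and node names are invented:

```python
from bisect import bisect_right

def simple_strategy_replicas(ring, key_token, rf):
    """SimpleStrategy sketch: the owner is the node whose initial token is
    the greatest token <= key_token; replicas are the owner plus the next
    rf-1 adjacent nodes clockwise. `ring` is a sorted list of (token, node)."""
    tokens = [t for t, _ in ring]
    owner = (bisect_right(tokens, key_token) - 1) % len(ring)
    return [ring[(owner + i) % len(ring)][1] for i in range(rf)]

ring = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]  # sorted (token, node)
print(simple_strategy_replicas(ring, 30, 2))  # ['B', 'C'] -- adjacent nodes
print(simple_strategy_replicas(ring, 80, 2))  # ['D', 'A'] -- wraps the ring
```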
16. Reading & Writing
The old method uses Thrift, which is usually abstracted using client APIs (e.g. Hector, PyCassa, PHPCass).
Now we have CQL and JDBC!
SELECT * FROM ColumnFamily WHERE rowKey='Name';
Since all nodes are equal, you can read and write to any node.
The node you connect to becomes a Coordinator for that
request and routes your data to the proper nodes.
Connection pooling to nodes is sometimes handled by the API
Framework, otherwise use RRDNS or HAProxy.
17. Tunable Consistency
It is difficult to keep replicas of data consistent across nodes, let alone across continents.
In any distributed system you have to make tradeoffs between how consistent your dataset is versus how available it is and how tolerant the system is of partitions (a.k.a. the CAP theorem).
Cassandra chooses to focus on making the data available and partition tolerant, and empowers you to choose how consistent you need it to be.
Cassandra is awesomesauce because you choose what is more important to your query: consistency or latency.
18. Per-Query Consistency Levels
Latency increases the more nodes you have to involve.
ANY: For writes only. Writes to any available node and expects Cassandra to sort it out. Fire and forget.
ONE: Reads or writes to the closest replica.
QUORUM: Writes to half+1 of the appropriate replicas before the operation is successful. A read is successful when half+1 replicas agree on a value to return.
LOCAL_QUORUM: Same as above, but only to the local datacenter in a multi-datacenter topology.
ALL: For writes, all replicas need to ack the write. For reads, returns the record with the newest timestamp once all replicas reply. In both cases, if we're missing even one replica, the operation fails.
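The levels above reduce to a little arithmetic. This sketch (which simplifies ANY, since in reality it can be satisfied by a hint alone) also shows the classic rule that reads plus writes touching more than RF replicas guarantees overlap:

```python
def acks_required(level: str, rf: int) -> int:
    """Replica acknowledgements needed per consistency level (sketch;
    ANY is simplified to one ack)."""
    return {"ANY": 1, "ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

def read_write_consistent(read_cl: str, write_cl: str, rf: int) -> bool:
    """R + W > RF guarantees a read overlaps the latest acknowledged write."""
    return acks_required(read_cl, rf) + acks_required(write_cl, rf) > rf

print(acks_required("QUORUM", 3))                 # 2
print(read_write_consistent("QUORUM", "QUORUM", 3))  # True
print(read_write_consistent("ONE", "ONE", 3))        # False
```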
19. Cassandra Write Path
Cassandra identifies which node owns the token you're trying to
write based on your partitioning, replication and placement
strategies.
Data Written to CommitLog
Sequential writes to disk, kinda like a MySQL binlog.
Mostly written to, is only read from upon a restart.
Data Written to Memtable.
Acts as a Write-back cache of data.
When the Memtable hits a (configurable) threshold, it is flushed to disk
as an SSTable. An SSTable (Sorted String Table) is an immutable file on
disk. More on compaction later.
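The write path can be sketched with plain files; this is only an illustration of the ordering, not the real on-disk format:

```shell
tmp=$(mktemp -d)
# 1. Every write is appended to the commit log (sequential disk I/O)
#    and applied to the in-memory Memtable.
echo "k2:v2" >> "$tmp/CommitLog"
echo "k1:v1" >> "$tmp/CommitLog"
# 2. On flush, the Memtable is written out sorted by key as an
#    immutable SSTable; the commit log is only replayed on restart.
sort "$tmp/CommitLog" > "$tmp/SSTable-1"
cat "$tmp/SSTable-1"
# prints: k1:v1 then k2:v2
```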
20. Cassandra Read Path
Cassandra identifies which node owns the token you're trying to
read based on your partitioning, replication and placement
strategies.
First checks the Bloom filter, which can save us some time.
A space-efficient structure that tests if a key is on the node.
False positives are possible.
False negatives are impossible.
Then checks the index.
Tells us which SStable file the data is in.
And how far into the SStable file to look so we don't need to
scan the whole thing.
21. Distributed Deletes
Hard to delete stuff in a distributed system.
Difficult to keep track of replicas.
SSTables are immutable.
Deleted items are tombstoned (marked for deletion).
Data still exists, just can't be read by API.
Cleaned out during major compaction, when SSTables are
merged/remade.
22. Compaction
• When you have enough disparate SSTable files taking
up space, they are merge sorted into single SSTable
files.
• An expensive process (lots of GC, can eat up half of your
disk space)
• Tombstones discarded.
• Manual or automatic.
• Pluggable in 1.0.
• Leveled Compaction in 1.0
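Since SSTables are sorted and immutable, compaction is essentially a merge sort that drops tombstones along the way; a toy sketch:

```shell
tmp=$(mktemp -d)
printf 'a:1\nc:3\n' > "$tmp/sstable1"          # two sorted "SSTables"
printf 'b:TOMBSTONE\nd:4\n' > "$tmp/sstable2"
# Merge the pre-sorted files and discard tombstoned rows.
sort -m "$tmp/sstable1" "$tmp/sstable2" | grep -v TOMBSTONE > "$tmp/merged"
cat "$tmp/merged"
# prints: a:1, c:3, d:4 (one sorted file, tombstone gone)
```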
23. Repair
Anti-Entropy and Read Repair
During node repair and QUORUM & ALL reads,
ColumnFamilies are compared with replicas and
discrepancies resolved.
Put manual repair in cron to run at an interval <= the
value of GCGraceSeconds to catch old tombstones, or
risk forgotten deletes.
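A minimal sketch of such a cron entry; the host, port, and schedule are illustrative, and the weekly run assumes the default GCGraceSeconds of 864000 seconds (10 days):

```
# /etc/cron.d/cassandra-repair -- weekly repair, safely inside the
# default GCGraceSeconds window of 864000 seconds (10 days).
0 2 * * 0  root  nodetool -h localhost -p 8080 repair
```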
Hinted Handoff
If a node is down, writes spool on other nodes and are
handed off when it comes back.
Sometimes left off, since a returning node can get
flooded.
24. Caching
Key Cache
Puts the location of keys in memory.
Improves seek times for keys on disk.
Enabled per ColumnFamily.
On by default at 200,000 keys.
Row Cache
Keeps full rows of hot data in memory.
Enabled per ColumnFamily.
Skinny rows are more efficient.
Row Cache is consulted first,
then the Key Cache
Will require a bit of tuning.
25. Hardware
RAM: Depends on use. Stores some
objects off Heap.
CPU: More cores the better.
Cassandra is built with concurrency in mind.
Disk: Cassandra tries to minimize random IO. Minimum of 2
disks. Keep CommitLog and Data on separate spindles.
RAID10 or RAID0 as you see fit. I set mine up thus:
1 disk = OS + CommitLog, and RAID10 = Data SSTables
Network: 1 x 1GigE is fine, the more the better; Gossip and
Data can be defined on separate interfaces.
26. What about 'Cloud' environments?
EC2Snitch
• Maps EC2 Regions to DCs
• Maps EC2 Availability Zones to Racks
• Use NetworkTopologyStrategy
Avoid EBS. Use RAID0/RAID10 across ephemeral drives.
Replicate across Availability Zones.
Netflix is moving to 100% Cassandra on EC2:
http://www.slideshare.net/adrianco/migrating-netflix-from-
oracle-to-global-cassandra
27. Installing
RedHat
rpm -i http://rpm.datastax.com/EL/6/x86_64/riptano-release-5-1.el6.noarch.rpm
yum -y install apache-cassandra
Debian
Add to /etc/apt/sources.list
deb http://www.apache.org/dist/cassandra/debian unstable main
deb-src http://www.apache.org/dist/cassandra/debian unstable main
wget http://www.apache.org/dist/cassandra/KEYS -O- | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra
29. Hot Tips
• Use Sun/Oracle JVM (1.6 u22+)
• Use JNA Library.
o Keep disk_access_mode as auto.
o BTW, it is not using all your RAM; it behaves like the FS cache.
• Don't use autobootstrap, specify initial token.
• Super columns impose a performance penalty.
• Enable GC logging in cassandra-env.sh
• Don't use a large heap. (Yay off-heap caching!)
• Don't use swap.
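For the GC-logging tip, a sketch of lines to append to cassandra-env.sh; the flags are standard HotSpot options, and the log path is an assumption:

```shell
# Append to conf/cassandra-env.sh to get verbose GC logs.
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
echo "$JVM_OPTS"
```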
30. Monitoring
Install MX4J jar into class path or ping JMX directly.
curl | grep | awk it into Nagios, Ganglia, Cacti or what have you.
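A sketch of that curl | grep | awk shape against a fake metric line (the real MX4J output format differs; the attribute name and values here are just illustrative):

```shell
# Pretend this line came from curl-ing an MX4J HTTP endpoint.
echo 'HeapMemoryUsage committed=1073741824 used=536870912 max=2147483648' \
  | grep HeapMemoryUsage \
  | awk '{ for (i = 1; i <= NF; i++)
             if ($i ~ /^used=/) { sub(/^used=/, "", $i); print $i } }'
# prints: 536870912
```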
31. What to Monitor
Heap Size and Usage
CompactionStage
Garbage Collections
Compaction Count
IO Wait
Cache Hit Rate
RowMutationStage (Writes): Active and Pending
ReadStage (Reads): Active and Pending
32. Adding/Removing/Replacing Nodes
Adding a Node
Calculate new tokens.
Set correct initial token on the new node
Once it has bootstrapped, run nodetool move on the other nodes.
Removing a Node
nodetool decommission drains data to other nodes
nodetool removetoken tells the cluster to get the
data from other replicas (faster, more expensive on live
nodes).
Replacing a Node
Bring up replacement node with same IP and token.
Run nodetool repair.
33. Useful nodetool commands.
nodetool info - Displays node-level info.
nodetool ring - Displays info on nodes on the ring.
nodetool cfstats - Displays ColumnFamily statistics.
nodetool tpstats - Displays what operations Cassandra
is doing right now.
nodetool netstats - Displays streaming information.
nodetool drain - Flushes Memtables to SSTables on disk
and stops accepting writes. Useful before a restart to make
startup quicker (no CommitLog to replay)
39. Backups
Single Node Snapshot
nodetool snapshot
nodetool clearsnapshot
Makes a hardlink of SSTables that you can tarball.
Cluster-wide Snapshot.
clustertool global_snapshot
clustertool clear_global_snapshot
Just does local snapshots on all nodes.
To restore:
Stop the node.
Clear CommitLogs.
Zap *.db files in the Keyspace directory.
Copy the snapshot over from the snapshots subdirectory.
Start the node and wait for load to decrease.
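The restore steps can be sketched against a throwaway mock layout (all paths and file names below are illustrative; real data lives under your data_file_directories, e.g. /var/lib/cassandra/data/<Keyspace>):

```shell
tmp=$(mktemp -d)   # stand-in for the Keyspace data directory
mkdir -p "$tmp/snapshots/1310694600"
touch "$tmp/Users-g-1-Data.db" "$tmp/snapshots/1310694600/Users-g-1-Data.db"
# With the node stopped and CommitLogs cleared:
rm -f "$tmp"/*.db                                # zap live *.db files
cp "$tmp"/snapshots/1310694600/*.db "$tmp/"      # copy the snapshot back
ls "$tmp"/*.db                                   # then start the node
```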
40. Shutdown Best Practice
While Cassandra is crash-safe, you can make a cleaner
shutdown and save some time during startup thus:
Make other nodes think this one is down.
nodetool -h $(hostname) -p 8080 disablegossip
Wait a few secs, then cut off anyone from writing to this node.
nodetool -h $(hostname) -p 8080 disablethrift
Flush all memtables to disk.
nodetool -h $(hostname) -p 8080 drain
Shut it down.
/etc/init.d/cassandra stop
41. Rolling Upgrades
From 0.7 you can do rolling upgrades. Check for cassandra.yaml changes!
On each node, one by one:
Shutdown as in previous slide, but do a snapshot after draining.
Remove old jars, rpms, debs. Your data will not be touched.
Add new jars, rpms, debs.
/etc/init.d/cassandra start
Wait for the node to come back up and for the other nodes to see it.
When done, before you run repair, on each node run:
nodetool -h $(hostname) -p 8080 scrub
This rebuilds the SSTables to bring them up to date.
It is essentially a major compaction without the merging, so it is a bit
expensive.
Run repair on your nodes to clean up the data.
nodetool -h $(hostname) -p 8080 repair
42. Join Us!
http://www.meetup.com/NYC-Cassandra-User-Group/
These slides can be found here:
http://www.slideshare.net/nmilford/cassandra-for-sysadmins