Covering theory and operational aspects of bringing up Apache Cassandra clusters - this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
Some vignettes and advice based on prior experience with Cassandra clusters in live environments. Includes some material from other operational slides.
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1DataStax Academy
This document summarizes some lesser known features in Apache Cassandra 2.1, including:
1) Cassandra's logging was changed to use Logback, allowing for faster and more configurable logging through a logback.xml file.
2) New default paths were added in Cassandra 2.1 for data, commit logs, and configurations to keep directories cleaner.
3) A number of command line parameters and YAML configuration options were added for more control over logging levels, commit log handling, compaction settings, and more.
4) Enhancements were made to the CQL shell cqlsh and nodetool for additional debugging and management capabilities.
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...DataStax
Successfully running Apache Cassandra in production often means knowing what configuration settings to change and which ones to leave as default. Over the years the cassandra.yaml file has grown to provide a number of settings that can improve stability and performance. While the file contains plenty of helpful comments, there is more to be said about the settings and when to change them.
In this talk Edward Capriolo, Consultant at The Last Pickle, will break down the parameters in the configuration files: those that are essential to getting started, those that impact performance, those that improve availability, the exotic ones, and the ones that should not be played with. This talk is ideal for anyone from someone setting up Cassandra for the first time to people with deployments in production wondering what the more exotic configuration options do.
About the Speaker
Edward Capriolo Consultant, The Last Pickle
Long time Apache Cassandra user, big data enthusiast.
Cassandra Troubleshooting (for 2.0 and earlier)J.B. Langston
I’ll give a general lay of the land for troubleshooting Cassandra. Then I’ll take you on a deep dive through nodetool and system.log and give you a guided tour of the useful information they provide for troubleshooting. I’ll devote special attention to monitoring the various processes that Cassandra uses to do its work and how to effectively search for information about specific error messages online.
This is the old version of this presentation for Cassandra 2.0 and earlier. Check out the updated slide deck for Cassandra 2.1.
This document provides documentation for Percona XtraDB Cluster, an open-source high availability and scalability solution for MySQL users. It includes sections on installation from binaries or source code, key features like high availability and multi-master replication, FAQs, how-tos, limitations, and other documentation. Percona XtraDB Cluster provides synchronous replication across multiple MySQL/Percona Server nodes, allowing for high availability and the ability to write to any node.
Slides from my talk at Cassandra Summit 2016 on troubleshooting Cassandra. This is a reprise of my popular talk from last summit, reorganized, expanded, and updated for Cassandra 3.0. In it I share the secrets I've learned in four years of supporting hundreds of customers using Apache Cassandra and DataStax Enterprise. Be sure to check out presenter notes for additional tips and links to further resources.
DataStax: Extreme Cassandra Optimization: The SequelDataStax Academy
Al has been using Cassandra since version 0.6 and has spent the last few months doing little else but tune Cassandra clusters. In this talk, Al will show how to tune Cassandra for efficient operation using multiple views into system metrics, including OS stats, GC logs, JMX, and cassandra-stress.
MapR clusters disks into storage pools for data distribution. By default, storage pools contain 3 disks each. The mrconfig command can be used to create, remove, and manage storage pools and disks. Each node supports up to 36 storage pools. Zookeeper should always be started before other services and is critical for high availability. Logs are centrally stored for 30 days by default and can be configured through yarn-site.xml.
The document compares two methods for limiting CPU usage of databases on the same server: instance caging and processor_group_name binding. It provides facts about how each method works, observations on performance differences, and examples of customer cases where each method may be best. Instance caging allows limiting CPU count online but the SGA is interleaved, while binding groups databases to specific CPUs requiring a restart but keeps the SGA local. The best choice depends on factors like database count and whether guaranteed CPU resources are needed for some databases.
Cassandra Summit 2014: Performance Tuning Cassandra in AWSDataStax Academy
Presenters: Michael Nelson, Development Manager at FamilySearch
A recent research project at FamilySearch.org pushed Cassandra to very high scale and performance limits in AWS using a real application. Come see how we achieved 250K reads/sec with latencies under 5 milliseconds on a 400-core cluster holding 6 TB of data while maintaining transactional consistency for users. We'll cover tuning of Cassandra's caches, other server-side settings, client driver, AWS cluster placement and instance types, and the tradeoffs between regular & SSD storage.
Apache Cassandra operations have a reputation for being quite simple on single-datacenter clusters and / or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters: basic operations such as repair, compaction or hints delivery can have dramatic consequences even on a healthy cluster.
In this presentation, Julien will go through Cassandra operations in detail: bootstrapping new nodes and / or datacenters, repair strategies, compaction strategies, GC tuning, OS tuning, removal of large batches of data and Apache Cassandra upgrade strategy.
Julien will give you tips and techniques on how to anticipate issues inherent to multi-datacenter clusters: how and what to monitor, hardware and network considerations, as well as data model and application-level design flaws / anti-patterns that can affect your multi-datacenter cluster's performance.
Introduction to hadoop high availability Omid Vahdaty
Understand how to create a highly available Hadoop cluster.
Active/passive with manual failover, links to help you get started, what to focus on, common mistakes, etc.
Setting up mongodb sharded cluster in 30 minutesSudheer Kondla
The document describes how to configure and deploy a MongoDB sharded cluster with 6 virtual machines in 30 minutes. It provides step-by-step instructions on installing MongoDB, setting up the config servers, adding shards, and enabling sharding for databases and collections. Key aspects include designating MongoDB instances as config servers, starting mongos processes connected to the config servers, adding shards by hostname and port, and enabling sharding on specific databases and collections with shard keys.
Redundancy for Big Hadoop Clusters is hard - Stuart PookEvention
Criteo had a Hadoop cluster with 39 PB of raw storage, 13404 CPUs, 105 TB RAM, 40 TB of data imported per day and over 100000 jobs per day. This cluster was critical in both storage and compute but had no backups. After many efforts to increase our redundancy, we now have two clusters that, combined, have more than 2000 nodes, 130 PB, two different versions of Hadoop and 200000 jobs per day, but these clusters do not yet provide a redundant solution to all our storage and compute needs. This talk discusses the choices and issues we solved in creating a 1200 node cluster with new hardware in a new data centre. Some of the challenges involved in running two different clusters in parallel will be presented. We will also analyse what went right (and wrong) in our attempt to achieve redundancy and our plans to improve our capacity to handle the loss of a data centre.
Apache Cassandra operations have a reputation for being simple on single-datacenter deployments and / or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters with high volume and / or high throughput: basic Apache Cassandra operations such as repairs, compactions or hints delivery can have dramatic consequences even on a healthy high-latency multi-datacenter cluster.
In this presentation, Julien will go through Apache Cassandra multi-datacenter concepts first, then show multi-datacenter operations essentials in detail: bootstrapping new nodes and / or datacenters, repair strategy, Java GC tuning, OS tuning, Apache Cassandra configuration and monitoring.
Based on his 3 years' experience managing a multi-datacenter cluster on Apache Cassandra 2.0, 2.1, 2.2 and 3.0, Julien will give you tips on how to anticipate and prevent / mitigate issues related to basic Apache Cassandra operations with a multi-datacenter cluster.
Advanced Apache Cassandra Operations with JMXzznate
Nodetool is a command line interface for managing a Cassandra node. It provides commands for node administration, cluster inspection, table operations and more. The nodetool info command displays node-specific information such as status, load, memory usage and cache details. The nodetool compactionstats command shows compaction status including active tasks and progress. The nodetool tablestats command displays statistics for a specific table including read/write counts, space usage, cache usage and latency.
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...DataStax
Customizing JVM settings for the needs of an application can be a tricky business, especially when running externally developed software such as Cassandra. In this talk I will share our experiences and the procedure that we have used to test and validate changes with Java tuning. We'll explore with two recent experiences: changes and monitoring of G1 garbage collection, and moving buffer objects off the heap.
For the talk, I'll discuss our tuning process at Knewton. I will share some of the challenges that we faced while identifying what we expected to learn. I'll discuss how we isolated and minimized variables across tests, the importance of the duration of these tests, and how we try to separate correlation from causation. I will demonstrate how to use and interpret the results of the custom scripts that we were driven to develop to gain visibility into our G1GC processes; these scripts will be open sourced.
About the Speaker
Carlos Monroy Senior Software Engineer, Knewton
Carlos Monroy is a senior engineer on the database team at Knewton, an education company that created an adaptive learning platform. Carlos has been developing software professionally since 1998. His experience holding multiple roles across the software lifecycle gives him a holistic approach. Having used over a half dozen relational database engines, he has recently come over to the NoSQL side, first working with HBase and for the last three years Cassandra.
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingAbdelhamide EL ARIB
This document compares Apache Spark Streaming and Apache Kafka for real-time data pipelines. It outlines the key differences between the two in areas like new file detection, processing, failure handling, deployment, scaling, and monitoring. Some key points are that Spark Streaming allows detecting new files within directories but requires separate streams for different data sources, while Kafka can detect new files across sources using a watcher connector. Kafka Connect is also better for scaling tasks up and down dynamically compared to Spark Streaming. The document recommends considering your specific data sources, sinks, and integration testing needs to determine the best solution.
This document discusses using PostgreSQL with Amazon RDS. It begins with an introduction to Amazon RDS and then discusses setting up a PostgreSQL RDS instance, available features like backups and monitoring, limitations, pricing, and references for further reading. The document is intended to provide an overview of deploying and managing PostgreSQL on Amazon RDS.
A brief history of Instagram's adoption cycle of the open source distributed database Apache Cassandra, in addition to details about its use case and implementation. This was presented at the San Francisco Cassandra Meetup at the Disqus HQ in August 2013.
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...DataStax
Making sure your Data Model will work on the production cluster after 6 months as well as it does on your laptop is an important skill. It's one that we use every day with our clients at The Last Pickle, and one that relies on tools like cassandra-stress. Knowing how the data model will perform under stress once it has been loaded with data can prevent expensive re-writes late in the project.
In this talk Christopher Batey, Consultant at The Last Pickle, will shed some light on how to use the cassandra-stress tool to test your own schema, graph the results and even how to extend the tool for your own use cases. While this may be called premature optimisation for an RDBMS, a successful Cassandra project depends on its data model.
About the Speaker
Christopher Batey Consultant / Software Engineer, The Last Pickle
Christopher (@chbatey) is a part-time consultant at The Last Pickle, where he works with clients to help them succeed with Apache Cassandra, as well as a freelance software engineer working in London. Likes: Scala, Haskell, Java, the JVM, Akka, distributed databases, XP, TDD, Pairing. Hates: Untested software, code ownership. You can check out his blog at: http://www.batey.info
Building a highly scalable website requires understanding the core building blocks of your application environment. In this talk we dive into Jahia core components to understand how they interact and how, by (1) respecting a few architectural practices and (2) fine-tuning Jahia components and the JVM, you will be able to build a highly scalable service.
Out of the box replication in postgres 9.4(pg confus)Denish Patel
This document contains notes from a presentation on PostgreSQL replication. It discusses write-ahead logs (WAL), replication history in PostgreSQL from versions 7.0 to 9.4, how to set up basic replication, tools for backups and monitoring replication, and demonstrates setting up replication without third party tools using pg_basebackup, replication slots, and pg_receivexlog. It also includes contact information for the presenter and an invitation to join the PostgreSQL Slack channel.
Cassandra Troubleshooting for 2.1 and laterJ.B. Langston
Troubleshooting Cassandra 2.1: A Guided Tour of nodetool and system.log. From Cassandra Summit 2015. Download and check out the presenter notes for tips!
I’ll give a general lay of the land for troubleshooting Cassandra. Then I’ll take you on a deep dive through nodetool and system.log and give you a guided tour of the useful information they provide for troubleshooting. I’ll devote special attention to monitoring the various processes that Cassandra uses to do its work and how to effectively search for information about specific error messages online.
Have you recently started working with Spark and your jobs are taking forever to finish? This presentation is for you.
Himanshu Arora and Nitya Nand YADAV have gathered many best practices, optimizations and adjustments they have applied over the years in production to make their jobs faster and less resource-hungry.
In this presentation they cover advanced Spark optimization techniques, data serialization formats, storage formats, hardware optimizations, control over parallelism, resource manager settings, better data locality, GC tuning, etc.
They also show the appropriate use of RDD, DataFrame and Dataset in order to benefit fully from Spark's internal optimizations.
This document discusses scaling Cassandra for big data applications. It describes how Ooyala uses Cassandra for fast access to data generated by MapReduce, high availability key-value storage from Storm, and playhead tracking for cross-device resume. It outlines Ooyala's experience migrating to newer Cassandra versions as data doubled yearly, including removing expired tombstones, schema changes, and Linux performance tuning.
Drivers connect applications to Cassandra clusters and maintain connections to nodes. They probe clusters to discover nodes, token ranges, and latency. Drivers are data-aware and can route queries to appropriate replicas or fail over if needed. Cassandra clusters can span multiple data centers for redundancy, workload separation, and geographic distribution of data and queries. Configuration files like cassandra.yaml and cassandra-env.sh are used to configure memory, data storage, caching, and other settings. Cassandra clusters should be provisioned on commodity servers using tools like cassandra-stress to test workloads and estimate needed nodes.
The document provides information on tools for monitoring and administering Cassandra clusters. It discusses Cassandra-specific tools like nodetool for monitoring metrics and performing administrative tasks. It also lists system monitoring tools for metrics like CPU usage, disk I/O, network activity, and more. Finally, it gives best practices for hardware selection with Cassandra including recommendations for memory, CPU, and disk space.
This document provides instructions for implementing an Oracle 11g R2 Real Application Cluster on a Red Hat Enterprise Linux 5.0 system using a two-node configuration. It describes pre-installation steps including hardware and network configuration, installing prerequisite packages and libraries, and configuring the Oracle ASM library driver. Detailed steps are provided for installing Oracle Grid Infrastructure and database software, and configuring the single client access name and storage area network.
Testing Delphix: easy data virtualizationFranck Pachot
The document summarizes the author's testing of the Delphix data virtualization software. Some key points:
- Delphix allows users to easily provision virtual copies of database sources on demand for tasks like testing, development, and disaster recovery.
- It works by maintaining incremental snapshots of source databases and virtualizing the data access. Copies can be provisioned in minutes and rewound to past points in time.
- The author demonstrated provisioning a copy of an Oracle database using Delphix and found the process very simple. Delphix integrates deeply with databases.
- Use cases include giving databases to each tester/developer, enabling continuous integration testing, creating QA environments with real
This document summarizes the work done to set up a computing cluster at Florida Tech. It describes installing Rocks on a frontend server and nodes, setting up Condor as the batch job system, and using OpenFiler for network attached storage. The cluster originally had one frontend and one compute node from the University of Florida. Future work involves recovering from a hard drive failure on the frontend and continuing the installation of the Open Science Grid.
Step by Step to Install oracle grid 11.2.0.3 on solaris 11.1Osama Mustafa
The document provides step-by-step instructions to install Oracle Grid Infrastructure 11g Release 2 (11.2.0.3) on Solaris 11.1. It describes preparing the OS by creating users, groups and directories. It also covers configuring networking, disks and memory parameters. The main steps are: installing Grid software and configuring ASM, followed by installing the Oracle Database and configuring it on the RAC nodes using dbca. Setting up SSH access between nodes and troubleshooting installation errors are also addressed. The goal is to build a fully configured two-node Oracle RAC environment with ASM and single sign-on capabilities.
Windows 2000 is a 32-bit operating system designed for compatibility, reliability, and performance. It includes several key components like the kernel, executive services, and environmental subsystems. The kernel schedules threads and handles exceptions/interrupts. Executive services include the object manager, virtual memory manager, process manager, and I/O manager. Environmental subsystems allow running applications from other operating systems. The document also discusses disk structure, file systems, networking, and other OS concepts.
Pollfish is a survey platform which provides access to millions of targeted users. Pollfish allows easy distribution and targeting of surveys through existing mobile apps. (https://www.pollfish.com/). At Pollfish we use Cassandra for different use cases, e.g. as an application data store to maximize write throughput when appropriate, and for our analytics project to find insights in application-generated data. As a medium to accomplish our success so far, we use DataStax's DSE 4.6 environment, which integrates Apache Cassandra, Spark and a Hadoop-compatible file system (CFS). We will discuss how we started, how the journey was and the impressions gained so far, along with some tips learned the hard way. This is the result of joint work by an excellent team here at Pollfish.
Hadoop Interview Questions and Answers by rohit kapakapa rohit
Hadoop Interview Questions and Answers - More than 130 real time questions and answers covering hadoop hdfs,mapreduce and administrative concepts by rohit kapa
With the advent of Hadoop, there comes the need for professionals skilled in Hadoop Administration making it imperative to be skilled as a Hadoop Admin for better career, salary and job opportunities.
Know how to setup a Hadoop Cluster With HDFS High Availability here : www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
The document discusses setting up a Squid proxy server on a Linux system to improve network security and performance for a home network. It recommends using an old Pentium II computer with at least 80-100MB of RAM as the proxy server. The document provides instructions for installing Squid and configuring the Squid.conf file to optimize disk usage, caching, and logging. It also explains how to set up the Squid proxy server to work with an iptables firewall for access control and protection from intruders.
The document discusses two techniques for upgrading a 10g Oracle RAC cluster to 11gR2 grid infrastructure (GI):
1) Creating a new cluster by uninstalling the existing 10g software, installing 11gR2 GI, and migrating the database and services to the new cluster.
2) Upgrading the existing cluster in place; the document discusses issues encountered with this approach during the rootUpgrade.sh script and cluster restart.
It also summarizes the steps taken to migrate an existing 11gR2 ASM configuration to an extended RAC configuration, distributing the disk groups across two separate storage systems.
- The document describes installing Oracle Real Application Clusters (RAC) and Cluster Ready Services (CRS) on a two-node Windows cluster.
- It involves a two phase installation - first installing and configuring CRS, then installing the Oracle Database with RAC.
- Key steps include configuring shared disks and partitions for the Oracle Cluster Registry, voting disk, and Automatic Storage Management; installing and configuring CRS; and then installing Oracle Database with RAC.
Impetus provides expert consulting services around Hadoop implementations, including R&D, assessment, deployment (on private and public clouds), optimizations for enhanced static shared data implementations.
This presentation speaks about Advanced Hadoop Tuning and Optimisation.
Hbase in action - Chapter 09: Deploying HBasephanleson
Hbase in action - Chapter 09: Deploying HBase
Learning HBase, Real-time Access to Your Big Data, Data Manipulation at Scale, Big Data, Text Mining, HBase, Deploying HBase
The document provides information on MongoDB replication and sharding. Replication allows for redundancy and increased data availability by synchronizing data across multiple database servers. A replica set consists of a primary node that receives writes and secondary nodes that replicate the primary. Sharding partitions data across multiple machines or shards to improve scalability and allow for larger data sets and higher throughput. Sharded clusters have shards that store data, config servers that store metadata, and query routers that direct operations to shards.
The DrupalCampLA 2011 presentation on backend performance. The slides go over optimizations that can be done through the LAMP (or now VAN LAMMP stack for even more performance) to get everything up and running.
Building Apache Cassandra clusters for massive scale
1. Building Apache Cassandra
clusters for massive scale
Covering theory and operational aspects of bringing up
Apache Cassandra clusters - this presentation can be used
as a field reference.
Alex Thompson, Solution Architect APAC - DataStax Australia Pty Ltd
3. Build a best practice reproducible machine
image using automation:
Use one of the core tested Linux distros and versions: RHEL, CentOS or Ubuntu Server.
Select a cloud server or on-premise hardware that at least meets the minimum specifications for Apache Cassandra; refer to this guide for details: Planning Apache Cassandra Hardware
For production, load testing and production-like workloads do NOT use a SAN, NAS, CEPH or any other type of shared storage; DO use directly attached SSDs.
More RAM is better and more CPU is better, but don't get stuck in the RDBMS trap of vertical scaling; Apache Cassandra works best with a larger number of medium-spec'd nodes rather than a small number of very large nodes - think horizontal scaling, not vertical scaling.
3
4. Build a best practice reproducible machine
image using automation:
Use an automation tool like Ansible, Salt, Chef or Puppet to:
1. Apply Apache Cassandra OS specific settings for Linux
2. Install Java JDK 1.8.latest
3. Install but not start Apache Cassandra via yum or apt (a tarball is also available)
4. Copy over this node's cassandra.yaml and cassandra-env.sh
5. Lock down all ports except the required Apache Cassandra ports in iptables; you can see a list of the ports and their usage here: Securing Firewall. As a simple list you need access on 22 (SSH), 7000, 7001 (SSL), 9042 (CQL), 9160 (Thrift - optional) and 7199 (JMX - optional)
Refer to the presentation by Jon from Macquarie Bank on the use of Ansible and lessons learned for an in-depth discussion on automation (November 2016 meetup); a minimal playbook sketch follows this slide.
4
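As a rough illustration of what such automation can look like, here is a trimmed Ansible play covering steps 2-5 above; the host group, package names, file locations and the iptables handling are assumptions for illustration (the deck itself does not prescribe them), and the OS tuning from step 1 is omitted for brevity.

- hosts: cassandra_nodes
  become: yes
  tasks:
    - name: Install the Java 8 JDK (step 2)
      yum:
        name: java-1.8.0-openjdk-devel
        state: present
    - name: Install but do not start Apache Cassandra (step 3)
      yum:
        name: cassandra
        state: present
    - name: Copy this node's cassandra.yaml and cassandra-env.sh (step 4)
      copy:
        src: "files/{{ inventory_hostname }}/{{ item }}"
        dest: "/etc/cassandra/conf/{{ item }}"
      loop:
        - cassandra.yaml
        - cassandra-env.sh
    - name: Allow only the required Cassandra ports through iptables (step 5)
      command: "iptables -A INPUT -p tcp --dport {{ item }} -j ACCEPT"
      loop: [22, 7000, 7001, 9042, 9160, 7199]

A real playbook would also set a default-deny policy, persist the firewall rules and apply the Linux OS settings from step 1; the point is simply that the whole node image is reproducible from source control.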
5. Minimum node specific cassandra.yaml
fields for automation deployment scripts:
cluster_name All nodes participating in a cluster must have the identical cluster name.
hints_directory Where to store hints for other nodes that are down, small disk space requirement.
authenticator Used to identify users; default is wide open, lock this down in combination with transport layer security and
on disk encryption if internet exposed.
authorizer Used to limit access/provide permissions; default is wide open, lock this down in combination with transport
layer security and on disk encryption if internet exposed.
data_file_directories Where you will store data for this node, this will be the largest consumer of disk space. You should put your
commitlog_directory and data_file_directories on different drives for performance.
commitlog_directory You should put your commitlog_directory and data_file_directories on different drives for performance.
saved_caches_directory Where to store your “fast start-up” cache; small disk space requirement.
5
6. Minimum node specific cassandra.yaml
fields for automation deployment scripts:
seeds When bootstrapping a new node into a cluster, the bootstrapping node will refer to a seed node to learn the topology of the cluster; with this information it can take ownership of token ranges and begin data transfer.
listen_address The ip-address of the node for a single homed 1x NIC node.
rpc_address The ip-address of the node for a single homed 1x NIC node.
endpoint_snitch GossipingPropertyFileSnitch
1. The parameter list above is for a basic C* cluster, leaving many unlisted parameters at their default settings; the defaults are very sane for most use cases but can be fine-tuned to maximize performance and hardware utilisation. Only tweak the unlisted parameters when you know what you are doing.
2. The parameters listed above are in top-down order as of 13/2/2017 for the github.com master Apache Cassandra repository here: cassandra.yaml
6
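Pulled together, the node-specific fragment of cassandra.yaml for a node joining the example cluster might look like the sketch below; the cluster name, directory paths and IP addresses are placeholders, and PasswordAuthenticator / CassandraAuthorizer are shown only as one way of locking down the wide-open defaults.

cluster_name: 'prod_cluster'
hints_directory: /var/lib/cassandra/hints
authenticator: PasswordAuthenticator        # default AllowAllAuthenticator is wide open
authorizer: CassandraAuthorizer             # default AllowAllAuthorizer is wide open
data_file_directories:
    - /data1/cassandra/data                 # on a different drive to the commitlog
commitlog_directory: /commitlog/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.10.3.62,10.10.3.63"  # existing nodes in the cluster
listen_address: 10.10.3.64                  # this node's own address (single homed 1x NIC)
rpc_address: 10.10.3.64
endpoint_snitch: GossipingPropertyFileSnitch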
7. Minimum node specific cassandra-env.sh
fields for automation deployment scripts:
If cassandra-env.sh is left in its default form it will allocate ¼ of the node's RAM to Apache Cassandra; this can be problematic on very small spec'd nodes, as C* really needs a minimum 4GB HEAP allocation to function in development.
As a general rule, if HEAP <= 16GB use ParNew/CMS GC; if HEAP > 16GB use G1 GC.
You set the HEAP by uncommenting the following in the cassandra-env.sh:
#MAX_HEAP_SIZE="4G"
#HEAP_NEWSIZE="800M"
G1 requires that only MAX_HEAP_SIZE be set.
In production the HEAP settings with G1 GC are usually 16, 24 or 32GB.
ParNew/CMS requires that both be set; as a guide, HEAP_NEWSIZE should be 20-25% of MAX_HEAP_SIZE.
7
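As an illustrative example (the sizes below just apply the rules of thumb from this slide, not recommendations for your hardware), the uncommented lines in cassandra-env.sh might end up as:

# ParNew/CMS: set both, with HEAP_NEWSIZE at roughly 20-25% of MAX_HEAP_SIZE
MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4G"

# G1: set only MAX_HEAP_SIZE and leave HEAP_NEWSIZE commented out
# MAX_HEAP_SIZE="24G"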
8. Summary so far
We now have a node that:
1. Is on the correct hardware
2. Has correct OS with basic tuning in place
3. Has the correct Java JDK version
4. Has Apache Cassandra installed via yum or apt
5. Has customised cassandra.yaml and cassandra-env.sh files
6. Has been secured at IPtable level
7. Can now be started and bootstrapped against a seed in the cluster
8
10. Bringing up the first node...
This is a new cluster, so when bringing up the first node there is in effect nothing to bootstrap against; Cassandra understands this and initialises the node without going through the bootstrapping phase.
>service cassandra start
Check /var/log/cassandra/system.log for startup process and monitor for any warnings or exceptions.
You most likely want to bring up multiple nodes at once in the new cluster; for the sake of this presentation I am looking at one at a time so that I can break down the bootstrapping phases. To skip that and bring multiple nodes up at once, follow the documentation here:
Initializing a multiple node cluster (single datacenter)
10
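On a packaged install, the start-and-verify sequence for this first node is typically something like the following (log location as used throughout this deck):

>service cassandra start
>tail -f /var/log/cassandra/system.log                  # watch the startup sequence complete
>grep -E "WARN|ERROR" /var/log/cassandra/system.log     # any warnings or exceptions?
>nodetool status                                        # the single node should report UN (Up Normal)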
11. Load some data
Load some data into the first node.
Here I am going to use the
cassandra-stress tool to load 100GB of
sample data.
Cassandra-stress can be used for
loading sample data and/or stress
testing a Cassandra cluster with read /
write workloads.
You can read more about
cassandra-stress here.
[Diagram: node 1 owning tokens 0-9 with 100GB of data on disk]
11
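For example, a write-only run of the default cassandra-stress schema against the first node might look like the line below; the operation count needed to reach roughly 100GB depends entirely on row size, so treat n= (and the thread count) as placeholders to tune for your setup.

>cassandra-stress write n=50000000 cl=ONE -node 10.10.3.62 -rate threads=100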
12. Bootstrapping the second node...
Put the ip-address of the first node in the seed list of this node’s cassandra.yaml
>service cassandra start
Check /var/log/cassandra/system.log for bootstrapping progress.
12
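Concretely, that means the second node's seed_provider block points at the first node before the service is started; the JOINING lines in system.log then track the bootstrap (IP taken from the example cluster, log path as per the packaged install).

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.10.3.62"   # the first node

>grep JOINING /var/log/cassandra/system.log   # e.g. "JOINING: Starting to bootstrap..."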
13. Bootstrapping the second node...
Run the following on the first node and you will see your new node in UJ state - Up Joining:
>nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.10.3.62 100 GB 256 ? c934ced4-b1c9-4f0f-b278-83282cd7107f RAC2
UJ 10.10.3.63 3 MB 256 ? 1a3df7fa-a1e7-464a-9495-c6a52d61eafa RAC3
13
14. Bootstrapping...what happened?
So what is happening in this bootstrapping phase?
In the Up Joining (UJ) state the node is not actively participating in any queries, either read or write, for both internode and client-to-node traffic.
1. A calculation is done for this node's share of the token space; in this case it takes half of the token space, as it is one of only two nodes in the ring, and in taking half the token space it takes responsibility for half the data in the ring.
2. The node begins streaming in the data from the first node for its tokens.
3. The node completes streaming its data from the first node; this can take considerable time for hundreds of GBs of data.
4. The node changes state to UN (Up Normal).
5. The node can now be discovered by drivers and their application servers and starts responding to read / write requests.
14
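While the new node sits in UJ state you can watch the streaming progress from either node; for example:

>nodetool netstats          # active streaming sessions, files and bytes transferred so far
>nodetool compactionstats   # compaction / index build activity generated by the incoming data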
15. Data streaming
during bootstrap
Be aware, on small clusters, of the costs of bootstrapping: the data streaming phase can consume considerable resources and take increasing amounts of time for very large amounts of data.
[Diagram: node 1 (tokens 0-4, 100GB of data on disk) streaming to node 2 (tokens 5-9, data on disk growing)]
15
16. Second node
added
Notice that the second node now owns
half of the tokens in the ring.
Notice that the data on node 1 is
100GB on disk and the data on the
new node 2 is only 50GB on disk.
[Diagram: node 1 (tokens 0-4, 100GB on disk) and node 2 (tokens 5-9, 50GB on disk)]
16
17. Bootstrapping data...WTF?
In bootstrapping the new node, we know it took half the data off the first node, but the amount of disk space used on the first node didn't change. It didn't go down? WTF is going on here? Something is broken!
Rule: Bootstrapping a new node into a cluster does NOT clean up after itself and delete the orphaned data on the original nodes!
Don't get me wrong, the orphaned data on the first node is not hurting anything: it's not used anymore, it just sits there using up precious space. Let's get rid of it by running the following command on the first node:
>nodetool cleanup
Note that in a Vnode cluster (most likely what you will be using) you have to run nodetool cleanup on all nodes in the
DC except of course the node you just added.
17
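Since cleanup is a per-node, compaction-like operation, a common approach is to run it host by host, one node at a time, to limit the I/O impact; the host list below is a placeholder for every node in the DC except the one just added.

for host in 10.10.3.62 10.10.3.64 10.10.3.65; do   # every existing node, NOT the newly added node
    ssh "$host" nodetool cleanup
done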
18. After cleanup
After [nodetool cleanup] has run, data is once again evenly distributed over the nodes.
[Diagram: node 1 (tokens 0-4, 50GB on disk) and node 2 (tokens 5-9, 50GB on disk)]
18
19. Powerful
implications
We just doubled the raw compute
capacity of our database tier in the
following ways:
1. Doubled IO throughput
2. Doubled the amount of RAM
3. Doubled the amount of disk
4. Doubled the number of CPUs
[Diagram: node 1 (tokens 0-4, 50GB on disk) and node 2 (tokens 5-9, 50GB on disk)]
19
20. Powerful
implications
The effect at the application tier is
arguably more profound, we have
doubled the workload capacity of the
underlying database tier to handle
increases in application tier traffic. So
as our workload increases at the
application tier we simply add nodes at
the Cassandra cluster level to soak up
the workload increase.
*The tps figures in this series are not real; your tps limits will depend on your hardware, data model, replication_factor and how you read / write data. Use cassandra-stress to emulate your real-world traffic patterns and record performance behaviour.
[Diagram: application server (max 5000 tps) driving a single node at 1000 tps]
20
21. Powerful
implications
The effect at the application tier is
arguably more profound, we have
doubled the workload capacity of the
underlying database tier to handle
increases in application tier traffic. So
as our workload increases at the
application tier we simply add nodes at
the Cassandra cluster level to soak up
the workload increase.
[Diagram: application server (max 5000 tps) driving nodes 1 and 2 at 1000 tps each]
21
22. Practical
considerations
There is not much use having a two-node cluster; you really want a minimum of 3 nodes and a replication_factor of 3, and then scale out your cluster from there.
[Diagram: minimum three-node cluster]
22
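For reference, "replication_factor of 3" is simply what you declare on the keyspace; with the GossipingPropertyFileSnitch from earlier that could look like the statement below, where the keyspace and data center names are placeholders.

CREATE KEYSPACE myKeyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}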
23. Practical
considerations
Here we have stayed with a single
application server which is not a really
good idea from a redundancy
perspective but there is another
problem.
The tps capacity of the database tier
has scaled past the tps capacity of the
application tier, leaving the database
tier under-utilized.
[Diagram: nine-node cluster (nodes 1-9) capable of 9000 tps behind an application server with a max of 5000 tps]
23
24. Practical
considerations
Time to start scaling out the
application tier to fully utilize the
capacity of the database tier.
[Diagram: nine-node cluster (nodes 1-9) capable of 9000 tps behind an application server scaled to a max of 10000 tps]
24
25. Triggers for adding more nodes and
capacity planning
Too much data per node: You want to aim for 500GB-1TB of data per node; the more data per node, the longer repairs, bootstrapping and compactions take.
Insufficient free space on drives: For SizeTieredCompactionStrategy (the default) you need 50% of the disk free at all times in the worst case.
Poor IO performance: If you have done everything right in regards to the amount of data per node, have directly attached SSDs and have tuned both your hardware and Cassandra to maximize IO performance, and you still have poor IO performance, then you need to scale out of the problem.
Bottlenecked CPUs: Same as above; if you have done everything right and tuned both your hardware and Cassandra to maximize CPU performance and you still have poor CPU performance, then you need to scale out of the problem.
25
26. Triggers for adding more nodes and
capacity planning
Poor JVM GC behaviour: This can be tricky to troubleshoot; more than likely it's just a scale-out fix as you are overloading the nodes with read / write traffic, but there are cases where a poor access pattern or problematic use case can be the cause of GC churning.
Adding additional keyspaces and application workloads to the cluster: Workloads are cumulative in resource demand.
Increases in application tier traffic: If you double the amount of requests against your application tier, the relationship with Cassandra is linear; you will need to double the number of nodes in your cluster to maintain the same performance. It's simple maths.
26
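One way to spot-check a node against these triggers is with nodetool and standard OS utilities; the commands below are suggestions rather than anything prescribed in the deck, the data path is a placeholder from your data_file_directories, and the thresholds in the comments are just the rules of thumb from these two slides.

>nodetool status          # Load column: aim for roughly 500GB-1TB per node
>df -h /data1/cassandra   # STCS worst case wants ~50% of the data drive free
>iostat -x 5              # await / %util expose saturated drives
>nodetool tpstats         # pending / blocked thread pool stages point at CPU or IO bottlenecks
>nodetool gcstats         # GC pause counts and durations (available from Cassandra 2.2)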
27. Summary so far
Now we have a basic cluster of 9 nodes that we can continue to scale out.
What we do not have is any form of redundancy:
1. What if a shared switch goes down?
2. What if a common rack chassis power supply goes down?
3. What if we lose the network to this physical data center?
Cassandra has probably the best answer to this of any DB solution available: the logical data center.
27
29. cluster
Data centers
Cassandra data centers (DCs) are a
logical not physical concept.
A Cassandra cluster is made up of
data centers and each data center
holds a complete token range.
You write your data to one data center and it is replicated to another data center; that other data center could be in the same rack or across the world.
A cluster can have many data centers
but practical limits do apply.
[Diagram: one cluster containing two data centers, DC1 and DC2, each a ring of nodes 1-9]
29
30. cluster
Data centers
Data centers are a versatile concept and can be used for many differing purposes; here are some examples:
1. Simple redundancy
2. Active failover from app tier
3. Geo edge serving
4. Workload isolation
As mentioned before, each DC holds a complete token range for the keyspaces that are replicated to it; you decide which keyspaces are replicated.
[Diagram: one cluster containing two data centers, DC1 and DC2, each a ring of nodes 1-9]
CREATE KEYSPACE myKeyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}
30
31. cluster
Simple redundancy
This multi-DC cluster is a simple redundancy setup; if we lose us-east-1 due to an outage we can access us-west-1 for the data for business continuity.
[Diagram: us-east-1 (the read/write DC) and us-west-1, each a ring of nodes 1-9]
CREATE KEYSPACE myKeyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east-1': '3', 'us-west-1': '3'}
31
32. cluster
Active failover
This multi-DC cluster is an active failover setup; if we lose us-east-1 due to an outage we can fail over the application servers to us-west-1. This can be configured at the Cassandra driver level*, in custom code, at the network layer or at the DNS level.
* See the April 2016 Sydney Cassandra Users
Meetup talk that covers most aspects of driver
configuration and strategies.
[Diagram: the read/write DC us-east-1 actively fails over to the us-west-1 DC; each DC is a ring of nodes 1-9]
CREATE KEYSPACE myKeyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east-1': '3', 'us-west-1': '3'}
32
33. cluster
Geo edge serving
All DCs are close to their own in-country app servers.
Writes can be handled in any number of ways; reads are always from the closest DC.
Any write to any DC replicates to the other 3 geographic locations.
[Diagram: four data centers, US-DC, EU-DC, ME-DC and AP-DC, each a ring of nodes 1-9]
CREATE KEYSPACE myKeyspace
WITH replication =
{'class': 'NetworkTopologyStrategy', 'US-DC': '3', 'EU-DC': '3', 'ME-DC': '3', 'AP-DC': '3'}
33
35. cluster
Workload isolation
Apart from simple redundancy this is the most
important use of logical data centers in
Cassandra.
Different workloads are pointed to different data centers, allowing us to isolate, say, a spiky web workload from an analytic Spark workload; we can then independently scale each DC to its own workload, making the most efficient use of resources.
In this example we replicate cass-DC tables to
spark-DC, perform analytics on them and write
to recommendation tables in the spark-DC
which replicate back to the cass-DC.
[Diagram: the app server reads / writes to cass-DC while Spark runs against spark-DC; each DC is a ring of nodes 1-9]
CREATE KEYSPACE "web-tables"
WITH replication = {'class': 'NetworkTopologyStrategy', 'cass-DC': '3', 'spark-DC': '2'}
CREATE KEYSPACE "recommendation-tables"
WITH replication = {'class': 'NetworkTopologyStrategy', 'spark-DC': '2', 'cass-DC': '3'}
35
36. C* Learning resources
The DataStax documentation has more extensive descriptions of all the concepts listed here; please refer to it if you need more in-depth knowledge, and don't forget academy.datastax.com for full courses and a multitude of Apache Cassandra learning resources.
36