A presentation for Data Day Austin on January 29th, 2011
Introduces how to effectively use Apache Cassandra for Java developers using the Hector Java client: http://github.com/rantav/hector
Introduction to Apache Cassandra for Developers - zznate
The document provides an introduction to Apache Cassandra for Java developers, explaining key concepts such as data storage, compaction, consistency levels, and how Cassandra differs from relational databases; it also demonstrates examples of performing common operations like reading, writing, and deleting data using the Hector client library.
Introduction to Cassandra: Replication and Consistency - Benjamin Black
A short introduction to replication and consistency in the Cassandra distributed database. Delivered April 28th, 2010 at the Seattle Scalability Meetup.
Introduction to Apache Cassandra for Java Developers (JavaOne) - zznate
The database industry has been abuzz over the past year about NoSQL databases. Apache Cassandra, which has quickly emerged as a best-of-breed solution in this space, is used at many companies to achieve unprecedented scale while maintaining streamlined operations.
This presentation goes beyond the hype, buzzwords, and rehashed slides and actually presents the attendees with a hands-on, step-by-step tutorial on how to write a Java application on top of Apache Cassandra. It focuses on concepts such as idempotence, tunable consistency, and shared-nothing clusters to help attendees get started with Apache Cassandra quickly while avoiding common pitfalls.
Apache Cassandra operations have a reputation for being simple on single-datacenter deployments and/or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters with high volume and/or high throughput: basic Apache Cassandra operations such as repairs, compactions, or hints delivery can have dramatic consequences even on a healthy high-latency multi-datacenter cluster.
In this presentation, Julien will first go through Apache Cassandra multi-datacenter concepts, then show multi-datacenter operations essentials in detail: bootstrapping new nodes and/or datacenters, repair strategy, Java GC tuning, OS tuning, and Apache Cassandra configuration and monitoring.
Based on his three years of experience managing a multi-datacenter cluster on Apache Cassandra 2.0, 2.1, 2.2, and 3.0, Julien will give you tips on how to anticipate and prevent or mitigate issues related to basic Apache Cassandra operations on a multi-datacenter cluster.
About the Speaker
Julien Anguenot VP Software Engineering, iland Internet Solutions, Corp
Julien currently serves as iland's Vice President of Software Engineering. Prior to joining iland, Mr. Anguenot held tech leadership positions at several open source content management vendors and tech startups in Europe and in the U.S. Julien is a long-time open source software advocate, contributor, and speaker: a Zope, ZODB, and Nuxeo contributor and a member of the Zope and OpenStack foundations, he has spoken at ApacheCon, Cassandra Summit, OpenStack Summit, The WWW Conference, and EuroPython.
Details behind the Apache Cassandra 2.0 release and what is new in it, including lightweight transactions (compare-and-swap), eager retries, improved compaction, triggers (experimental), CQL cursors, and more!
This document provides an overview of Apache Cassandra and how it can be used to build a Twitter-like application called Twissandra. It describes Cassandra's data model using keyspaces and column families, and how they can be mapped to represent users, tweets, followers, and more. It also shows examples of common operations like inserting and querying data. The goal is to illustrate how Cassandra addresses issues like scalability and availability in a way relational databases cannot, and how it can be used to build distributed, highly available applications.
Cassandra concepts, patterns and anti-patterns - Dave Gardner
The document discusses Cassandra concepts, patterns, and anti-patterns. It begins with an agenda that covers choosing NoSQL, Cassandra concepts based on Dynamo and Bigtable, and patterns and anti-patterns of use. It then delves into Cassandra concepts such as consistent hashing, vector clocks, gossip protocol, hinted handoff, read repair, and consistency levels. It also discusses Bigtable concepts like sparse column-based data model, SSTables, commit log, and memtables. Finally, it outlines several patterns and anti-patterns of Cassandra use.
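The consistent hashing mentioned in the abstract above is easy to sketch. The following toy Python ring is purely illustrative (the node names are made up, and Cassandra's actual partitioners differ in detail, though RandomPartitioner was likewise MD5-based): each node owns a token, and a key belongs to the first node clockwise from the key's hash.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Hash a key onto the ring (illustrative; MD5 here, as in RandomPartitioner).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each node owns the arc ending at its token."""
    def __init__(self, nodes):
        self.tokens = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = _hash(key)
        # Find the first token clockwise from the key's hash; wrap at the end.
        i = bisect.bisect(self.tokens, (h, chr(0x10FFFF)))
        return self.tokens[i % len(self.tokens)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```

The payoff of this scheme is that adding or removing one node remaps only the keys on that node's arc, rather than rehashing the whole keyspace.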
This document introduces Apache Cassandra, a distributed column-oriented NoSQL database. It discusses Cassandra's architecture, data model, query language (CQL), and how to install and run Cassandra. Key points covered include Cassandra's linear scalability, high availability and fault tolerance. The document also demonstrates how to use the nodetool utility and provides guidance on backing up and restoring Cassandra data.
This document provides instructions for downloading and configuring Apache Cassandra, including ensuring necessary properties are configured in the cassandra.yaml file. It also outlines how to use the Cassandra CQL shell to describe and interact with the cluster, keyspaces and tables. Finally, it mentions the DataStax tools DevCenter and OpsCenter for inserting and analyzing Cassandra data.
Apache Cassandra operations have a reputation for being quite simple on single-datacenter and/or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters: basic operations such as repair, compaction, or hints delivery can have dramatic consequences even on a healthy cluster.
In this presentation, Julien will go through Cassandra operations in detail: bootstrapping new nodes and/or datacenters, repair strategies, compaction strategies, GC tuning, OS tuning, large-batch data removal, and Apache Cassandra upgrade strategy.
Julien will give you tips and techniques on how to anticipate issues inherent to multi-datacenter clusters: how and what to monitor, hardware and network considerations, as well as data model and application-level design mistakes and anti-patterns that can affect your multi-datacenter cluster's performance.
This document provides an overview of distributed key-value stores and Cassandra. It discusses key concepts like data partitioning, replication, and consistency models. It also summarizes Cassandra's features such as high availability, elastic scalability, and support for different data models. Code examples are given to demonstrate basic usage of the Cassandra client API for operations like insert, get, multiget and range queries.
Spark Streaming Recipes and "Exactly Once" Semantics Revised - Michael Spector
This document discusses stream processing with Apache Spark. It begins with an overview of Spark Streaming and its advantages over other frameworks like low latency and rich APIs. It then covers core Spark Streaming concepts like windowing and achieving "exactly once" semantics through checkpointing and write ahead logs. The document presents two examples of using Spark Streaming for analytics and aggregation with transactional and snapshotted approaches. It concludes with notes on deployment with Mesos/Marathon and performance tuning Spark Streaming jobs.
Cassandra is an open source, distributed, decentralized, and fault-tolerant NoSQL database that is highly scalable and provides tunable consistency. It was created at Facebook based on Amazon's Dynamo and Google's Bigtable. Cassandra's key features include elastic scalability through horizontal partitioning, high availability with no single point of failure, tunable consistency levels, and a column-oriented data model with a CQL interface. Major companies like eBay, Netflix, and Apple use Cassandra for applications requiring large volumes of writes, geographical distribution, and evolving data models.
Understanding Data Consistency in Apache Cassandra - DataStax
This document provides an overview of data consistency in Apache Cassandra. It discusses how Cassandra writes data to commit logs and memtables before flushing to SSTables. It also reviews the CAP theorem and how Cassandra offers tunable consistency levels for both reads and writes. Strategies for choosing consistency levels for writes, such as ANY, ONE, QUORUM, and ALL are presented. The document also covers read repair and hinted handoffs in Cassandra. Examples of CQL queries with different consistency levels are given and information on where to download Cassandra is provided at the end.
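The tunable consistency levels described above boil down to simple arithmetic: a read is guaranteed to see the latest write when the read and write replica counts overlap, i.e. R + W > RF. A minimal sketch of that rule (illustrative only; the CL names map to replica counts as in Cassandra at RF=3, where QUORUM is 2):

```python
def quorum(replication_factor: int) -> int:
    # A quorum is a strict majority of the replicas.
    return replication_factor // 2 + 1

def is_strongly_consistent(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    # Reads overlap writes (and thus see at least one up-to-date
    # replica) exactly when R + W > RF.
    return write_replicas + read_replicas > replication_factor

rf = 3
assert quorum(rf) == 2
# QUORUM writes + QUORUM reads overlap, so reads see the latest write.
assert is_strongly_consistent(quorum(rf), quorum(rf), rf)
# ONE + ONE gives no overlap guarantee at RF=3: eventual consistency.
assert not is_strongly_consistent(1, 1, rf)
```

This is why QUORUM/QUORUM is the usual recipe for strong consistency, while ONE/ONE trades that guarantee for lower latency.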
This course is designed to be a “fast start” on the basics of data modeling with Cassandra. We will cover some basic administration information upfront that is important to understand as you choose your data model. It is still important to take a proper admin class if you are responsible for a production instance. This course focuses on CQL3, but Thrift shall not be ignored.
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,... - DataStax
EmoDB is an open source RESTful data store built on top of Cassandra that stores JSON documents and, most notably, offers a databus that allows subscribers to watch for changes to those documents in real time. It features massive non-blocking global writes, asynchronous cross-data-center communication, and schema-less JSON content.
For non-blocking global writes, we created a "JSON delta" specification that defines incremental updates to any JSON document. Each Cassandra row is thus a sequence of deltas that serves as a Conflict-free Replicated Data Type (CRDT) for EmoDB's system of record. We introduce the concept of "distributed compactions" to frequently compact these deltas for efficient reads.
Finally, the databus forms a crucial piece of our data infrastructure and offers a change queue to real time streaming applications.
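The delta-then-compact idea above can be sketched in a few lines. This is a deliberately simplified stand-in, not EmoDB's actual JSON-delta language (which is much richer): here a delta is just a dict of field updates, with None meaning "delete this field", and compaction folds a row's delta sequence into one resolved document so reads stop replaying history.

```python
def apply_delta(doc: dict, delta: dict) -> dict:
    """Apply one incremental update (hypothetical format: a flat dict of
    field updates, where a value of None deletes the field)."""
    out = dict(doc)
    for key, value in delta.items():
        if value is None:
            out.pop(key, None)
        else:
            out[key] = value
    return out

def compact(deltas) -> dict:
    # Fold the ordered delta sequence into a single resolved document,
    # as a "distributed compaction" would, for efficient reads.
    doc = {}
    for delta in deltas:
        doc = apply_delta(doc, delta)
    return doc

row = [{"rating": 4}, {"title": "Great"}, {"rating": 5, "title": None}]
assert compact(row) == {"rating": 5}
```

Because deltas commute with replication order per field under last-writer-wins rules, appending them to a Cassandra row avoids read-modify-write cycles on the write path.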
About the Speaker
Fahd Siddiqui Lead Software Engineer, Bazaarvoice
Fahd Siddiqui is a Lead Software Engineer at Bazaarvoice on the data infrastructure team. His interests include highly scalable and distributed data systems. He holds a Master's degree in Computer Engineering from the University of Texas at Austin and frequently speaks at the Austin C* User Group. About Bazaarvoice: Bazaarvoice is a network that connects brands and retailers to the authentic voices of people where they shop. More at www.bazaarvoice.com
Big Data Day LA 2015 - Sparking up your Cassandra Cluster - Analytics made Awe... - Data Con LA
After a brief technical introduction to Apache Cassandra, we'll go into the exciting world of Apache Spark integration and learn how you can turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!) and is widely seen as the replacement for Hadoop MapReduce. Apache Spark and Cassandra are perfect allies: Cassandra does the distributed data storage, and Spark does the distributed computation.
This document discusses using Node.js and Cassandra for highly concurrent systems. It explains that Node.js is well-suited for I/O-bound applications with low CPU usage that require high concurrency, because Node.js uses an event-driven, non-blocking model that handles connections efficiently in a single thread without much overhead. The document also introduces the Cassandra driver for Node.js, which features connection pooling, load balancing, retry functions, and row/field streaming for efficiently accessing Cassandra from Node.js applications. Examples are given showing how to perform queries and stream rows and fields to responses.
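The event-driven, non-blocking model the abstract describes is not Node.js-specific; the same effect can be shown with Python's asyncio. In this sketch (the query is simulated with a sleep, standing in for an I/O-bound database call) fifty 100 ms "queries" run concurrently on a single thread and finish in roughly 0.1 s rather than 5 s:

```python
import asyncio
import time

async def fake_query(i: int) -> int:
    # Simulated I/O-bound call (e.g. a database read): the sleep yields
    # the single thread to the event loop instead of blocking it.
    await asyncio.sleep(0.1)
    return i

async def main() -> float:
    start = time.monotonic()
    # Launch 50 concurrent "queries"; the loop interleaves their waits.
    results = await asyncio.gather(*(fake_query(i) for i in range(50)))
    assert results == list(range(50))
    return time.monotonic() - start

elapsed = asyncio.run(main())
```

Running the queries sequentially would take about 5 seconds of wall time; interleaving them on one event loop is what makes a single thread sufficient for high-concurrency, low-CPU workloads.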
Managing Objects and Data in Apache Cassandra - DataStax
This document discusses managing objects and data in Apache Cassandra. It covers the primary interfaces for managing objects and data which are the Cassandra CLI and CQL. CQL is introduced as an SQL-like language for creating, altering and removing objects as well as inserting, updating and deleting data. The document provides examples of consistency options in CQL and instructions for downloading Cassandra from DataStax.
Real-time streaming and data pipelines with Apache Kafka - Joe Stein
Get up and running quickly with Apache Kafka http://kafka.apache.org/
* Fast * A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
* Scalable * Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capacity of any single machine and to allow clusters of coordinated consumers.
* Durable * Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
* Distributed by Design * Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
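The partitioning that the "Scalable" bullet relies on is a simple deterministic mapping from message key to partition. A hedged sketch (CRC32 here purely for illustration; Kafka's default partitioner actually uses a murmur2 hash, and `num_partitions` is whatever the topic was created with):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Keyed messages map deterministically to one partition, so all
    # events for the same key land on the same broker, in order.
    return zlib.crc32(key) % num_partitions

p = partition_for(b"user-42", 12)
assert 0 <= p < 12
assert partition_for(b"user-42", 12) == p  # stable for the same key
```

Because each partition is consumed by at most one consumer in a group, this per-key stickiness is what lets a cluster of coordinated consumers scale out while preserving per-key ordering.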
Yuki Morishita from DataStax gave a presentation on the new features in Apache Cassandra 4.0. Some of the key highlights include virtual tables that allow querying system data via CQL, transient replication, which saves resources by using temporary replicas, audit logging for auditing queries and authentication, full query logging for capturing and replaying workloads, and zero-copy SSTable streaming for more efficient data transfer between nodes. Other notable changes include experimental Java 11 support, improved asynchronous messaging, removal of Thrift support, and changes to read repair handling.
This document provides an introduction and overview of Cassandra and NoSQL databases. It discusses the challenges faced by modern web applications that led to the development of NoSQL databases. It then describes Cassandra's data model, API, consistency model, and architecture including write path, read path, compactions, and more. Key features of Cassandra like tunable consistency levels and high availability are also highlighted.
This document provides an overview of Apache Cassandra including:
- What Cassandra is and how it differs from an RDBMS by not supporting joins, having an optional schema, and being transactionless.
- Cassandra's data model using keyspaces, column families, and static vs dynamic column families.
- How to integrate Cassandra with Java applications using the Hector client and ColumnFamilyTemplate for querying, updating, and deleting data.
- Additional topics covered include the CAP theorem, data storage and compaction, and using CQL via JDBC.
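The "data storage and compaction" topic in the list above has a compact core idea: writes accumulate in immutable sorted files (SSTables), and compaction merges them, keeping only the newest cell per key. A rough Python sketch under simplified assumptions (each SSTable is modeled as a dict of key to (value, timestamp); real SSTables are sorted on-disk structures with tombstones, TTLs, and per-column resolution):

```python
def compact_sstables(sstables):
    """Merge immutable runs; for each key keep the cell with the newest
    timestamp, mirroring Cassandra's last-write-wins reconciliation."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged

older = {"user:1": ("alice", 100), "user:2": ("bob", 100)}
newer = {"user:1": ("alice-renamed", 200)}
assert compact_sstables([older, newer]) == {
    "user:1": ("alice-renamed", 200),
    "user:2": ("bob", 100),
}
```

This is also why Cassandra writes are fast: updates never modify files in place; obsolete versions are simply dropped the next time their SSTables are compacted together.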
Building a High-Performance Database with Scala, Akka, and Spark - Evan Chan
Here is my talk at Scala by the Bay 2016, Building a High-Performance Database with Scala, Akka, and Spark. Covers integration of Akka and Spark, when to use actors and futures, back pressure, reactive monitoring with Kamon, and more.
This document provides an overview of Scala-ActiveRecord, a type-safe Active Record model library for Scala. It discusses features such as being type-safe, having Rails ActiveRecord-like functionality, automatic transaction control, and support for associations and validations. The document also covers getting started, defining schemas, CRUD operations, queries, caching queries, validations, callbacks, and relationships.
The document provides an overview of the Apache Tomcat web server and servlet container. It discusses Tomcat's history and architecture, how applications are deployed, and how requests are processed. Performance optimization techniques are also covered, noting that Tomcat is designed for scalability out of the box with minimal tuning typically required.
Catalyst is a web framework for Perl that allows developers to build dynamic web applications in a modular, reusable way. It utilizes common Perl techniques like Moose, DBIx::Class and Template Toolkit to handle tasks like object modeling, database access and view rendering. Catalyst applications can be built in a model-view-controller style to separate application logic, data access and presentation layers. This framework provides a standard way to write reusable code and build web UIs for tasks like system administration and automation.
Sparklife - Life In The Trenches With Spark - Ian Pointer
This document provides tips and tricks for using Apache Spark. It discusses both the benefits of Spark, such as its developer-friendly API and performance advantages over MapReduce, as well as challenges, such as unstable APIs and the difficulty of distributed systems. It provides recommendations for optimizing Spark applications, including choosing the right data structures, partitioning strategies, and debugging and monitoring techniques. It also briefly compares Spark to other streaming frameworks like Storm, Heron, Flink, and Kafka.
The document discusses Java architecture and fundamentals. It can be summarized as:
1. Java's architecture consists of four main components: the Java programming language, Java class files, the Java API, and the Java Virtual Machine (JVM).
2. When a Java program is written and run, it uses these four technologies. The program is written in Java source code and compiled to class files, which are then run on the JVM along with the Java API library.
3. The JVM handles execution by using areas like the method area for bytecode storage, the Java stack for method calls and parameters, and the heap for object instantiation and garbage collection.
The document discusses Java memory management and how it is divided into different areas - the stack space, heap space, and string pool. The stack space stores method calls and references to objects in heap space. The heap space stores all objects created during program execution. The string pool stores and interns string literals to save memory and improve performance during string comparisons. Examples are provided to illustrate how objects and primitives are passed in methods and how string interning works.
This document summarizes a presentation about deploying Tomcat clusters in an advanced Blackboard environment. The presentation introduces clustering concepts and techniques for planning and deploying Tomcat application clusters. It defines what clusters and nodes are, and explains the differences between clustered and load balanced nodes. The presentation reviews configuration requirements for setting up a Tomcat cluster and load distribution, and provides examples of cluster code. It also discusses session replication, guidelines for setting up clusters, and includes benchmark statistics comparing clustered and non-clustered configurations.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... - Helena Edelson
O'Reilly webcast with myself and Evan Chan on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB, and Kafka.
This document provides an overview of using Cassandra in web applications. It discusses why developers may consider using a NoSQL solution like Cassandra over traditional SQL databases. It then covers topics like Cassandra's architecture, data modeling, configuration options, APIs, development tools, and examples of companies using Cassandra in production systems. Key points emphasized are that Cassandra offers high performance but requires rewriting code and developing new processes and tools to support its flexible schema and data model.
This document discusses using JDBC to access databases from Java applications like JSP pages. It covers loading the appropriate JDBC driver, establishing a connection with the database using a connection URL, executing SQL statements using Statement objects to retrieve and process result sets, and closing the connection when done. The core steps are to load the driver, get a connection, create statements, execute queries/updates, process results, and close the connection.
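The JDBC workflow described above (get a connection, create statements, execute, process results, close) has a direct analog in Python's DB-API, shown here with the stdlib sqlite3 module so the sketch is self-contained; the table and data are made up for illustration:

```python
import sqlite3

# 1. Obtain a connection (the DB-API analog of loading a JDBC driver
#    and calling DriverManager.getConnection with a connection URL).
conn = sqlite3.connect(":memory:")
try:
    cur = conn.cursor()
    # 2. Create a statement and execute updates (parameterized, never
    #    string-concatenated, to avoid SQL injection).
    cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    conn.commit()
    # 3. Execute a query and process the result set.
    cur.execute("SELECT name FROM users WHERE id = ?", (1,))
    row = cur.fetchone()
finally:
    # 4. Always close the connection when done, even on error.
    conn.close()

assert row == ("alice",)
```

The try/finally mirrors the try-with-resources idiom the JDBC steps imply: a connection left open on an exception path is the classic resource leak in both APIs.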
Inside the JVM - Follow the white rabbit! / Breizh JUGSylvain Wallez
Presentation given at the Rennes (FR) Java User Group in Feb 2019.
How do we go from your Java code to the CPU assembly that actually runs it? Using high-level constructs has made us forget what happens behind the scenes, which is nonetheless key to writing efficient code.
Starting from a few lines of Java, we explore the different layers that contribute to running your code: JRE, byte code, structure of the OpenJDK virtual machine, HotSpot, intrinsic methods, benchmarking.
An introductory presentation to these low-level concerns, based on the practical use case of optimizing 6 lines of code, so that hopefully you will want to explore further!
JDD 2016 - Grzegorz Rozniecki - Java 8 What Could Possibly Go WrongPROIDEA
It’s late 2016, so you have probably been using the Java 8 goodies for a while: lambdas, Stream, Optional, the new date API ‒ stuff which makes Java development much more pleasant. But the question is: do you know these tools well? I bet you said yes, because writing sweet Java 8 code is a piece of cake ‒ you’re using efficient, parallel streams and many lambdas, so what could possibly go wrong? Let me put this straight: most probably you’re doing something wrong. In this talk I won’t actually try to prove that you don’t know what you’re doing; on the contrary, I’ll try to help you be a better programmer by pointing out a few mistakes you can make when writing Java 8 code (I know, because I made them all). I’ll also discuss a couple of common misconceptions regarding Stream and Optional, and mention missing language features (and whether there is a chance to see them in Java 9, or what library you should use instead). Last but not least, I’ll present a number of lesser-known gems I found in the deepest corners of the JDK API which, I hope, will make your life as a software developer a little bit easier.
Cassandra SF Meetup - CQL Performance With Apache Cassandra 3.Xaaronmorton
The document discusses performance improvements in Apache Cassandra 3.0's storage engine. Key improvements include delta encoding, variable integer encoding, clustering columns written only once, aggregated cell metadata, and cell presence bitmaps. This reduces storage size and improves read performance. The write path involves committing to the commit log and merging into the memtable. The read path can use clustering index filters to short-circuit searching based on deletion times, column names, or clustering ranges to avoid reading unnecessary SSTables.
Advanced Apache Cassandra Operations with JMXzznate
Nodetool is a command line interface for managing a Cassandra node. It provides commands for node administration, cluster inspection, table operations and more. The nodetool info command displays node-specific information such as status, load, memory usage and cache details. The nodetool compactionstats command shows compaction status including active tasks and progress. The nodetool tablestats command displays statistics for a specific table including read/write counts, space usage, cache usage and latency.
Describes in detail the security architecture of Apache Cassandra. We discuss encryption at rest, encryption on the wire, authentication and authorization and securing JMX and management tools
Seattle C* Meetup: Hardening cassandra for compliance or paranoiazznate
Details how to secure Apache Cassandra clusters. Covers client to server and server to server encryption, securing management and tooling, using authentication and authorization, as well as options for encryption at rest.
Tracing allows you to see the path a query takes through the Cassandra cluster. It shows details like which nodes are queried, how long each step takes, and can help identify performance bottlenecks. The tracing information can be accessed via the Java driver, cqlsh, or DevCenter and provides a detailed timeline of query execution. Reviewing traces is recommended during development to catch unexpected query behavior.
Hardening cassandra for compliance or paranoiazznate
Cassandra at rest encryption, inter-node communication encryption, client-server communication encryption, authentication, authorization, and securing JMX management were discussed. The document provided guidance on implementing encryption at rest using commercial and open source options, setting up SSL for inter-node and client-server communication using self-signed certificates, implementing authentication and authorization best practices from RDBMS, and securing JMX access.
Successful Software Development with Apache Cassandrazznate
Adding a new technology to your development process can be challenging, and the distributed nature of Apache Cassandra can make it daunting. However, recent improvements in drivers, utilities and tooling have simplified the process making it easier than ever before to develop software with Apache Cassandra.
In this presentation, we cover essential knowledge for all developers wanting to efficiently create reliable Apache Cassandra based solutions. Topics include:
- Language and Driver selection
- Optimizing Driver configuration
- Productive Developer environments using ccm, Vagrant and DataStax DevCenter
- Creating appropriate test data
- Unit testing
- Automated integration testing
- Test optimization with profiles
New and existing users will come away from this presentation with the necessary knowledge to make their next Apache Cassandra project a success.
Vert.x is a new JVM based application framework with an event driven, asynchronous programming model. With APIs available in Java, JavaScript, Ruby, Python and Groovy, developers are given complete freedom to implement their application in the language of their choice.
Starting with the core Vert.x concepts, this presentation will walk attendees through the components of a simple vert.x based application. Through this process, attendees will gain an understanding of how Vert.x:
- provides for a way to use several different languages in the same application
- takes advantage of the JVM's excellent multi-core capabilities
- uses a module-based framework for packaging and hot-deployment
- communicates with other processes via a distributed event bus
- exposes an asynchronous programming model with very simple concurrency
With this presentation, viewers should gain a deep-enough understanding of Vert.x to be able to evaluate the platform for their own projects.
The document discusses Intravert, a new transport for Apache Cassandra that uses HTTP and JSON. Some key points made include:
- Intravert aims to improve on existing transports by using HTTP and JSON to make it easier to use, secure, and test from a browser.
- It was built using Vert.x for its event-driven and modular architecture.
- Intravert examples demonstrate how to perform common Cassandra operations like slices, sets, and composites using a simple REST-like syntax with JSON payloads.
- Features discussed include flexible batching, server-side filtering, and "getref" to use results of one operation in another.
This document discusses clients and transports for Apache Cassandra. It describes the Thrift and CQL protocols for interacting with Cassandra, including benefits and drawbacks of each. Thrift uses an RPC-based approach while CQL is a binary protocol. The document also covers common data modeling patterns in Cassandra and considerations for choosing between Thrift and CQL based on use cases and application requirements.
This document discusses strategies for applying test-driven development (TDD) to Apache Cassandra projects. It notes that Cassandra's distributed and resource-intensive nature can make it difficult to integrate with TDD. Initially, the author embedded Cassandra in tests, but this led to slow test runs. Alternative tools like Cassandra Unit and the Cassandra Maven plugin were explored. The author ultimately recommends separating unit and integration tests, using the Cassandra Maven plugin without fixtures, and running tests in parallel to better apply TDD principles to Cassandra.
The document provides an overview of building applications with Apache Cassandra. It covers Cassandra basics, common API usage, storage models, understanding the ring and consistency, and integrating with web applications. The sections explore Cassandra concepts like static and dynamic column families, indexing techniques, tombstones, compaction, and using the Spring framework. Code examples are provided to demonstrate inserting, querying, and updating Cassandra data. Requirements for getting started and resources for further learning are also listed.
The document discusses Apache Cassandra, a distributed database management system designed to handle large amounts of data across many commodity servers. It was developed at Facebook and modeled after Google's Bigtable. The summary discusses key concepts like its use of consistent hashing to distribute data, support for tunable consistency levels, and focus on scalability and availability over traditional SQL features. It also provides an overview of how Cassandra differs from relational databases by not supporting joins, having an optional schema, and using a prematerialized and transaction-less model.
Hector v2: The Second Version of the Popular High-Level Java Client for Apach...zznate
This presentation will provide a preview of our new high-level API designed around community feedback and built on the solid foundation of Hector client internals currently in use by a number of production systems. A brief introduction to the existing Hector client will be included to accommodate new users.
2. Brief Intro
- NOT a "key/value store"
- Columns are dynamic inside a column family
- SSTables are immutable
- SSTables are merged on reads
- All nodes share the same role (i.e. no single point of failure)
- Trading ACID compliance for scalability is a fundamental design decision
3. How does this impact development? Substantially.
- For operations affecting the same data, that data will become consistent eventually, as determined by the timestamps. But you can trade availability for consistency. (More on this later)
- You can store whatever you want. It's all just bytes.
- You need to think about how you will query the data before you write it.
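The "consistent eventually as determined by the timestamps" point can be sketched in a few lines. This is an illustration only, not Cassandra internals: each replica stores column -> {value, timestamp}, and reconciling two replicas keeps, per column, the version with the highest timestamp.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: timestamp-based reconciliation between two replicas.
// A cell is a long[] {value, timestamp}; values are longs purely for brevity
// (in Cassandra it's all just bytes).
public class ReplicaMerge {
    // Per column, keep whichever cell carries the higher timestamp.
    static Map<String, long[]> merge(Map<String, long[]> a, Map<String, long[]> b) {
        Map<String, long[]> merged = new HashMap<>(a);
        b.forEach((col, cell) ->
            merged.merge(col, cell, (x, y) -> x[1] >= y[1] ? x : y));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, long[]> r1 = new HashMap<>();
        r1.put("lat", new long[] { 3027, 100 }); // older write
        Map<String, long[]> r2 = new HashMap<>();
        r2.put("lat", new long[] { 3032, 200 }); // newer write
        System.out.println(merge(r1, r2).get("lat")[0]); // 3032: newest timestamp wins
    }
}
```

The same merge rule applies whether replicas diverged through a network partition or through writes racing each other, which is why client-supplied timestamps matter.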
4. Neat. So Now What? Like any database, you need a client!
- Python: Telephus: http://github.com/driftx/Telephus (Twisted); Pycassa: http://github.com/pycassa/pycassa
- Java: Hector: http://github.com/rantav/hector (Examples: https://github.com/zznate/hector-examples); Pelops: http://github.com/s7/scale7-pelops; Kundera: http://code.google.com/p/kundera/; Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin; Grails: grails-cassandra: https://github.com/wolpert/grails-cassandra
- .NET: FluentCassandra: http://github.com/managedfusion/fluentcassandra; Aquiles: http://aquiles.codeplex.com/
- Ruby: Cassandra: http://github.com/fauna/cassandra
- PHP: phpcassa: http://github.com/thobbs/phpcassa; SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie
6. Thrift
- Fast, efficient serialization and network IO
- Lots of clients available (you can probably use it in other places as well)
- Why you don't want to work with the Thrift API directly: SuperColumn, ColumnOrSuperColumn, ColumnParent.super_column, ColumnPath.super_column, Map<ByteBuffer,Map<String,List<Mutation>>> mutationMap
7. Higher Level Client: Hector
- JMX counters
- Add/remove hosts: automatically or programmatically via JMX
- Pluggable load balancing
- Complete encapsulation of the Thrift API
- Type-safe approach to dealing with Apache Cassandra
- Lightweight ORM (supports JPA 1.0 annotations)
- Mavenized! http://repo2.maven.org/maven2/me/prettyprint/
8. "CQL" Currently in Apache Cassandra trunk. Experimental. Lots of possibilities. From test/system/test_cql.py:
UPDATE StandardLong1 SET 1L="1", 2L="2", 3L="3", 4L="4" WHERE KEY="aa"
SELECT "cd1", "col" FROM Standard1 WHERE KEY = "kd"
DELETE "cd1", "col" FROM Standard1 WHERE KEY = "kd"
9. Avro?? Gone. Added too much complexity after Thrift caught up. "None of the libraries distinguished themselves as being a particularly crappy choice for serialization." (See CASSANDRA-1765 )
10. Thrift API Methods
- Retrieving
- Writing/Removing
- Meta Information
- Schema Manipulation
11. Thrift API Methods - Retrieving
- get: retrieve a single column for a key
- get_slice: retrieve a "slice" of columns for a key
- multiget_slice: retrieve a "slice" of columns for a list of keys
- get_count: counts the columns of a key (the row has to be deserialized to do it)
- get_range_slices: retrieve a slice for a range of keys
- get_indexed_slices (FTW!)
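The slice idea above is worth making concrete. The following is a sketch, not the Thrift API: columns in a row are kept sorted by the comparator, so the range form of a SlicePredicate is just a bounded scan over that ordering, limited to a count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of get_slice semantics over one row, modeled as a sorted map of
// column name -> value. A range predicate is [start, finish] plus a count.
public class SliceSketch {
    static List<String> sliceRange(TreeMap<String, String> row,
                                   String start, String finish, int count) {
        List<String> names = new ArrayList<>();
        // Both bounds inclusive, mirroring a slice range over column names.
        SortedMap<String, String> range = row.subMap(start, true, finish, true);
        for (String name : range.keySet()) {
            if (names.size() == count) break; // count caps the slice
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("city", "Austin");
        row.put("lat", "30.27");
        row.put("lng", "097.74");
        row.put("state", "TX");
        System.out.println(sliceRange(row, "city", "state", 2)); // [city, lat]
    }
}
```

The other form of a SlicePredicate, an explicit list of column names, is just a set lookup instead of a range scan; Hector hides the either/or choice between the two.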
12. Thrift API Methods - Writing/Removing
- insert
- batch_mutate (batch insertion AND deletion)
- remove
- truncate**
13. Thrift API Methods - Meta Information
- describe_cluster_name
- describe_version
- describe_keyspace
- describe_keyspaces
15. vs. RDBMS - Consistency Level
- Consistency is tunable per request!
- Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor)
- ***A CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK***
- Idempotent: an operation can be applied multiple times without changing the result
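The R + W > N rule is simple arithmetic, and worth seeing with numbers. A minimal sketch: with replication factor N, if the replicas that must acknowledge a read (R) plus those that must acknowledge a write (W) exceed N, the read and write sets overlap in at least one replica, so a read always sees the latest write.

```java
// Sketch of the tunable-consistency arithmetic; method names are mine.
public class ConsistencyMath {
    static boolean stronglyConsistent(int r, int w, int n) {
        return r + w > n;
    }

    // QUORUM is a majority of the replicas.
    static int quorum(int n) {
        return n / 2 + 1;
    }

    public static void main(String[] args) {
        int n = 3; // replication factor
        // QUORUM reads + QUORUM writes: 2 + 2 > 3, strongly consistent
        System.out.println(stronglyConsistent(quorum(n), quorum(n), n)); // true
        // ONE reads + ONE writes: 1 + 1 > 3 is false, eventually consistent
        System.out.println(stronglyConsistent(1, 1, n)); // false
    }
}
```

Note the asymmetric options too: writing at ALL (W = N) lets you read at ONE (R = 1) and still satisfy the inequality, at the cost of write availability.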
16. vs. RDBMS - Append Only Proper data modelling minimizes seeks (Go to Tyler's presentation for more!)
17. On to the Code... https://github.com/zznate/cassandra-tutorial Uses Maven. Really basic. Modify/abuse/alter as needed. Descriptions of what is going on and how to run each example are in the Javadoc comments. Sample data is based on North American Numbering Plan http://en.wikipedia.org/wiki/North_American_Numbering_Plan
18. Data Shape
512 202 30.27 097.74 W TX Austin
512 203 30.27 097.74 L TX Austin
512 204 30.32 097.73 W TX Austin
512 205 30.32 097.73 W TX Austin
512 206 30.32 097.73 L TX Austin
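For orientation, a hypothetical parser for the sample rows above (not part of the tutorial code): each row is NPA (area code), NXX (exchange), latitude, longitude, a line-type flag (W or L in the sample), state, and city. Concatenating NPA + NXX makes a natural row key.

```java
// Sketch: turning one whitespace-separated NANP sample row into fields
// and a row key. Field positions are assumptions based on the sample data.
public class NpanxxParser {
    static String[] parse(String line) {
        return line.trim().split("\\s+");
    }

    // NPA + NXX, e.g. "512" + "202" -> "512202"
    static String rowKey(String[] fields) {
        return fields[0] + fields[1];
    }

    public static void main(String[] args) {
        String[] f = parse("512 202 30.27 097.74 W TX Austin");
        System.out.println(rowKey(f)); // 512202
        System.out.println(f[6]);      // Austin
    }
}
```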
19. Get a Single Column for a Key GetCityForNpanxx.java Retrieve a single column with: Name, Value, Timestamp, TTL
20. Get the Contents of a Row GetSliceForNpanxx.java Retrieves a list of columns (Hector wraps these in a ColumnSlice) "SlicePredicate" can either be explicit set of columns OR a range (more on ranges soon) Another messy either/or choice encapsulated by Hector
21. Get the (sorted!) Columns of a Row GetSliceForStateCity.java Shows why the choice of comparator is important (this is the order in which the columns hit the disk - take advantage of it) Can be easily modified to return results in reverse order (but this is slightly slower)
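Why the choice of comparator matters can be shown without a cluster. A sketch (plain JDK, not Cassandra's comparator classes): the same column names sort very differently under a string (UTF8-style) ordering vs a numeric (LongType-style) ordering, and slices follow whichever order the column family declared.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Sketch: one set of column names, two comparators, two on-disk orders.
public class ComparatorChoice {
    public static void main(String[] args) {
        TreeSet<String> asText = new TreeSet<>(); // lexicographic, UTF8-style
        TreeSet<String> asLong =                  // numeric, LongType-style
            new TreeSet<>(Comparator.comparingLong(Long::parseLong));
        for (String col : new String[] { "2", "10", "1" }) {
            asText.add(col);
            asLong.add(col);
        }
        System.out.println(asText); // [1, 10, 2]  -- string order
        System.out.println(asLong); // [1, 2, 10]  -- numeric order
    }
}
```

A range slice from "1" to "2" returns different columns in each case, which is exactly the kind of surprise the wrong comparator produces.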
22. Get the Same Slice from Several Rows MultigetSliceForNpanxx.java Very similar to get_slice examples, except we provide a list of keys
23. Get Slices From a Range of Rows GetRangeSlicesForStateCity.java Like multiget_slice, except we can specify a KeyRange (encapsulated by RangeSlicesQuery#setKeys(start, end)). The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!)
24. Get Slices From a Range of Rows - 2 GetSliceForAreaCodeCity.java Compound column name for controlling ranges Comparator at work on text field
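The compound-column-name trick deserves a sketch. This is an illustration with a hypothetical ":" delimiter, not the tutorial's actual encoding: a name like "areaCode:city" sorts component by component, so all columns for one area code are contiguous and a single range slice grabs the group.

```java
import java.util.TreeMap;

// Sketch: compound column names ordered first by area code, then by city.
public class CompoundColumns {
    static int compare(String a, String b) {
        String[] pa = a.split(":", 2), pb = b.split(":", 2);
        int c = pa[0].compareTo(pb[0]);             // first component: area code
        return c != 0 ? c : pa[1].compareTo(pb[1]); // then: city
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>(CompoundColumns::compare);
        row.put("512:Austin", "TX");
        row.put("214:Dallas", "TX");
        row.put("512:Lakeway", "TX");
        // Range slice over everything under area code 512 ("~" sorts after letters):
        System.out.println(row.subMap("512:", "512:~").keySet()); // [512:Austin, 512:Lakeway]
    }
}
```

Cassandra's actual composite comparators do this at the byte level, but the ordering consequence, contiguous groups you can slice, is the same.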
25. Get Slices from Indexed Columns GetIndexedSlicesForCityState.java You only need to index a single column to apply clauses on other columns (BUT the indexed column must be present with an EQUALS clause!) (It's just another ColumnFamily maintained automatically)
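The get_indexed_slices semantics can be sketched with plain collections (an illustration, not the server's implementation): the secondary index narrows candidate rows by the one EQ clause, and any additional clauses are applied as filters over those candidates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: rows as Map<String,String>; "state" plays the indexed column.
public class IndexedSliceSketch {
    // state must be an EQ clause on the indexed column; minLat is an extra filter.
    static List<Map<String, String>> query(List<Map<String, String>> rows,
                                           String state, double minLat) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if (!state.equals(row.get("state"))) continue;             // index lookup
            if (Double.parseDouble(row.get("lat")) < minLat) continue; // extra clause
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("state", "TX", "city", "Austin", "lat", "30.27"),
            Map.of("state", "TX", "city", "Amarillo", "lat", "35.19"),
            Map.of("state", "OK", "city", "Tulsa", "lat", "36.15"));
        System.out.println(query(rows, "TX", 31.0).get(0).get("city")); // Amarillo
    }
}
```

This is also why the EQ clause is mandatory: without it there is no index entry to start from, only a full scan.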
26. Insert, Update and Delete ... are effectively the same operation.
- InsertRowsForColumnFamilies.java
- DeleteRowsForColumnFamily.java
- Run each in succession (in whichever combination you like) and verify your results on the CLI
- Hint: watch the timestamps
bin/cassandra-cli --host localhost
use Tutorial;
list AreaCode;
list Npanxx;
list StateCity;
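"Watch the timestamps" is the whole story here, and a sketch makes it concrete (an illustration, not Cassandra internals): a delete is just another timestamped write, of a tombstone marker, and whichever write carries the highest timestamp wins at read time.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: insert, update and delete all funnel through the same
// timestamped-write operation; delete writes a tombstone.
public class TombstoneSketch {
    static final Object TOMBSTONE = new Object();

    static class Cell {
        final Object value;
        final long ts;
        Cell(Object value, long ts) { this.value = value; this.ts = ts; }
    }

    // Keep the write only if it is newer than what is stored.
    static void apply(Map<String, Cell> row, String col, Object value, long ts) {
        Cell cur = row.get(col);
        if (cur == null || ts > cur.ts) row.put(col, new Cell(value, ts));
    }

    // A tombstoned column reads as absent.
    static Object read(Map<String, Cell> row, String col) {
        Cell c = row.get(col);
        return (c == null || c.value == TOMBSTONE) ? null : c.value;
    }

    public static void main(String[] args) {
        Map<String, Cell> row = new HashMap<>();
        apply(row, "city", "Austin", 100L);
        apply(row, "city", TOMBSTONE, 200L); // delete: just a newer write
        apply(row, "city", "Dallas", 150L);  // stale write loses to the tombstone
        System.out.println(read(row, "city")); // null: the delete won
    }
}
```

This also explains the "deletions actually write data" reminder on the next slide: the tombstone occupies space until compaction removes it.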
27. Stuff I Punted on for the Sake of Brevity
- meta_* methods: CassandraClusterTest.java L43-81 @ hector
- system_* methods: SchemaManipulation.java @ hector-examples; CassandraClusterTest.java L84-157 @ hector
- ORM (it works and is in production): ORM Documentation
- multiple nodes
- failure scenarios
- Data modelling (go see Tyler's presentation)
28. Things to Remember
- deletes and timestamp granularity
- "range ghosts"
- using the wrong column comparator, and InvalidRequestException
- deletions actually write data
- use column-level TTL to automate deletion
- "how do I iterate over all the rows in a column family?" get_range_slices, but don't do that: needing to is a good sign your data model is wrong
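The column-level TTL point can be reduced to one expression (a sketch, not Cassandra's expiration code): a column written with a TTL is treated as deleted once its write time plus the TTL has passed, which automates deletion without an explicit remove.

```java
// Sketch: the liveness check behind column-level TTL.
public class TtlSketch {
    static boolean isLive(long writeTimeMillis, int ttlSeconds, long nowMillis) {
        return nowMillis < writeTimeMillis + ttlSeconds * 1000L;
    }

    public static void main(String[] args) {
        long wrote = 0L;
        System.out.println(isLive(wrote, 60, 30_000L)); // true: 30s into a 60s TTL
        System.out.println(isLive(wrote, 60, 90_000L)); // false: expired
    }
}
```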
29. Dealing with *Lots* of Data (Briefly)
The two biggest headaches have been addressed:
- Compaction pollutes the OS page cache (CASSANDRA-1470)
- More than 143 million keys in a single SSTable means more bloom filter false positives (CASSANDRA-1555)
- Hadoop integration: Yes. (Go see Jeremy's presentation)
- Bulk loading: Yes. CASSANDRA-1278
For more information: http://wiki.apache.org/cassandra/LargeDataSetConsiderations