Spark and Cassandra with the Datastax Spark Cassandra Connector
How it works and how to use it!
Missed Spark Summit but Still want to see some slides?
This slide deck is for you!
Escape From Hadoop: Spark One Liners for C* Ops (Russell Spitzer)
Apache Cassandra and Spark when combined can give powerful OLTP and OLAP functionality for your data. We’ll walk through the basics of both of these platforms before diving into applications combining the two. Usually joins, changing a partition key, or importing data can be difficult in Cassandra, but we’ll see how to do these and other operations in a set of simple Spark Shell one-liners!
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer)
This document summarizes a presentation about tuning the Spark Cassandra Connector for optimal performance. It discusses various write tuning techniques like batching writes by partition key, sorting data within partitions, and adjusting batch sizes and concurrency levels. It also covers read tuning, noting the relationship between Spark and Cassandra partitions and how to avoid out of memory errors by changing the number of partitions. Maximizing read speed requires tuning Cassandra's paging behavior. The presentation encourages contributions to the open source Spark Cassandra Connector project.
- Apache Cassandra is a linearly scalable and fault tolerant NoSQL database that increases throughput linearly with additional machines
- It is an AP system that is eventually consistent according to the CAP theorem, sacrificing consistency in favor of availability and partition tolerance
- Cassandra uses replication and consistency levels to control fault tolerance at the server and client levels respectively
- Its data model and use of SSTables allow for fast writes and queries along clustering columns
Cassandra and Spark: Optimizing for Data Locality (Russell Spitzer, DataStax) – Spark Summit
This document discusses how the Spark Cassandra Connector optimizes for data locality when performing analytics on Cassandra data using Spark. It does this by using the partition keys and token ranges to create Spark partitions that correspond to the data distribution across the Cassandra nodes, allowing work to be done locally to each data node without moving data across the network. This improves performance and avoids the costs of data shuffling.
Spark Cassandra Connector: Past, Present, and Future (Russell Spitzer)
The Spark Cassandra Connector allows integration between Spark and Cassandra for distributed analytics. Previously, integrating Hadoop and Cassandra required complex code and configuration. The connector maps Cassandra data distributed across nodes based on token ranges to Spark partitions, enabling analytics on large Cassandra datasets using Spark's APIs. This provides an easier method for tasks like generating reports, analytics, and ETL compared to previous options.
Cassandra and Spark: Optimizing for Data Locality (Russell Spitzer)
This document discusses how the Spark Cassandra Connector optimizes for data locality when performing analytics on Cassandra data using Spark. It does this by using the partition keys and token ranges to create Spark partitions that correspond to the data distribution across the Cassandra nodes, allowing work to be done locally to each data node without moving data across the network. This improves performance and avoids the costs of data shuffling.
This document provides an agenda and overview of Big Data Analytics using Spark and Cassandra. It discusses Cassandra as a distributed database and Spark as a data processing framework. It covers connecting Spark and Cassandra, reading and writing Cassandra tables as Spark RDDs, and using Spark SQL, Spark Streaming, and Spark MLLib with Cassandra data. Key capabilities of each technology are highlighted such as Cassandra's tunable consistency and Spark's fault tolerance through RDD lineage. Examples demonstrate basic operations like filtering, aggregating, and joining Cassandra data with Spark.
Apache Cassandra and Spark: You got the lighter, let's start the fire (Patrick McFadin)
Introduction to analyzing Apache Cassandra data using Apache Spark. This includes data models, operations topics, and the internals of how Spark interfaces with Cassandra.
This document provides an agenda for a presentation on Big Data Analytics with Cassandra, Spark, and MLLib. The presentation covers Spark basics, using Spark with Cassandra, Spark Streaming, Spark SQL, and Spark MLLib. It also includes examples of querying and analyzing Cassandra data with Spark and Spark SQL, and machine learning with Spark MLLib.
The document discusses how the Spark Cassandra Connector works. It explains that the connector uses information about how data is partitioned in Cassandra nodes to generate Spark partitions that correspond to the token ranges in Cassandra. This allows data to be read from Cassandra in parallel across the Spark partitions. The connector also supports automatically pushing down filter predicates to the Cassandra database to reduce the amount of data read.
Spark Cassandra Connector: API, Best Practices and Use Cases (Duyhai Doan)
- The document discusses Spark/Cassandra connector API, best practices, and use cases.
- It describes the connector architecture including support for Spark Core, SQL, and Streaming APIs. Data is read from and written to Cassandra tables mapped as RDDs.
- Best practices around data locality, failure handling, and cross-region/cluster operations are covered. Locality is important for performance.
- Use cases include data cleaning, schema migration, and analytics like joins and aggregation. The connector allows processing and analytics on Cassandra data with Spark.
Owning time series with Team Apache – Strata San Jose 2015 (Patrick McFadin)
Break out your laptops! This hands-on tutorial is geared around understanding the basics of how Apache Cassandra stores and accesses time series data. We’ll start with an overview of how Cassandra works and how that can be a perfect fit for time series. Then we will add in Apache Spark as a perfect analytics companion. There will be coding as part of the hands-on tutorial. The goal will be to take an example application and code through the different aspects of working with this unique data pattern. The final section will cover the building of an end-to-end data pipeline to ingest, process and store high speed, time series data.
An Introduction to time series with Team Apache (Patrick McFadin)
We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, even as users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day using the powerful Team Apache: Apache Kafka, Spark, and Cassandra.
Patrick walks you through organizing a stream of data into an efficient queue using Apache Kafka, processing the data in flight using Apache Spark Streaming, storing the data in a highly scalable and fault-tolerant database using Apache Cassandra, and transforming and finding insights in volumes of stored data using Apache Spark.
Topics include:
- Understanding the right use case
- Considerations when deploying Apache Kafka
- Processing streams with Apache Spark Streaming
- A deep dive into how Apache Cassandra stores data
- Integration between Cassandra and Spark
- Data models for time series
- Postprocessing without ETL using Apache Spark on Cassandra
Lightning fast analytics with Spark and Cassandra (nickmbailey)
Spark is a fast and general engine for large-scale data processing. It provides APIs for Java, Scala, and Python that allow users to load data into a distributed cluster as resilient distributed datasets (RDDs) and then perform operations like map, filter, reduce, join and save. The Cassandra Spark driver allows accessing Cassandra tables as RDDs to perform analytics and run Spark SQL queries across Cassandra data. It provides server-side data selection and mapping of rows to Scala case classes or other objects.
Time series with Apache Cassandra - Long version (Patrick McFadin)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. This talk will give you an overview of the many ways you can be successful. We will discuss how the storage model of Cassandra is well suited for this pattern and go over examples of how best to build data models.
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, DataStax)
Worried that you aren't taking full advantage of your Spark and Cassandra integration? Well worry no more! In this talk we'll take a deep dive into all of the available configuration options and see how they affect Cassandra and Spark performance. Concerned about throughput? Learn to adjust batching parameters and gain a boost in speed. Always running out of memory? We'll take a look at the various causes of OOM errors and how we can circumvent them. Want to take advantage of Cassandra's natural partitioning in Spark? Find out about the recent developments that let you perform shuffle-less joins on Cassandra-partitioned data! Come with your questions and problems and leave with answers and solutions!
About the Speaker
Russell Spitzer Software Engineer, DataStax
Russell Spitzer received a Ph.D in Bio-Informatics before finding his deep passion for distributed software. He found the perfect outlet for this passion at DataStax where he began on the Automation and Test Engineering team. He recently moved from finding bugs to making bugs as part of the Analytics team where he works on integration between Cassandra and Spark as well as other tools.
Spark streaming can be used for near-real-time data analysis of data streams. It processes data in micro-batches and provides windowing operations. Stateful operations like updateStateByKey allow tracking state across batches. Data can be obtained from sources like Kafka, Flume, HDFS and processed using transformations before being saved to destinations like Cassandra. Fault tolerance is provided by replicating batches, but some data may be lost depending on how receivers collect data.
Storing time series data with Apache Cassandra (Patrick McFadin)
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice, and now you can learn how to do it. We'll look at possible data models and the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work, and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
Nike Tech Talk: Double Down on Apache Cassandra and Spark (Patrick McFadin)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data at high velocity and high volume. This talk will give you an overview of the many ways you can be successful by introducing Apache Cassandra concepts. We will discuss how the storage model of Cassandra is well suited for this pattern and go over examples of how best to build data models. There will also be examples of how you can use Apache Spark along with Apache Cassandra to create a real time data analytics platform. It’s so easy, you will be shocked and ready to try it yourself.
You’ve heard all of the hype, but how can SMACK work for you? In this all-star lineup, you will learn how to create a reactive, scalable, resilient and performant data processing powerhouse. We will go through the basics of Akka, Kafka and Mesos and then deep dive into putting them together in an end-to-end (and back again) distributed transaction. Distributed transactions mean producers waiting for one or more consumers to respond. On the backend, you will see how Apache Cassandra and Spark can be combined to add the massively scalable storage and data analysis needed for fast data pipelines. With these technologies as a foundation, you have the assurance that scale is never a problem and uptime is the default.
Managing large volumes of data isn’t trivial and needs a plan. Fast Data is how we describe the nature of data in a heavily consumer-driven world. Fast in. Fast out. Is your data infrastructure ready? You will learn some important reference architectures for large-scale data problems. The three main areas are covered:
Organize - Manage the incoming data stream and ensure it is processed correctly and on time. No data left behind.
Process - Analyze volumes of data you receive in near real-time or in a batch. Be ready for fast serving in your application.
Store - Reliably store data in the data models to support your application. Never accept downtime or slow response times.
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) – DataStax
We built an application based on the principles of CQRS and Event Sourcing using Cassandra and Spark. During the project we encountered a number of challenges and problems with Cassandra and the Spark Connector.
In this talk we want to outline a few of those problems and our actions to solve them. While some problems are specific to CQRS and Event Sourcing applications, most of them are use-case independent.
About the Speakers
Matthias Niehoff IT-Consultant, codecentric AG
Matthias works as an IT consultant at codecentric AG in Germany. His focus is on big data & streaming applications with Apache Cassandra & Apache Spark, yet he does not lose track of other tools in the area of big data. He shares his experiences at conferences, meetups, and user groups.
Stephan Kepser Senior IT Consultant and Data Architect, codecentric AG
Dr. Stephan Kepser is an expert on cloud computing and big data. He has written a number of journal articles and blog posts on both fields. His interests range from legal questions to the architecture and design of cloud computing and big data systems to the technical details of NoSQL databases.
At this meetup Patrick McFadin, Solutions Architect at DataStax, will be discussing the most recently added features in Apache Cassandra 2.0, including: Lightweight transactions, eager retries, improved compaction, triggers, and CQL cursors. He'll also be touching on time series data with Apache Cassandra.
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra (Piotr Kolaczkowski)
The document discusses using Apache Spark and Apache Cassandra together for fast data analysis as an alternative to Hadoop. It provides examples of basic Spark operations on Cassandra tables like counting rows, filtering, joining with external data sources, and importing/exporting data. The document argues that Spark on Cassandra provides a simpler distributed processing framework compared to Hadoop.
Cassandra and Spark: closing the gap between NoSQL and analytics – Codemotion (Duyhai Doan)
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
Manchester Hadoop Meetup: Cassandra Spark internals (Christopher Batey)
This document summarizes how the Spark Cassandra Connector works to read and write data between Spark and Cassandra in a distributed manner. It discusses how the connector partitions Spark RDDs based on Cassandra token ranges and nodes, retrieves data from Cassandra in batches using CQL, and writes data back to Cassandra in batches grouped by partition key. Key classes and configuration parameters that control this distributed processing are also outlined.
This document discusses Apache Spark and Cassandra. It provides an overview of Cassandra as a shared-nothing, masterless, peer-to-peer database with great scaling. It then discusses how Spark can be used to analyze large amounts of data stored in Cassandra in parallel across a cluster. The Spark Cassandra connector allows Spark to create partitions that align with the token ranges in Cassandra, enabling efficient distributed queries across the cluster.
Analyzing Time Series Data with Apache Spark and Cassandra (Patrick McFadin)
You have collected a lot of time series data so now what? It's not going to be useful unless you can analyze what you have. Apache Spark has become the heir apparent to Map Reduce but did you know you don't need Hadoop? Apache Cassandra is a great data source for Spark jobs! Let me show you how it works, how to get useful information and the best part, storing analyzed data back into Cassandra. That's right. Kiss your ETL jobs goodbye and let's get to analyzing. This is going to be an action packed hour of theory, code and examples so caffeine up and let's go.
- The Spark Cassandra Connector allows reading Cassandra data into Spark RDDs and writing Spark RDDs back to Cassandra tables.
- When reading, it partitions RDDs by Cassandra token ranges to co-locate partitions with node data. When writing, it batches writes by partition key to minimize requests.
- This allows efficient distributed processing of Cassandra data using Spark's parallelism while minimizing network usage through co-location of data and tasks.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise (Patrick McFadin)
Wait! Back away from the Cassandra 2ndary index. It’s ok for some use cases, but it’s not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can’t model that in C*, even after watching all of Patrick McFadin's videos. What do I do?” The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart’s content. Take our hand. We will show you how.
Apache Cassandra & Apache Spark for time series data (Patrick McFadin)
Apache Cassandra is a distributed database that stores time series data in a partitioned and ordered format. Apache Spark can efficiently query this Cassandra data using Resilient Distributed Datasets (RDDs) and perform analytics like aggregations. For example, weather station data stored sequentially in Cassandra by time can be aggregated into daily high and low temperatures with Spark and written back to a roll-up Cassandra table.
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016 (StampedeCon)
Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any other similar analysis, you will need to analyze large sequences of measurements over time. And what better way to do this than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and then store the results in Apache Cassandra.
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (DataStax Academy)
Wait! Back away from the Cassandra 2ndary index. It’s ok for some use cases, but it’s not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can’t model that in C*, even after watching all of Patrick McFadin's videos. What do I do?” The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart’s content. Take our hand. We will show you how.
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - … (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
A Tale of Two APIs: Using Spark Streaming In Production (Lightbend)
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address Fast Data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
Cassandra allows neither joins nor aggregates and drastically limits your ability to query your data, in order to deliver linear scalability in a masterless architecture. The tool of choice for running analytics over your Cassandra tables is Spark, but Spark complicates operations that are simple in SQL. SparkSQL brings SQL syntax back to Spark, and we will see how to use it from Scala, Java, and Python to work with Cassandra tables and get joins and aggregates back (among other things).
Introduction to Spark SQL and basic expressions.
For the demo files please go to https://github.com/bryanyang0528/SparkTutorial/tree/cdh5.5
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala (Helena Edelson)
Regardless of the meaning we are searching for over our vast amounts of data, whether we are in science, finance, technology, energy, health care…, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data with real time streaming data for predictive modeling for lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient Stream Computation, Composable Data Pipelines, Data Locality, Cassandra data model and low latency, Kafka producers and HTTP endpoints as akka actors...
Spark + Cassandra = Real Time Analytics on Operational Data (Victor Coustenoble)
This document discusses using Apache Spark and Cassandra together for real-time analytics on transactional data. It provides an overview of Cassandra and how it can be used for operational applications like recommendations, fraud detection, and messaging. It then discusses how the Spark Cassandra Connector allows reading and writing Cassandra data from Spark, enabling real-time analytics on streaming and batch data using Spark SQL, MLlib, and Spark Streaming. It also covers some DataStax Enterprise features for high availability and integration of Spark and Cassandra.
Cassandra Java APIs Old and New – A Comparison (shsedghi)
The document compares old Java APIs for Cassandra like Thrift, Hector and JDBC to the new DataStax Java driver. It provides an overview of each API, including how they interact with Cassandra (e.g. using Thrift), examples of basic operations like reading rows, and references for more information. It also briefly introduces Cassandra's data model and the binary protocol which the new driver utilizes.
This document discusses using Apache Spark to perform analytics on Cassandra data. It provides an overview of Spark and how it can be used to query and aggregate Cassandra data through transformations and actions on resilient distributed datasets (RDDs). It also describes how to use the Spark Cassandra connector to load data from Cassandra into Spark and write data from Spark back to Cassandra.
Building a High-Performance Database with Scala, Akka, and Spark (Evan Chan)
Here is my talk at Scala by the Bay 2016, Building a High-Performance Database with Scala, Akka, and Spark. Covers integration of Akka and Spark, when to use actors and futures, back pressure, reactive monitoring with Kamon, and more.
5 Ways to Use Spark to Enrich your Cassandra Environment (Jim Hatcher)
Apache Cassandra is a powerful system for supporting large-scale, low-latency data systems, but it has some tradeoffs. Apache Spark can help fill those gaps, and this presentation will show you how.
2. Russell, ostensibly a software engineer
• Did a Ph.D in bioinformatics at some point
• Written a great deal of automation and testing framework code
• Now develops for Datastax on the Analytics Team
• Focuses a lot on the Datastax OSS Spark Cassandra Connector
3. Datastax Spark Cassandra Connector
Let Spark Interact with your Cassandra Data!
https://github.com/datastax/spark-cassandra-connector
http://spark-packages.org/package/datastax/spark-cassandra-connector
http://spark-packages.org/package/TargetHolding/pyspark-cassandra
Compatible with Spark 1.6 + Cassandra 3.0
• Bulk writing to Cassandra
• Distributed full table scans
• Optimized direct joins with Cassandra
• Secondary index pushdown
• Connection and prepared statement pools
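As a minimal sketch of the two core calls (the keyspace ks and a table kv with schema (k text PRIMARY KEY, v int) are hypothetical; assumes a spark-shell launched with the connector package):
import com.datastax.spark.connector._
// Read a full table as an RDD of CassandraRow
val rdd = sc.cassandraTable("ks", "kv")
// Bulk-write an RDD; rows are grouped into batches by partition key
sc.parallelize(Seq(("key1", 1), ("key2", 2)))
  .saveToCassandra("ks", "kv", SomeColumns("k", "v"))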
4. Cassandra is essentially a hybrid between a key-value and a column-oriented (or tabular) database management system. Its data model is a partitioned row store with tunable consistency*
*https://en.wikipedia.org/wiki/Apache_Cassandra
Let's break that down:
1. What is a C* Partition and Row
2. How does C* Place Partitions
5. CQL looks a lot like SQL
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
6. INSERTS look almost Identical
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
INSERT INTO tracker (vehicle_id, ts, x, y) VALUES (1, 0, 0, 1)
7. Cassandra Data is stored in Partitions
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
[Diagram: partition vehicle_id = 1 holding the row (ts: 0, x: 0, y: 1)]
INSERT INTO tracker (vehicle_id, ts, x, y) VALUES (1, 0, 0, 1)
11. Within a Partition There is Ordering Based on the Clustering Keys
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
[Diagram: partition vehicle_id = 1, rows ordered by clustering key ts. Rows: (ts 0: x 0, y 1), (ts 1: x -1, y 0), (ts 2: x 0, y -1), (ts 3: x 1, y 0)]
12. Slices within a Partition are Very Easy
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
[Diagram: the same partition; a contiguous range of ts rows, ordered by clustering key, can be read as one slice]
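With the connector, such a slice is expressed as a where pushdown; a minimal sketch, assuming keyspace ks, the tracker table above, and a hypothetical vehicle id:
import com.datastax.spark.connector._
val vid = java.util.UUID.randomUUID()   // hypothetical vehicle id
// Fix the partition key, range over the clustering key: Cassandra serves this as one slice
sc.cassandraTable("ks", "tracker")
  .where("vehicle_id = ? AND ts >= ? AND ts < ?", vid, new java.util.Date(0L), new java.util.Date(10000L))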
13. Cassandra is a Distributed Fault-Tolerant Database
[Diagram: three nodes: San Jose, Oakland, San Francisco]
14–15. Data is located on a Token Range
[Diagram: token ring with boundaries 0, 600, and 1200 across the San Jose, Oakland, and San Francisco nodes]
16–18. The Partition Key of a Row is Hashed to Determine its Token
[Diagram: token ring (0, 600, 1200) across San Jose, Oakland, San Francisco; the new row lands at its hashed position]
INSERT INTO tracker (vehicle_id, ts, x, y) VALUES (1, 0, 0, 1)
Hash(1) = 1000
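The 0–1200 ring is a toy illustration; a real cluster uses Murmur3 tokens. As a hedged sketch, you can ask Cassandra itself for a row's token (the ks.tracker table and the vid key are assumptions):
import com.datastax.spark.connector.cql.CassandraConnector
val vid = java.util.UUID.randomUUID()   // hypothetical vehicle id
val token = CassandraConnector(sc.getConf).withSessionDo { session =>
  // token() returns the Murmur3 token this partition key hashes to
  session.execute("SELECT token(vehicle_id) FROM ks.tracker WHERE vehicle_id = ?", vid)
    .one().getLong(0)
}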
20. How The Spark Cassandra Connector Reads
[Diagram: the same token ring across San Jose, Oakland, San Francisco, now read by Spark]
21. Spark Partitions are Built From the Token Range
[Diagram: token sub-ranges such as 500–600 and 600–700 each become a Spark partition]
22–23. Each Partition Can be Drawn Locally from at Least One Node
[Diagram: each token-range partition is read by a Spark Executor co-located with a replica node (San Jose, Oakland, San Francisco)]
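The size of those Spark partitions is tunable; a hedged sketch using the connector 1.6 input-split property (host and value are illustrative):
import org.apache.spark.SparkConf
// Each Spark partition targets roughly this much Cassandra data;
// smaller values produce more, smaller partitions
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder contact point
  .set("spark.cassandra.input.split.size_in_mb", "64")   // default is 64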
25. Data is Retrieved using the Datastax Java Driver
[Diagram: Spark Executor talking to Cassandra]
26. A Connection is Established
[Diagram: the executor assigned token range 500–600 opens a session to Cassandra]
27. A Query is Prepared with Token Bounds
[Diagram: Spark Executor and Cassandra; the executor holds token range 500–600]
SELECT * FROM TABLE WHERE
Token(PK) > StartToken AND
Token(PK) < EndToken
28. The Spark Partition's Bounds are Placed in the Query
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
29. Data is Paged a Number of Rows at a Time
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
30–33. Data is Paged
[Animation frames: successive pages of rows for the same token-bounded query stream from Cassandra to the Spark Executor]
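How many rows come back per page is controlled by the fetch size; a hedged sketch using the connector 1.6 property name (value illustrative):
import org.apache.spark.SparkConf
// Rows fetched per round trip while paging a token-bounded query
val conf = new SparkConf()
  .set("spark.cassandra.input.fetch.size_in_rows", "1000")   // default is 1000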
34. How we write to Cassandra
• Data is written via the Datastax Java Driver
• Data is grouped based on Partition Key (configurable; see the sketch below)
• Batches are written
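A hedged sketch of the knobs behind those bullets, using connector 1.6 property names (values illustrative):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.cassandra.output.batch.grouping.key", "partition")  // batch rows sharing a partition key
  .set("spark.cassandra.output.batch.size.bytes", "1024")         // flush a batch once it reaches this size
  .set("spark.cassandra.output.concurrent.writes", "5")           // max in-flight batches per task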
35. Data is Written using the Datastax Java Driver
[Diagram: Spark Executor writing to Cassandra]
39–42. Once a Batch is Full, or There Are Too Many Batches, the Largest Batch is Executed
[Animation frames: batches accumulate on the Spark Executor and the largest is flushed to Cassandra]
43. Most Common Features
• RDD APIs
• cassandraTable
• saveToCassandra
• repartitionByCassandraReplica
• joinWithCassandraTable
• DF API
• Datasource
44. Full Table Scans, Making an RDD out of a Table
import com.datastax.spark.connector._
sc.cassandraTable(KeyspaceName, TableName)
import com.datastax.spark.connector._
sc.cassandraTable[MyClass](KeyspaceName, TableName)
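For the typed scan, MyClass (hypothetical, as on the slide) just needs fields the connector can map to columns; for example, for the tracker table:
// Hypothetical row class; camelCase fields map to snake_case columns (vehicleId -> vehicle_id)
case class MyClass(vehicleId: java.util.UUID, ts: java.util.Date, x: Double, y: Double)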
45. Pushing Down CQL to Cassandra
import com.datastax.spark.connector._
sc.cassandraTable[MyClass](KeyspaceName, TableName)
  .select("vehicle_id")
  .where("ts > 10")
SELECT "vehicle_id" FROM TABLE
WHERE
Token(PK) > 500 AND
Token(PK) < 600 AND
ts > 10
47. But our Data isn't Colocated
import com.datastax.spark.connector._
rdd.joinWithCassandraTable("keyspace", "table")
[Diagram: an arbitrary RDD spread across executors on San Jose, Oakland, San Francisco, away from its matching Cassandra rows]
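A fuller hedged sketch: the left-hand RDD only needs to carry the partition key (here assuming a table whose partition key is a single int column):
import com.datastax.spark.connector._
// Each Tuple1 is a partition key; the connector fetches only the matching rows,
// instead of scanning the whole table
val ids  = sc.parallelize(1 to 100).map(Tuple1(_))
val rows = ids.joinWithCassandraTable("keyspace", "table")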
48. RBCR Reshuffles our Data so it Will Be Local
rdd
.repartitionByCassandraReplica("keyspace","table")
.joinWithCassandraTable("keyspace", "table")
[Diagram: the RDD's rows now sit on the executors holding their Cassandra replicas]
49–50. The Connector Provides a Distributed Pool for Prepared Statements and Sessions
import com.datastax.spark.connector.cql.CassandraConnector
val connector = CassandraConnector(sc.getConf)  // built on the driver; serializable
rdd.mapPartitions { it =>
  // The Session and the PreparedStatement come from the connector's
  // per-executor caches, so this is cheap to do in every task
  val ps = connector.withSessionDo(s =>
    s.prepare("INSERT INTO ks.tab (k) VALUES (?)"))  // hypothetical CQL; the slide elides the statement
  it.map(x => connector.withSessionDo(_.executeAsync(ps.bind(x))))  // futures not awaited in this sketch
}
Your Pool, Ready to Be Deployed: Prepared Statement Cache + Session Cache
51. The Connector Supports the DataSources Api
sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "read_test", "table" -> "simple_kv"))
.load
import org.apache.spark.sql.cassandra._
sqlContext
  .read
  .cassandraFormat("simple_kv", "read_test")  // (table, keyspace)
  .load
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
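Writes go through the same datasource; a hedged sketch of the mirror-image call, assuming a DataFrame df whose schema matches the table:
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "read_test", "table" -> "simple_kv"))
  .mode("append")   // write into the existing table
  .save()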
52. The Connector Works with Catalyst to Push Down Predicates to Cassandra
import com.datastax.spark.connector._
df.select("vehicle_id").filter("ts > 10")
SELECT "vehicle_id" FROM TABLE
WHERE
Token(PK) > 500 AND
Token(PK) < 600 AND
ts > 10
53. The Connector Works with Catalyst to Push Down Predicates to Cassandra
import com.datastax.spark.connector._
df.select("vehicle_id").filter("ts > 10")
[Diagram: QueryPlan flows through Catalyst into a PrunedFilteredScan]
Only prunes (projections) and filters (predicates) can be pushed down.
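One hedged way to verify what actually reached Cassandra is Spark's plan output; pushed predicates appear in the physical plan (exact format varies by Spark version):
// Filters that were pushed down show up alongside the Cassandra scan
df.select("vehicle_id").filter("ts > 10").explain(true)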
54. Recent Features
• C* 3.0 Support
• Materialized Views
• SASI Indexes (for pushdown)
• Advanced Spark Partitioner Support
• Increased Performance on JWCT
56. Use C* Partitioning in Spark
• C* Data is Partitioned
• Spark has Partitions and partitioners
• Spark can use a known partitioner to speed up Cogroups (joins)
• How Can we Leverage this?
57. Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
If keyBy is used on a CassandraTableScanRDD and the partition key is included in the key, the RDD will be given a C* Partitioner:
val ks = "doc_example"
val rdd = { sc.cassandraTable[(String, Int)](ks, "users")
.select("name" as "_1", "zipcode" as "_2", "userid")
.keyBy[Tuple1[Int]]("userid")
}
rdd.partitioner
//res3: Option[org.apache.spark.Partitioner] = Some(com.datastax.spark.connector.rdd.partitioner.CassandraPartitioner@94515d3e)
58. Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
Share partitioners between tables for joins on the partition key:
val ks = "doc_example"
val rdd1 = { sc.cassandraTable[(Int, Int, String)](ks, "purchases")
.select("purchaseid" as "_1", "amount" as "_2", "objectid" as "_3", "userid")
.keyBy[Tuple1[Int]]("userid")
}
val rdd2 = { sc.cassandraTable[(String, Int)](ks, "users")
.select("name" as "_1", "zipcode" as "_2", "userid")
.keyBy[Tuple1[Int]]("userid")
}.applyPartitionerFrom(rdd1) // Assigns the partitioner from the first rdd to this one
val joinRDD = rdd1.join(rdd2)
59. Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
Many other uses for this, try it yourself!
• All self joins using the Partition key
• Groups within C* Partitions
• Anything formerly using SpanBy
• Joins with other RDDs
And much more!
60. Enhanced Parallelism with JWCT
• joinWithCassandraTable now has increased concurrency and parallelism!
• 5x speed increases in some cases
• https://datastax-oss.atlassian.net/browse/SPARKC-233
• Thanks Jaroslaw!
61. The Connector wants you!
• OSS Project that loves community involvement
• Bug Reports
• Feature Requests
• Doc Improvements
• Come join us!
Vin Diesel may or may not be a contributor
62. Tons of Videos at Datastax Academy
https://academy.datastax.com/
https://academy.datastax.com/courses/getting-started-apache-spark
64. See you at Cassandra Summit!
Join me and 3500 of your database peers, and take a deep dive
into Apache Cassandra™, the massively scalable NoSQL
database that powers global businesses like Apple, Spotify,
Netflix and Sony.
San Jose Convention Center
September 7-9, 2016
https://cassandrasummit.org/
Build Something Disruptive