The document discusses how the Spark Cassandra Connector works. It explains that the connector uses information about how data is partitioned in Cassandra nodes to generate Spark partitions that correspond to the token ranges in Cassandra. This allows data to be read from Cassandra in parallel across the Spark partitions. The connector also supports automatically pushing down filter predicates to the Cassandra database to reduce the amount of data read.
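As a rough illustration of that behaviour, a read through the connector might look like the sketch below (the keyspace, table and column names are hypothetical, and sc is assumed to be a SparkContext already configured for the connector):
import com.datastax.spark.connector._

// Hypothetical table: sensors.readings(sensor_id int, year int, value double,
//                                      PRIMARY KEY (sensor_id, year))
// Each Spark partition covers a slice of the Cassandra token ring, so the scan runs in parallel per node
val readings = sc.cassandraTable("sensors", "readings")
  .select("sensor_id", "year", "value")   // projection pushed to Cassandra
  .where("year >= ?", 2016)               // clustering-column predicate pushed down as a CQL WHERE clause

println(readings.count())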
This document provides an agenda and overview of Big Data Analytics using Spark and Cassandra. It discusses Cassandra as a distributed database and Spark as a data processing framework. It covers connecting Spark and Cassandra, reading and writing Cassandra tables as Spark RDDs, and using Spark SQL, Spark Streaming, and Spark MLLib with Cassandra data. Key capabilities of each technology are highlighted such as Cassandra's tunable consistency and Spark's fault tolerance through RDD lineage. Examples demonstrate basic operations like filtering, aggregating, and joining Cassandra data with Spark.
Spark/Cassandra Connector: API, Best Practices and Use-Cases - Duyhai Doan
- The document discusses Spark/Cassandra connector API, best practices, and use cases.
- It describes the connector architecture including support for Spark Core, SQL, and Streaming APIs. Data is read from and written to Cassandra tables mapped as RDDs.
- Best practices around data locality, failure handling, and cross-region/cluster operations are covered. Locality is important for performance.
- Use cases include data cleaning, schema migration, and analytics like joins and aggregation. The connector allows processing and analytics on Cassandra data with Spark.
This document provides an agenda for a presentation on Big Data Analytics with Cassandra, Spark, and MLLib. The presentation covers Spark basics, using Spark with Cassandra, Spark Streaming, Spark SQL, and Spark MLLib. It also includes examples of querying and analyzing Cassandra data with Spark and Spark SQL, and machine learning with Spark MLLib.
Lightning fast analytics with Spark and Cassandra - nickmbailey
Spark is a fast and general engine for large-scale data processing. It provides APIs for Java, Scala, and Python that allow users to load data into a distributed cluster as resilient distributed datasets (RDDs) and then perform operations like map, filter, reduce, join and save. The Cassandra Spark driver allows accessing Cassandra tables as RDDs to perform analytics and run Spark SQL queries across Cassandra data. It provides server-side data selection and mapping of rows to Scala case classes or other objects.
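For the row-to-object mapping mentioned here, the connector's typed API looks roughly like the following (the table and case class are hypothetical; by default the connector maps camelCase fields to snake_case columns):
import com.datastax.spark.connector._

// Hypothetical table: shop.users(user_id int PRIMARY KEY, country text)
case class User(userId: Int, country: String)

// Rows are materialised as User instances instead of generic CassandraRow objects
val users = sc.cassandraTable[User]("shop", "users")
  .select("user_id", "country")   // server-side column selection

users.take(5).foreach(println)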
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris - Duyhai Doan
This document provides an overview of Spark and its integration with Cassandra for real-time data processing. It begins with introductions of the speaker and Datastax. It then discusses what Spark and Cassandra are, including their architectures and key characteristics like Spark being fast, easy to use, and supporting multiple languages. The document demonstrates basic Spark code and how RDDs work. It covers the Spark and Cassandra connectors and how they provide locality-aware joins. It also discusses use cases and deployment options. Finally, it considers future improvements like leveraging Solr for local filtering to improve data locality during joins.
Spark + Cassandra = Real Time Analytics on Operational Data - Victor Coustenoble
This document discusses using Apache Spark and Cassandra together for real-time analytics on transactional data. It provides an overview of Cassandra and how it can be used for operational applications like recommendations, fraud detection, and messaging. It then discusses how the Spark Cassandra Connector allows reading and writing Cassandra data from Spark, enabling real-time analytics on streaming and batch data using Spark SQL, MLlib, and Spark Streaming. It also covers some DataStax Enterprise features for high availability and integration of Spark and Cassandra.
Cassandra and Spark: Optimizing for Data Locality - Russell Spitzer
This document discusses how the Spark Cassandra Connector optimizes for data locality when performing analytics on Cassandra data using Spark. It does this by using the partition keys and token ranges to create Spark partitions that correspond to the data distribution across the Cassandra nodes, allowing work to be done locally to each data node without moving data across the network. This improves performance and avoids the costs of data shuffling.
Cassandra and Spark: Optimizing for Data Locality (Russell Spitzer, DataStax) - Spark Summit
This document discusses how the Spark Cassandra Connector optimizes for data locality when performing analytics on Cassandra data using Spark. It does this by using the partition keys and token ranges to create Spark partitions that correspond to the data distribution across the Cassandra nodes, allowing work to be done locally to each data node without moving data across the network. This improves performance and avoids the costs of data shuffling.
Real time data pipeline with spark streaming and cassandra with mesos - Rahul Kumar
This document discusses building real-time data pipelines with Apache Spark Streaming and Cassandra using Mesos. It provides an overview of data management challenges, introduces Cassandra and Spark concepts. It then describes how to use the Spark Cassandra Connector to expose Cassandra tables as Spark RDDs and write back to Cassandra. It recommends designing scalable pipelines by identifying bottlenecks, using efficient data parsing, proper data modeling, and compression.
The document discusses the Datastax Spark Cassandra Connector. It provides an overview of how the connector allows Spark to interact with Cassandra data, including performing full table scans, pushing down filters and projections to Cassandra, distributed joins using Cassandra's partitioning, and writing data back to Cassandra in a distributed way. It also highlights some recent features of the connector like support for Cassandra 3.0, materialized views, and performance improvements from the Java Wildcard Cassandra Tester project.
Spark and Cassandra with the Datastax Spark Cassandra Connector
How it works and how to use it!
Missed Spark Summit but Still want to see some slides?
This slide deck is for you!
Spark streaming can be used for near-real-time data analysis of data streams. It processes data in micro-batches and provides windowing operations. Stateful operations like updateStateByKey allow tracking state across batches. Data can be obtained from sources like Kafka, Flume, HDFS and processed using transformations before being saved to destinations like Cassandra. Fault tolerance is provided by replicating batches, but some data may be lost depending on how receivers collect data.
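A skeletal version of that stateful micro-batch flow could look like this (a sketch only; the socket source, batch interval, checkpoint path and the Cassandra sink table are placeholder choices):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._                // SomeColumns
import com.datastax.spark.connector.streaming._      // saveToCassandra on DStreams

val conf = new SparkConf()
  .setAppName("streaming-sketch")
  .setMaster("local[2]")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val ssc = new StreamingContext(conf, Seconds(5))     // 5-second micro-batches
ssc.checkpoint("/tmp/streaming-checkpoint")          // required for stateful operations

// Placeholder source; Kafka or Flume receivers would be plugged in the same way
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1))

// updateStateByKey keeps a running count per word across batches
val running = counts.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
}

// Hypothetical table: stats.word_counts(word text PRIMARY KEY, count int)
running.saveToCassandra("stats", "word_counts", SomeColumns("word", "count"))

ssc.start()
ssc.awaitTermination()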
Spark SQL for Java/Scala Developers. Workshop by Aaron Merlob, Galvanize. To hear about future conferences go to http://dataengconf.com
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
The document discusses Resilient Distributed Datasets (RDDs) in Spark. It explains that RDDs hold references to partition objects containing subsets of data across a cluster. When a transformation like map is applied to an RDD, a new RDD is created to store the operation and maintain a dependency on the original RDD. This allows chained transformations to be lazily executed together in jobs scheduled by Spark.
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...) - DataStax
Worried that you aren't taking full advantage of your Spark and Cassandra integration? Well worry no more! In this talk we'll take a deep dive into all of the available configuration options and see how they affect Cassandra and Spark performance. Concerned about throughput? Learn to adjust batching parameters and gain a boost in speed. Always running out of memory? We'll take a look at the various causes of OOM errors and how we can circumvent them. Want to take advantage of Cassandra's natural partitioning in Spark? Find out about the recent developments that let you perform shuffle-less joins on Cassandra-partitioned data! Come with your questions and problems and leave with answers and solutions!
About the Speaker
Russell Spitzer Software Engineer, DataStax
Russell Spitzer received a Ph.D in Bio-Informatics before finding his deep passion for distributed software. He found the perfect outlet for this passion at DataStax where he began on the Automation and Test Engineering team. He recently moved from finding bugs to making bugs as part of the Analytics team where he works on integration between Cassandra and Spark as well as other tools.
Spark is a general engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs) which allow in-memory caching for fault tolerance and act like familiar Scala collections for distributed computation across clusters. RDDs provide a programming model with transformations like map and reduce and actions to compute results. Spark also supports streaming, SQL, machine learning, and graph processing workloads.
Spark Cassandra Connector: Past, Present, and Future - Russell Spitzer
The Spark Cassandra Connector allows integration between Spark and Cassandra for distributed analytics. Previously, integrating Hadoop and Cassandra required complex code and configuration. The connector maps Cassandra data distributed across nodes based on token ranges to Spark partitions, enabling analytics on large Cassandra datasets using Spark's APIs. This provides an easier method for tasks like generating reports, analytics, and ETL compared to previous options.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... - Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide some performance tuning and testing tips for your Spark applications
Using Spark to Load Oracle Data into Cassandra - Jim Hatcher
The document discusses lessons learned from using Spark to load data from Oracle into Cassandra. It describes problems encountered with Spark SQL handling Oracle NUMBER and timeuuid fields incorrectly. It also discusses issues generating IDs across RDDs and limitations on returning RDDs of tuples over 22 items. The resources section provides references for learning more about Spark, Scala, and using Spark with Cassandra.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library - Ilya Ganelin
In this talk I talk about my recent experience working with Spark Data Frames and the Spark TimeSeries library. For data frames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like intricacies of creating data frames, adding or manipulating individual columns, and doing quick and dirty analytics. For the time series library, I dive into the kind of use cases it supports and why it’s actually super useful.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Time series with Apache Cassandra - Long version - Patrick McFadin
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. This talk will give you an overview of the many ways you can be successful. We will discuss how the storage model of Cassandra is well suited for this pattern and go over examples of how best to build data models.
DataStax Enterprise clients, such as CQLSH or Hadoop and Spark based applications, can be precisely configured to achieve a desired behaviour. For a basic use case, we just run a dedicated DSE command and do not care about how all of those pieces are setup to work together, leveraging the goodness of DSE. However, understanding where and what we need to modify to achieve the expected change in the configuration is essential for using DSE efficiently. In this presentation we go through the basic and advanced settings for client applications, including security features and limitations or DSE patches introduced into integrated Spark. We show the new tools which significantly simplify the configuration of external DSE installations which are used just for accessing DSE cluster in client mode. Finally, we conclude with hints for configuring Spark driver from scratch in order to use it in a web application, when running the program through DSE scripts is not feasible.
About the Speaker
Jacek Lewandowski Software engineer, DataStax
Jacek Lewandowski is a software engineer with 13 years of experience. Initially a full-stack developer, he worked as a consultant and trainer for different companies. In 2011 he started using Cassandra as an alternative to SQL in various applications. He is passionate about distributed algorithms, graphs and functional programming in Scala. He is also a part-time assistant professor who popularizes the Cassandra database among students and researchers, and he has been working on the DataStax Analytics team for over 2 years.
High Throughput Analytics with Cassandra & Azure - DataStax Academy
This document summarizes Cassandra and Azure cloud services for high throughput analytics. It discusses:
1) Using Cassandra and Azure services to store and analyze 200 million data points per hour from various endpoints in near real-time.
2) Cassandra's ability to horizontally scale storage and queries by adding nodes with no downtime.
3) An architecture using Cassandra, Azure VMs, web/worker roles, and SQL database to ingest and analyze streaming IoT data.
How to Talk about APIs (APIDays Paris 2016) - Andrew Seward
One of the more challenging aspects of working with APIs is that outside of your own little tech bubble, nobody actually knows what an API is - despite being hopelessly dependent on APIs for their day to day lives. So how do you talk about APIs to the masses of people who have no idea what they are? You're going to have to do it - you'll need to talk to your non-technical colleagues about it, many of whom you're entirely dependent upon to improve your API or get it out to the masses; there'll be potential customers out there for whom your API is the exact solution to the problem they're having and there's the people you meet who ask you what it is you do.
In this talk we'll discuss how to overcome this huge challenge for all of us in the business of APIs, how to establish not just a clear ubiquitous language when talking about our APIs but clarity and consistency of content - making sure your developers, salespeople, support and marketing are all talking about your APIs in a way that is accessible, meaningful and useful to all concerned, and how that consistency of understanding can be the difference between the success or failure of your API.
The document discusses strategies for creating value with APIs, focusing on their use in powering Africa and the Middle East's digital revolution. It describes how APIs can help companies go mobile across the region by providing a single platform and multiple open APIs to access services like SMS, payments, and customer data across 20 countries and 116 million customers. It also discusses challenges in implementing API strategies in the region, like the need for developer-oriented and commercially-focused approaches as well as cultural changes. Finally, it outlines how Dun & Bradstreet is using APIs and scalable cloud architecture to provide consistent, accurate business data from over 30,000 sources to help customers make informed decisions.
This document provides an introduction to Apache Spark presented by Vincent Poncet of IBM. It discusses how Spark is a fast, general-purpose cluster computing system for large-scale data processing. It is faster than MapReduce, supports a wide range of workloads, and is easier to use with APIs in Scala, Python, and Java. The document also provides an overview of Spark's execution model and its core API called resilient distributed datasets (RDDs).
This document provides an introduction and overview of Apache Spark. It discusses why Spark is useful, describes some Spark basics including Resilient Distributed Datasets (RDDs) and DataFrames, and gives a quick tour of Spark Core, SQL, and Streaming functionality. It also provides some tips for using Spark and describes how to set up Spark locally. The presenter is introduced as a data engineer who uses Spark to load data from Kafka streams into Redshift and Cassandra. Ways to learn more about Spark are suggested at the end.
Spark Summit East 2015 Advanced Devops Student Slides - Databricks
This document provides an agenda for an advanced Spark class covering topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, shuffle operations, and Spark Streaming. The class will be held in March 2015 and include lectures, labs, and Q&A sessions. It notes that some slides may be skipped and asks attendees to keep Q&A low during the class, with a dedicated Q&A period at the end.
In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications.
These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data intensive applications such as ETL, interactive SQL, and machine learning.
This document provides an overview of Apache Spark and machine learning using Spark. It introduces the speaker and objectives. It then covers Spark concepts including its architecture, RDDs, transformations and actions. It demonstrates working with RDDs and DataFrames. Finally, it discusses machine learning libraries available in Spark like MLlib and how Spark can be used for supervised machine learning tasks.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computing framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
Apache spark sneha challa - google pittsburgh - aug 25th - Sneha Challa
The document is a presentation about Apache Spark given on August 25th, 2015 in Pittsburgh by Sneha Challa. It introduces Spark as a fast and general cluster computing engine for large-scale data processing. It discusses Spark's Resilient Distributed Datasets (RDDs) and transformations/actions. It provides examples of Spark APIs like map, reduce, and explains running Spark on standalone, Mesos, YARN, or EC2 clusters. It also covers Spark libraries like MLlib and running machine learning algorithms like k-means clustering and logistic regression.
5 Ways to Use Spark to Enrich your Cassandra Environment - Jim Hatcher
Apache Cassandra is a powerful system for supporting large-scale, low-latency data systems, but it has some tradeoffs. Apache Spark can help fill those gaps, and this presentation will show you how.
Spark is an open-source cluster computing framework. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark supports batch, streaming, and interactive computations in a unified framework. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across a cluster for parallel processing. RDDs support transformations like map and filter that return new RDDs and actions that return values to the driver program.
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji - Data Con LA
Abstract:- Of all the developers delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as best practices, outline its performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc... - Lucidworks
This document provides an overview of using Apache Spark with Apache Solr. It discusses using Solr as a data source for Spark SQL, reading data from Solr into Spark RDDs, querying Solr from the Spark shell, indexing data from Spark Streaming into Solr, and an example of using Solr as a sink for a Spark Streaming application that processes tweets in real-time.
Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
This document provides an overview of Apache Spark, an open-source unified analytics engine for large-scale data processing. It discusses Spark's core APIs including RDDs and transformations/actions. It also covers Spark SQL, Spark Streaming, MLlib, and GraphX. Spark provides a fast and general engine for big data processing, with explicit operations for streaming, SQL, machine learning, and graph processing. The document includes installation instructions and examples of using various Spark components.
These are my slides from the ebiznext workshop: Introduction to Apache Spark.
Please download code sources from https://github.com/MohamedHedi/SparkSamples
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... - Databricks
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
Apache Spark is an open-source distributed processing engine that is up to 100 times faster than Hadoop for processing data stored in memory and 10 times faster for data stored on disk. It provides high-level APIs in Java, Scala, Python and SQL and supports batch processing, streaming, and machine learning. Spark runs on Hadoop, Mesos, Kubernetes or standalone and can access diverse data sources using its core abstraction called resilient distributed datasets (RDDs).
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
This document discusses Apache Spark, an open-source cluster computing framework. It provides an overview of Spark, including its main concepts like RDDs (Resilient Distributed Datasets) and transformations. Spark is presented as a faster alternative to Hadoop for iterative jobs and machine learning through its ability to keep data in-memory. Example code is shown for Spark's programming model in Scala and Python. The document concludes that Spark offers a rich API to make data analytics fast, achieving speedups of up to 100x over Hadoop in real applications.
4. 4
§ Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing
– Fast
• Leverages aggressively cached in-memory distributed computing and dedicated Executor processes that stay alive even when no jobs are running
• Faster than MapReduce
– General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex analytics
– Flexible and easier to use than MapReduce
• Spark is written in Scala, an object-oriented, functional programming language
• APIs in Scala, Java, Python and R
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or in the cloud
[Figure: Logistic regression in Hadoop and Spark]
[Figure: Spark Stack]
WordCount:
val wordCounts = sc.textFile("README.md").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
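For context, a self-contained version of that WordCount might look like the following (a minimal sketch assuming a local Spark 1.x setup; the README.md path is the slide's own placeholder):
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local master for experimentation; on a cluster this comes from spark-submit
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val wordCounts = sc.textFile("README.md")   // RDD[String], one element per line
      .flatMap(line => line.split(" "))         // RDD[String], one element per word
      .map(word => (word, 1))                   // RDD[(String, Int)]
      .reduceByKey((a, b) => a + b)             // sum the 1s per distinct word

    wordCounts.take(10).foreach(println)        // action: triggers the computation
    sc.stop()
  }
}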
5. 5
Brief History of Spark
§ 2002 – MapReduce @ Google
§ 2004 – MapReduce paper
§ 2006 – Hadoop @ Yahoo
§ 2008 – Hadoop Summit
§ 2010 – Spark paper
§ 2013 – Spark 0.7 Apache Incubator
§ 2014 – Apache Spark top-level
§ 2014 – 1.2.0 release in December
§ 2015 – 1.3.0 release in March
§ 2015 – 1.4.0 release in June
§ 2015 – 1.5.0 release in September
§ 2016 – 1.6.0 release in January
§ Most active project in Hadoop
ecosystem
§ One of top 3 most active Apache
projects
§ Databricks founded by the creators
of Spark from UC Berkeley’s
AMPLab
Activity for 6 months in 2014
(from Matei Zaharia – 2014 Spark Summit)
DataBricks
In June 2015, code base was about 400K lines
8. 8
Why Spark on Cassandra?
§ Analytics on transactional data and operational applications
§ Data model independent queries
§ Cross-table operations (JOIN, UNION, etc.)
§ Complex analytics (e.g. machine learning)
§ Data transformation, aggregation, etc.
§ Stream processing
9. 9
DataStax Involvement in Spark
§ Apache Spark has been embedded in DataStax Enterprise since DSE 4.5, released in July 2014
§ DataStax provides enterprise support for the embedded Apache Spark
§ DataStax released an open source Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
Connector | Spark    | Cassandra | Cassandra Java Driver | DSE
1.5       | 1.5      | 2.1.5+    | 2.2                   | –
1.4       | 1.4      | 2.1.5+    | 2.1                   | 4.8
1.3       | 1.3      | 2.1.5+    | 2.1                   | –
1.2       | 1.2      | 2.1, 2.0  | 2.1                   | 4.7
1.1       | 1.1, 1.0 | 2.1, 2.0  | 2.1                   | 4.6
1.0       | 1.0, 0.9 | 2.0       | 2.0                   | 4.5
11. 11
§ An RDD is a distributed collection of Scala/Python/Java objects of the same type:
– RDD of strings
– RDD of integers
– RDD of (key, value) pairs
– RDD of user-defined Scala/Java/Python class objects
§ An RDD is physically distributed across the cluster, but manipulated
as one logical entity:
– Spark will “distribute” any required processing to all partitions where the RDD
exists and perform necessary redistributions and aggregations as well.
– Example: Consider a distributed RDD “Names” made of names
Resilient Distributed Dataset (RDD): definition
Names RDD across three partitions:
Partition 1: Vincent, Victor, Pascal
Partition 2: Steve, Dani, Nate
Partition 3: Matt, Piotr, Alice
12. 12
§ Suppose we want to know the number of names in the RDD “Names”
§ The user simply requests: Names.count()
– Spark will “distribute” the count processing to all partitions so as to obtain:
• Partition 1: Vincent (1), Victor (1), Pascal (1) → 3
• Partition 2: Steve (1), Dani (1), Nate (1) → 3
• Partition 3: Matt (1), Piotr (1), Alice (1) → 3
– Local counts are subsequently aggregated: 3 + 3 + 3 = 9
§ To look up the first element in the RDD: Names.first()
§ To display all elements of the RDD: Names.collect() (careful with this on large RDDs; a runnable sketch of these calls follows the partition diagram below)
Resilient Distributed Dataset: definition
Names RDD (same partitioning as above):
Partition 1: Vincent, Victor, Pascal
Partition 2: Steve, Dani, Nate
Partition 3: Matt, Piotr, Alice
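As a minimal, hedged sketch of the calls above (assuming a spark-shell session with sc available; the data and partition count mirror the diagram):

// Build the Names RDD with 3 partitions (illustrative data from the diagram)
val names = sc.parallelize(
  Seq("Vincent", "Victor", "Pascal", "Steve", "Dani", "Nate", "Matt", "Piotr", "Alice"),
  numSlices = 3)

names.count()    // 9 – counted per partition, then aggregated at the driver
names.first()    // "Vincent"
names.collect()  // Array of all 9 names – avoid on large RDDs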
13. 13
Resilient Distributed Datasets: Creation and Manipulation
§ Three methods for creation
– Distributing a collection of objects from the driver program (using the parallelize method of the SparkContext)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
– Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
– Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x=> x+1)
§ Dataset from any storage
– CFS, HDFS, Cassandra, Amazon S3
§ File types supported
– Text files, SequenceFiles, Parquet, JSON
– Hadoop InputFormat
14. 14
Resilient Distributed Datasets: Properties
§ Immutable
§ Two types of operations
– Transformations ~ DDL (Create View V2 as…)
• val rddNumbers = sc.parallelize(1 to 10): numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map(x => x + 1): numbers from 2 to 11
• The LINEAGE describing how to obtain rddNumbers2 from rddNumbers is recorded
• It’s a Directed Acyclic Graph (DAG)
• No actual data processing takes place → lazy evaluation
– Actions ~ DML (Select * From V2…)
• rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
• Performs the recorded transformations and the action
• Returns a value (or writes to a file)
§ Fault tolerance
– If data in memory is lost it will be recreated from lineage
§ Caching, persistence (memory, spilling, disk) and check-pointing
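A minimal sketch of these properties in spark-shell (lazy transformations, lineage, caching; assumes sc is available):

// Transformations are only recorded – nothing runs yet
val rddNumbers  = sc.parallelize(1 to 10)
val rddNumbers2 = rddNumbers.map(x => x + 1)

// Cache the derived RDD so later actions reuse it instead of recomputing the lineage
rddNumbers2.cache()

// An action triggers the recorded DAG and returns a value to the driver
rddNumbers2.collect()                // Array(2, 3, ..., 11)
println(rddNumbers2.toDebugString)   // shows the recorded lineage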
15. 15
RDD Transformations
§ Transformations are lazily evaluated
§ Each transformation returns a new, transformed RDD
§ Pair RDD (K,V) functions for MapReduce style transformations
Transformation | Meaning
map(func) | Return a new dataset formed by passing each element of the source through the function func.
filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item.
join(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
sortByKey([ascending], [numTasks]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
combineByKey[C](createCombiner, mergeValue, mergeCombiners) | Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into an RDD[(K, C)] for a “combined type” C. createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C
Full documentation at http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.rdd.RDD
and http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
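A short, hedged sketch exercising a few of these transformations in spark-shell (the data is illustrative):

val words  = sc.parallelize(Seq("spark", "cassandra", "spark", "scala"))
val pairs  = words.map(w => (w, 1))               // pair RDD of (K, V)
val counts = pairs.reduceByKey(_ + _)             // (spark,2), (cassandra,1), (scala,1)

val langs = sc.parallelize(Seq(("spark", "Scala"), ("cassandra", "Java")))
counts.join(langs).collect()                      // contains (spark,(2,Scala)) and (cassandra,(1,Java))
counts.sortByKey().collect()                      // pairs sorted by word
counts.flatMap { case (w, n) => Seq.fill(n)(w) }.collect()   // repeats each word n times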
17. 17
RDD Actions
§ Actions return a value to the driver program or save an RDD to storage
Action | Meaning
collect() | Return all the elements of the dataset as an array at the driver program. Usually useful after a filter or another operation that returns a sufficiently small subset of the data.
count() | Return the number of elements in the dataset.
first() | Return the first element of the dataset.
take(n) | Return an array with the first n elements of the dataset.
foreach(func) | Run the function func on each element of the dataset.
saveAsTextFile(path) | Save the RDD as a text file at the given path.
Full documentation at http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.rdd.RDD
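And a hedged sketch of the remaining actions (the HDFS path is illustrative):

val nums = sc.parallelize(1 to 100)
nums.count()                                 // 100
nums.take(5)                                 // Array(1, 2, 3, 4, 5)
nums.foreach(n => println(n))                // runs on the executors, so output appears in executor logs
nums.saveAsTextFile("hdfs:/sparkdata/nums")  // writes one part-* file per partition (illustrative path)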
18. 18
RDD Persistence
§ Each node stores in memory any partitions of the RDD that it computes
§ Reuses them in other actions on that dataset (or datasets derived from it)
– Future actions are much faster (often by more than 10x)
§ Two methods for RDD persistence: persist() and cache()
Storage Level | Meaning
MEMORY_ONLY | Store as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed as needed. This is the default; the cache() method uses this level.
MEMORY_AND_DISK | Same, except partitions that don’t fit in memory are stored on disk and read from memory or disk when needed.
MEMORY_ONLY_SER | Store as serialized Java objects (one byte array per partition). Space efficient, but more CPU intensive to read.
MEMORY_AND_DISK_SER | Similar to MEMORY_AND_DISK, but stored as serialized objects.
DISK_ONLY | Store only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental) | Store the RDD in serialized format in Tachyon.
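A hedged sketch of choosing a persistence level explicitly (reusing the sparkQuotes.txt file from later examples):

import org.apache.spark.storage.StorageLevel

val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
quotes.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)
quotes.count()                                    // first action materialises and caches the partitions

val parsed = quotes.map(_.split(" "))
parsed.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilled to disk if needed
parsed.count()
parsed.unpersist()                                // drop from the cache when no longer needed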
20. 20
Introduction to Scala
What is Scala?
Functions in Scala
Operating on collections in Scala
Scala Crash Course, Holden Karau, DataBricks
21. 21
About Scala
High-level language for the JVM
● Object oriented + functional programming
Statically typed
● Comparable in speed to Java
● Type inference saves us from having to write
explicit types most of the time
Interoperates with Java
● Can use any Java class (inherit from, etc.)
● Can be called from Java code
Scala Crash Course, Holden Karau, DataBricks
22. 22
Quick Tour of Scala
Declaring variables:
var x: Int = 7
var x = 7      // type inferred
val y = "hi"   // read-only

Java equivalent:
int x = 7;
final String y = "hi";

Functions:
def square(x: Int): Int = x * x
def square(x: Int): Int = {
  x * x
}
def announce(text: String) = {
  println(text)
}

Java equivalent:
int square(int x) {
  return x * x;
}
void announce(String text) {
  System.out.println(text);
}
Scala Crash Course, Holden Karau, DataBricks
23. 23
Scala functions (closures)
(x: Int) => x + 2 // full version
x => x + 2 //type inferred
Scala Crash Course, Holden Karau, DataBricks
24. 24
Scala functions (closures)
(x: Int) => x + 2 // full version
x => x + 2 // type inferred
_ + 2 // placeholder syntax (each argument must be used
exactly once)
Scala Crash Course, Holden Karau, DataBricks
25. 25
Scala functions (closures)
(x: Int) => x + 2 // full version
x => x + 2 // type inferred
_ + 2 // placeholder syntax (each argument must be used
exactly once)
x => { // body is a block of code
val numberToAdd = 2
x + numberToAdd
}
Scala Crash Course, Holden Karau, DataBricks
26. 26
Scala functions (closures)
(x: Int) => x + 2 // full version
x => x + 2 // type inferred
_ + 2 // placeholder syntax (each argument must be used
exactly once)
x => { // body is a block of code
val numberToAdd = 2
x + numberToAdd
}
// Regular functions
def addTwo(x:Int): Int = x + 2
Scala Crash Course, Holden Karau, DataBricks
27. 27
Quick Tour of Scala Part 2
Processing collections with functional programming
val list = List(1, 2, 3)
list.foreach(x => println(x))   // prints 1, 2, 3
list.foreach(println)           // same
list.map(x => x + 2)            // returns a new List(3, 4, 5)
list.map(_ + 2)                 // same
list.filter(x => x % 2 == 1)    // returns a new List(1, 3)
list.filter(_ % 2 == 1)         // same
list.reduce((x, y) => x + y)    // => 6
list.reduce(_ + _)              // same
All of these leave the list unchanged as it is immutable.
Scala Crash Course, Holden Karau, DataBricks
28. 28
Functional methods on collections
Method on Seq[T] Explanation
map(f: T => U): Seq[U] Each element is result of f
flatMap(f: T => Seq[U]): Seq[U] One to many map
filter(f: T => Boolean): Seq[T] Keep elements passing f
exists(f: T => Boolean): Boolean True if at least one element passes f
forall(f: T => Boolean): Boolean True if all elements pass f
reduce(f: (T, T) => T): T Merge elements using f
groupBy(f: T => K): Map[K, List[T]] Group elements by the key f
sortBy(f: T => K): Seq[T] Sort elements by the key f
…..
There are a lot of methods on Scala collections; see the Seq API documentation at
http://www.scala-lang.org/api/2.10.4/index.html#scala.collection.Seq
Scala Crash Course, Holden Karau, DataBricks
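A small, hedged sketch of some of these Seq methods in the Scala REPL (data is illustrative):

val names = Seq("Vincent", "Steve", "Alice", "Matt")
names.exists(_.startsWith("A"))   // true
names.forall(_.nonEmpty)          // true
names.groupBy(_.head)             // Map(V -> List(Vincent), S -> List(Steve), A -> List(Alice), M -> List(Matt))
names.sortBy(_.length)            // List(Matt, Steve, Alice, Vincent)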
29. 29
Code Execution (1)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val vpnQuotes = quotes.filter(_.startsWith(”VPN"))
val vpnSpark = vpnQuotes.map(_.split(" ")).map(x => x(1))
// Action
vpnSpark.filter(_.contains("Spark")).count()
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt
§ ‘spark-shell’ provides Spark context as ‘sc’
30. 30
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val vpnQuotes = quotes.filter(_.startsWith(”VPN"))
val vpnSpark = vpnQuotes.map(_.split(" ")).map(x => x(1))
// Action
vpnSpark.filter(_.contains("Spark")).count()
Code Execution (2)
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
31. 31
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith(”VPN"))
val vpnSpark = vpnQuotes.map(_.split(" ")).map(x => x(1))
// Action
vpnSpark.filter(_.contains("Spark")).count()
Code Execution (3)
File: sparkQuotes.txt RDD: quotes RDD: vpnQuotes
VPN Spark is cool
VPN Scala is awesome
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
32. 32
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith(”VPN"))
val vpnSpark = vpnQuotes.map(_.split(" ")).map(x => x(1))
// Action
vpnSpark.filter(_.contains("Spark")).count()
Code Execution (4)
File: sparkQuotes.txt RDD: quotes RDD: vpnQuotes RDD: vpnSpark
Spark
Scala
VPN Spark is cool
VPN Scala is awesome
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
33. 33
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith(”VPN"))
val vpnSpark = vpnQuotes.map(_.split(" ")).map(x => x(1))
// Action
vpnSpark.filter(_.contains("Spark")).count()
Code Execution (5)
File: sparkQuotes.txt RDD: quotes RDD: vpnQuotes
Spark
Scala
RDD: vpnSpark
1
VPN Spark is cool
VPN Scala is awesome
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
VPN Spark is cool
BOB Spark is fun
BRIAN Spark is great
VPN Scala is awesome
BOB Scala is flexible
35. 35
DataFrames
§ A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, an R data frame or a Python pandas DataFrame, but distributed and with query optimizations and predicate pushdown to the underlying storage through a DataSource connector.
§ DataFrames can be constructed from a wide array of sources such as: structured data files,
tables in Hive, external databases, or existing RDDs.
§ Released in Spark 1.3
DataBricks / Spark Summit 2015
36. 36
DataFrames Examples
// Show the content of the DataFrame (default 20 records, use show(int) to specify the number)
df.show()
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select(df("name")).show()   // full statement using Column
df.select($"name").show()      // $ is an implicit conversion to Column; needs import sqlContext.implicits._
df.select("name").show()       // use of a string expression, related to SparkSQL expressions
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
df.select($"name", $"age" + 1).show()
df.selectExpr("name", "age + 1").show()
// Select people older than 21
df.filter(df("age") > 21).show()
df.filter($"age" > 21).show()
df.filter("age > 21").show()
// Count people by age
df.groupBy("age").count().show()
Full documentation at http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrame
39. 39
Comparison between Spark DFs and Python Pandas
§ https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html
§ https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2#.n1rcu0imr
§ https://lab.getbase.com/pandarize-spark-dataframes/
40. 40
DataFrames - Getting Started
§ SQLContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
§ HiveContext created from SparkContext (better SQL support)
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
§ Import a library to convert an RDD to a DataFrame
– Scala:
import sqlContext.implicits._
§ Creating a DataFrame
– From RDD
– Inferring the schema using reflection
– Programmatic Interface
– Using a DataSource connector
41. 41
DataFrames – creating a DataFrame from RDD
§ Inferring the Schema using Reflection
– The case class in Scala defines the schema of the table
case class Person(name: String, age: Int)
– The arguments of the case class become the names of the columns
– Create the RDD of the Person object and create a DataFrame
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
§ Programmatic Interface
– Schema encoded as a String, import SparkSQL Struct types
val schemaString = "name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
– Create the schema represented by a StructType matching the structure of the
Rows in the RDD from previous step
val schema = StructType( schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))
– Apply the schema to the RDD of Rows using the createDataFrame method
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
42. 42
DataFrames - DataSources
Spark 1.4+
§ Generic Load/Save
– val df = sqlContext.read.format("<datasource type>").options(<datasource-specific options, e.g. connection parameters, filename, table, …>).load("<datasource-specific option: table, filename, or nothing>")
– df.write.format("<datasource type>").options(<datasource-specific options, e.g. connection parameters, filename, table, …>).save("<datasource-specific option: table, filename, or nothing>")
§ JSON
– val df = sqlContext.read.load("people.json", "json")
– df.select("name", "age").write.save("namesAndAges.json", “json")
§ CSV (external package)
– val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
– df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")
The Spark Cassandra Connector is a DataSource API compliant connector.
The DataSource API, available from Spark 1.3, provides generic methods to manage connectors to any datasource (file, JDBC, Cassandra, MongoDB, etc.).
The DataSource API provides predicate pushdown capabilities to leverage the performance of the backend. Most connectors are available at http://spark-packages.org/
43. 43
Spark SQL
§ Process relational queries expressed in SQL (HiveQL)
§ Seamlessly mix SQL queries with Spark programs
– Register the DataFrame as a table
people.registerTempTable("people")
– Run SQL statements using the sql method provided by the SQLContext
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <=
19")
§ Provide a single interface for working with structured data
including Apache Hive, Parquet, JSON files and any DataSource
§ Leverages Hive frontend and metastore
– Compatibility with Hive data, queries and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant
§ Connectivity through JDBC/ODBC using the SparkSQL Thrift Server
§ Ad-hoc queries / reporting using BI tools
§ Tableau, Cognos, QlikView, MicroStrategy, …
46. 46
rdd1.join(rdd2)
  .groupBy(…)
  .filter(…)

Scheduling Process
§ RDD Objects: build the operator DAG
§ DAGScheduler: splits the DAG into stages of tasks and submits each stage as it becomes ready – agnostic to operators
§ TaskScheduler: launches the TaskSets via the cluster manager and retries failed or straggling tasks – doesn’t know about stages
§ Worker: executes tasks in executor threads; the Block manager stores and serves blocks
DataBricks
47. 47
DAGScheduler Optimizations
§ Pipelines narrow operations within a stage
§ Picks join algorithms based on partitioning (to minimize shuffles)
§ Reuses previously cached data
(Diagram: a job with join, union, groupBy and map operators split into Stages 1–3; shaded boxes mark previously computed partitions)
DataBricks
48. 48
RDD Direct Acyclic Graph (DAG)
§ View the lineage
§ The whole chain can be written as a single continuous expression
scala> vpnSpark.toDebugString
res1: String =
(2) MappedRDD[4] at map at <console>:16
| MappedRDD[3] at map at <console>:16
| FilteredRDD[2] at filter at <console>:14
| hdfs:/sparkdata/sparkQuotes.txt MappedRDD[1] at textFile at <console>:12
| hdfs:/sparkdata/sparkQuotes.txt HadoopRDD[0] at textFile at <console>:12
val vpnSpark = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt").
  filter(_.startsWith("VPN")).
  map(_.split(" ")).
  map(x => x(1)).
  filter(_.contains("Spark"))
vpnSpark.count()
49. 49
DataFrame Execution
§ DataFrame API calls are transformed into Physical Plan by Catalyst
§ Catalyst provides numerous optimisation schemes
– Join strategy selection through cost model evaluation
– Operations fusion
– Predicates pushdown to the DataSource
– Code generation to reduce virtual function calls (Spark 1.5)
§ The physical execution is carried out as RDD operations
§ The benefits are better performance, leverage of the datasource, and consistent performance across languages
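A quick, hedged way to see Catalyst at work, assuming the df DataFrame with name and age columns from the earlier examples (the exact plan output depends on the Spark version and datasource):

// Filters on a DataFrame backed by a DataSource can be pushed down to the source
val adults = df.filter(df("age") > 21).select("name")

// Prints the logical and physical plans; pushed-down filters appear in the scan node
adults.explain(true)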
50. 50
Showing Multiple Apps
(Diagram: each application’s Driver Program holds a SparkContext that talks to the Cluster Manager; each Worker Node runs an Executor with a cache and tasks for that app)
§ Each Spark application runs as a set of processes coordinated by the
Spark context object (driver program)
– Spark context connects to Cluster Manager (standalone, Mesos/Yarn)
– Spark context acquires executors (JVM instance)
on worker nodes
– Spark context sends tasks to the executors
DataBricks
51. 51
Spark Terminology
§ Context (Connection):
– Represents a connection to the Spark cluster. The application which initiated the context can submit one or several jobs, sequentially or in parallel, in batch or interactively, or run as a long-lived server continuously serving requests.
§ Driver (Coordinator agent)
– The program or process running the Spark context. Responsible for running
jobs over the cluster and converting the App into a set of tasks
§ Job (Query / Query plan):
– A piece of logic (code) which will take some input from HDFS (or the local
filesystem), perform some computations (transformations and actions) and
write some output back.
§ Stage (Subplan)
– Jobs are divided into stages
§ Tasks (Sub section)
– Each stage is made up of tasks. One task per partition. One task is executed
on one partition (of data) by one executor
§ Executor (Sub agent)
– The process responsible for executing a task on a worker node
§ Resilient Distributed Dataset
53. 53
Spark’s Scala and Python Shell
§ Spark comes with two shells
– Scala
– Python
§ APIs available for Scala, Python and Java
§ Appropriate versions for each Spark release
§ Spark’s native language is Scala, so it is more natural to write Spark applications in Scala.
§ This presentation will focus on code examples in Scala
54. 54
Spark’s Scala and Python Shell
§ Powerful tool to analyze data interactively
§ The Scala shell runs on the Java VM
– Can leverage existing Java libraries
§ Scala:
– To launch the Scala shell (from Spark home directory):
dse spark
– To read in a text file:
scala> val textFile =
sc.textFile("file:///Users/vincentponcet/toto")
§ Python:
– To launch the Python shell (from Spark home directory):
dse pyspark
– To read in a text file:
>>> textFile =
sc.textFile("file:///Users/vincentponcet/toto")
55. 55
SparkContext in Applications
§ The main entry point for Spark functionality
§ Represents the connection to a Spark cluster
§ Create RDDs, accumulators, and broadcast variables on that
cluster
§ In the Spark shell, the SparkContext, sc, is automatically initialized
for you to use
§ In a Spark program, import some classes and implicit conversions
into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
56. 56
A Spark Standalone Application in Scala
(The slide shows the application code as an image, annotated with: import statements, SparkConf and SparkContext, transformations and actions.)
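Since the slide shows the code only as an image, here is a minimal, hedged sketch of such an application (reusing the sparkQuotes.txt file from earlier; names are illustrative):

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Transformations and actions
    val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
    val numSparkQuotes = quotes.filter(_.contains("Spark")).count()
    println(s"Lines mentioning Spark: $numSparkQuotes")

    sc.stop()
  }
}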
57. 57
Running Standalone Applications
§ Define the dependencies
– Scala → simple.sbt
§ Create the typical directory structure with the files
§ Create a JAR package containing the application’s code.
– Scala: sbt package
§ Use spark-submit to run the program
Scala:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
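A hedged example of what simple.sbt could contain, followed by the packaging and submit commands (versions are illustrative and should match your cluster):

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"

$ sbt package
$ ./bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar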
58. 58
Spark Properties
§ Set application properties via the SparkConf object
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
.set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
§ Dynamically setting Spark properties
– SparkContext with an empty conf
val sc = new SparkContext(new SparkConf())
– Supply the configuration values at runtime
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
– conf/spark-defaults.conf
60. 60
Spark Monitoring
§ Three ways to monitor Spark applications
1. Web UI
• Default port 4040
• Available for the duration of the application
• Jobs and stages metrics, stage execution plan and tasks timeline
2. Metrics API
• Report to a variety of sinks (HTTP, JMX, and CSV)
• /conf/metrics.properties
62. 62
Spark Streaming
§ Scalable, high-throughput, fault-tolerant stream processing of live
data streams
§ Write Spark streaming applications like Spark applications
§ Recovers lost work and operator state (sliding windows) out-of-the-
box
§ Uses HDFS and Zookeeper for high availability
§ Data sources also include TCP sockets, ZeroMQ or other customized
data sources
63. 63
Spark Streaming - Internals
§ The input stream goes into Spark Streaming
§ Spark Streaming breaks it up into batches of input data
§ Feeds them into the Spark engine for processing
§ Generates the final results as streams of batches
§ DStream - Discretized Stream
– Represents a continuous stream of data created from the input streams
– Internally, represented as a sequence of RDDs
64. 64
Spark Streaming - Getting Started
§ Count the number of words coming in from the TCP socket
§ Import the Spark Streaming classes
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
§ Create the StreamingContext object
val conf =
new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
§ Create a DStream
val lines = ssc.socketTextStream("localhost", 9999)
§ Split the lines into words
val words = lines.flatMap(_.split(" "))
§ Count the words
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
§ Print to the console
wordCounts.print()
65. 65
Spark Streaming - Continued
§ No real processing happens until you tell it
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
§ Code and application can be found in the NetworkWordCount
example
§ To run the example:
– Invoke netcat to start the data stream
– In a different terminal, run the application
./bin/run-example streaming.NetworkWordCount localhost 9999
66. 66
SparkMLlib
§ SparkMLlib is Spark’s machine learning library
§ Since Spark 0.8
§ Provides common algorithms and
utilities
• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
§ Leverages in-memory cache of Spark
to speed up iteration processing
67. 67
SparkMLlib - Getting Started
§ Use k-means clustering for set of latitudes and longitudes
§ Import the SparkMLlib classes
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
§ Create the SparkContext object
val conf = new SparkConf().setAppName("KMeans")
val sc = new SparkContext(conf)
§ Create a data RDD
val taxifile = sc.textFile("user/spark/sparkdata/nyctaxisub/*")
§ Create Vectors for input to algorithm
val taxi =
taxifile.map{line=>Vectors.dense(line.split(",").slice(3,5).map(_.toDouble))}
§ Run the k-means algorithm with 3 clusters and 10 iterations
val model = KMeans.train(taxi, 3, 10)
val clusterCenters = model.clusterCenters.map(_.toArray)
§ Print to the console
clusterCenters.foreach(lines=>println(lines(0),lines(1)))
68. 68
SparkML
§ SparkML provides an API to build ML pipeline (since Spark 1.3)
§ Similar to Python scikit-learn
§ SparkML provides abstraction for all steps of an ML workflow
(Diagrams: a generic ML workflow vs. a real-life ML workflow)
§ Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
§ Estimator: An Estimator is an algorithm which can be fit on a
DataFrame to produce a Transformer. E.g., a learning algorithm
is an Estimator which trains on a dataset and produces a model.
§ Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.
§ Param: All Transformers and Estimators now share a common
API for specifying parameters.
Xebia HUG France 06/2015
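A minimal, hedged sketch of an ML Pipeline (the DataFrame and column names are illustrative; assumes a sqlContext as created earlier):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical training DataFrame with "text" and "label" columns
val training = sqlContext.createDataFrame(Seq(
  (0L, "spark rocks", 1.0),
  (1L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformers and an Estimator chained into a Pipeline
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting the Pipeline (an Estimator) produces a PipelineModel (a Transformer)
val model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show()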
70. 70
Cassandra Spark Driver
§ Cassandra tables exposed as
– Spark RDDs
– Spark DataFrames
§ Load data from Cassandra to Spark
§ Write data from Spark to Cassandra
§ Object mapper: mapping of C* tables and rows to Scala/Java objects
§ All Cassandra types supported and converted to Scala/Java types
§ Server side data selection
§ Virtual Nodes support
§ Data Locality awareness
§ Scala, Python and Java APIs
71. 71
$ dse spark
...
Spark context available as sc.
HiveSQLContext available as hc.
CassandraSQLContext available as csc.
scala> sc.cassandraTable("test", "kv")
res5: com.datastax.spark.connector.rdd.CassandraRDD
[com.datastax.spark.connector.CassandraRow] =
CassandraRDD[2] at RDD at CassandraRDD.scala:48
scala> sc.cassandraTable("test", "kv").collect
res6: Array[com.datastax.spark.connector.CassandraRow] =
Array(CassandraRow{k: 1, v: foo})
cqlsh> select * from test.kv;

 k | v
---+-----
 1 | foo

(1 rows)
DSE Spark Interactive Shell
72. 72
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.spark.connector._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10")  // initial contact
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")

val sc = new SparkContext(conf)
73. 73
Reading Data as RDD
val table = sc
.cassandraTable[CassandraRow]("db", "tweets")
.select("user_name", "message")
.where("user_name = ?", "ewa")
(Annotations: CassandraRow is the row representation; "db" and "tweets" are the keyspace and table; select and where perform server-side column and row selection.)
74. 74
Writing RDD to Cassandra
CREATE TABLE test.words(word TEXT PRIMARY KEY, count INT);
val collection = sc.parallelize(Seq(("foo", 2), ("bar", 5)))
collection.saveToCassandra("test", "words", SomeColumns("word",
"count"))
cqlsh:test> select * from
words;
word | count
-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐
bar | 5
foo | 2
(2 rows)
75. 75
Mapping Rows to Objects
CREATE TABLE test.cars (
  id text PRIMARY KEY,
  model text,
  fuel_type text,
  year int
);

case class Vehicle(
  id: String,
  model: String,
  fuelType: String,
  year: Int
)

sc.cassandraTable[Vehicle]("test", "cars").toArray
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

• Mapping rows to Scala case classes
• CQL underscore_case columns mapped to Scala camelCase properties
• Custom mapping functions (see docs)
76. 76
Type Mapping
CQL Type Scala Type
ascii String
bigint Long
boolean Boolean
counter Long
decimal BigDecimal, java.math.BigDecimal
double Double
float Float
inet java.net.InetAddress
int Int
list Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map Map, TreeMap, java.util.HashMap
set Set, TreeSet, java.util.HashSet
text, varchar String
timestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid java.util.UUID
uuid java.util.UUID
varint BigInt, java.math.BigInteger
*nullable values Option
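A small, hedged sketch of the nullable-value mapping (table and case class are illustrative; requires import com.datastax.spark.connector._):

// CQL: CREATE TABLE test.users (id text PRIMARY KEY, age int);
case class User(id: String, age: Option[Int])   // age may be null in Cassandra

sc.cassandraTable[User]("test", "users")
  .filter(_.age.isDefined)
  .collect()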
77. 77
§ Mapping of Cassandra keyspaces and tables
§ Read and write on Cassandra tables
Usage of Spark DataFrames
import com.datastax.spark.connector._
// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)
// Create Cassandra SQL context
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Cassandra table as a Spark DataFrame
val df = sqlContext.read.format("org.apache.spark.sql.cassandra").
  options(Map("keyspace" -> "music", "table" -> "tracks_by_album")).
  load()

// Some DataFrame transformations   // for floor: import org.apache.spark.sql.functions._
val tmp = df.groupBy((floor(df("album_year") / 10) * 10).cast("int").alias("decade")).count
val count_by_decade = tmp.select(tmp("decade"), tmp("count").alias("album_count"))
count_by_decade.show

// Save the resulting DataFrame into a Cassandra table (requires import org.apache.spark.sql.SaveMode)
count_by_decade.write.format("org.apache.spark.sql.cassandra").
  options(Map("table" -> "albums_by_decade", "keyspace" -> "steve")).
  mode(SaveMode.Overwrite).save()
78. 78
§ Mapping of Cassandra keyspaces and tables
§ Read and write on Cassandra tables
Usage of Spark SQL
import com.datastax.spark.connector._
// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)
// Create Cassandra SQL context
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Create the SparkSQL alias to the Cassandra table
sqlContext.sql("""CREATE TEMPORARY TABLE tmp_tracks_by_album
  USING org.apache.spark.sql.cassandra
  OPTIONS (keyspace "music", table "tracks_by_album")""")

// Execute the SparkSQL query. The result of the query is a DataFrame
val track_count_by_year = sqlContext.sql("""select 'dummy' as dummy, album_year as year, count(*) as track_count
  from tmp_tracks_by_album
  group by album_year""")

// Register the DataFrame as a SparkSQL alias
track_count_by_year.registerTempTable("tmp_track_count_by_year")

// Save the DataFrame using a SparkSQL insert into
sqlContext.sql("insert into table music.tracks_by_year select dummy, track_count, year from tmp_track_count_by_year")
79. 79
§ Easy setup and config
– No need to setup a separate Spark cluster
– No need to tweak classpaths or config files
§ High availability of Spark Master
§ Enterprise security
– Password / Kerberos / LDAP authentication
– SSL for all Spark to Cassandra connections
§ Support for the embedded Apache Spark is included in the DSE Max subscription
DataStax Enterprise Max - Spark Special Features
80. 80
DataStax Enterprise - High Availability
§ All nodes are Spark Workers
§ By default resilient to Worker failures
§ The first Spark node is promoted as Spark Master (state saved in Cassandra, no SPOF)
§ A standby Master is promoted on failure (the new Spark Master reconnects to the Workers and the driver app and continues the job)
§ No need for ZooKeeper, unlike:
– Spark Standalone
– Mesos
– YARN