Apache Spark.

Jun 4, 2022

Apache Spark is an open-source cluster computing framework for large-scale data processing. It supports batch processing, real-time processing, streaming analytics, machine learning, interactive queries, and graph processing. Spark core provides distributed task dispatching and scheduling. It works by having a driver program that connects to a cluster manager to run tasks on executors in worker nodes. Spark also introduces Resilient Distributed Datasets (RDDs) that allow immutable, parallel data processing. Common RDD transformations include map, flatMap, groupByKey, and reduceByKey while common actions include reduce.

Apache Spark
By Janani.J
I.M.Sc Information Technology

Spark
💜 Spark is used for data processing.
💜 Apache Spark is a cluster computing technology designed for fast
computation.
💜 Apache Spark is an open-source cluster computing framework
for data analytics. It supports in-memory cluster computing and
parallel processing, and promises to be faster than Hadoop MapReduce.
💜 Spark supports various high-level tools for data analysis.
💜 Spark provides APIs for Scala, Java, Python, and R.

Spark Unifies
🍒 Batch processing
🍒 Real-time processing
🍒 Stream Analytics
🍒 Machine Learning
🍒 Interactive SQL
🍒 Graph processing

Spark Core
Spark Core provides distributed task dispatching, scheduling, and basic I/O. These
capabilities are exposed through an application programming interface (for Java,
Python, Scala, and R) that the driver program calls.

Driver Program
🎈The driver program is the process that runs the main() function of the application and
creates the SparkContext object. The SparkContext coordinates Spark applications,
which run as independent sets of processes on a cluster.
🎈To run on a cluster, the SparkContext connects to one of several types of cluster
manager and then performs the following steps:
🎈It acquires executors on nodes in the cluster.
🎈It sends your application code to the executors. The application code is
defined by JAR or Python files passed to the SparkContext.
🎈Finally, the SparkContext sends tasks to the executors to run.

Cluster Manager
💜The role of the cluster manager is to allocate resources across applications.
Spark can run on large clusters.
💜Supported cluster managers include Hadoop YARN, Apache Mesos, and the
Standalone Scheduler.
💜The Standalone Scheduler is Spark's own cluster manager. It makes it
possible to install Spark on an empty set of machines.

Worker node
💜The worker node is a slave node.
💜Its role is to run the application code in the cluster.

Executor
👉An executor is a process launched for an application on a worker node.
👉It runs tasks and keeps data in memory or disk storage across them.
👉It reads and writes data to external sources.
👉Every application has its own executors.

Task
👉A unit of work that will be sent to one executor.
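
The roles described above (driver, executors, tasks) can be modelled in a few lines of plain Python. This is only a toy sketch and does not use Spark: a thread pool stands in for the executors, and the names make_partitions, task, and driver are hypothetical helpers chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(data, n):
    """Split the data into n roughly equal partitions (hypothetical helper)."""
    return [data[i::n] for i in range(n)]


def task(partition):
    """A task: the unit of work sent to one executor.
    Here it sums the squares of the numbers in its partition."""
    return sum(x * x for x in partition)


def driver(data, num_executors=3):
    """Toy driver: splits the work, sends one task per partition to the
    executors, and combines the partial results."""
    partitions = make_partitions(data, num_executors)
    # The thread pool stands in for executors acquired on worker nodes.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    return sum(partial_results)


result = driver(list(range(10)))
print(result)
```

In real Spark the executors are separate processes on worker nodes and the cluster manager decides where they run. The sketch keeps only the division of labour: the driver plans the work, and the executors run the tasks.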

Resilient Distributed Datasets…
⭐ Spark Core is built around RDDs (Resilient Distributed Datasets): immutable,
fault-tolerant, distributed collections of objects that can be operated
on in parallel.

Let’s see some of the frequently used RDD Transformations.
map(func)
It returns a new distributed dataset formed by passing each element of the source
through a function func.
flatMap(func)
Here, each input item can be mapped to zero or more output items, so func should
return a sequence rather than a single item.
groupByKey([numPartitions])
It returns a dataset of (K, Iterable<V>) pairs when called on a dataset of (K, V) pairs.
reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, it returns a dataset of (K, V) pairs where the
values for each key are aggregated using the given reduce function func, which must be
of type (V,V) => V.
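
The semantics of these four transformations can be sketched on ordinary Python lists. This is a minimal single-machine model, not the Spark API: the rdd_* functions are hypothetical stand-ins for the RDD methods map, flatMap, groupByKey, and reduceByKey, and sorting by key is done only to make the output reproducible (Spark itself does not guarantee key order).

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def rdd_map(data, func):
    """map: exactly one output element per input element."""
    return [func(x) for x in data]


def rdd_flat_map(data, func):
    """flatMap: func returns a sequence; the sequences are flattened."""
    return list(chain.from_iterable(func(x) for x in data))


def rdd_group_by_key(pairs):
    """groupByKey: (K, V) pairs -> (K, list of V) pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def rdd_reduce_by_key(pairs, func):
    """reduceByKey: combine the values of each key with func: (V, V) -> V."""
    return [(key, reduce(func, values)) for key, values in rdd_group_by_key(pairs)]


# A small word count built from the transformations above.
lines = ["spark is fast", "spark is simple"]
words = rdd_flat_map(lines, lambda line: line.split())
pairs = rdd_map(words, lambda word: (word, 1))
grouped = rdd_group_by_key(pairs)
counts = rdd_reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

One difference from this model is that Spark transformations are lazy: nothing is computed until an action is called, whereas the functions here run immediately.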

Let’s see some of the frequently used RDD Actions.
reduce(func)
It aggregates the elements of the dataset using a function func (which
takes two arguments and returns one). The function should be
commutative and associative so that it can be computed correctly in
parallel.
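
Why the function should be commutative and associative can be seen in a small plain-Python sketch. It assumes a simplified model in which each partition is reduced on its own and the partial results are then combined; rdd_reduce is a hypothetical helper, not the Spark API.

```python
from functools import reduce


def rdd_reduce(partitions, func):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


numbers = list(range(1, 11))

# The same data, partitioned in two different ways.
split_a = [numbers[:5], numbers[5:]]
split_b = [numbers[:3], numbers[3:4], numbers[4:]]


def add(a, b):
    return a + b  # commutative and associative


def sub(a, b):
    return a - b  # neither commutative nor associative


total_a = rdd_reduce(split_a, add)
total_b = rdd_reduce(split_b, add)
diff_a = rdd_reduce(split_a, sub)
diff_b = rdd_reduce(split_b, sub)

print(total_a, total_b)  # addition gives the same total for both splits
print(diff_a, diff_b)    # subtraction depends on how the data was split
```

With addition the result is the same however the data is partitioned. With subtraction the two partitionings disagree, which is why such a function is not suitable for reduce.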


How to manage Multiple Warehouses for multiple floors in odoo point of saleCeline George

How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingCeline George

Biophysics Chapter 3 Methods of Studying Macromolecules.pdfPKLI-Institute of Nursing and Allied Health Sciences Lahore , Pakistan.

This chapter provides an in-depth overview of the viscosity of macromolecules, an essential concept in biophysics and medical sciences, especially in understanding fluid behavior like blood flow in the human body. Key concepts covered include: ✅ Definition and Types of Viscosity: Dynamic vs. Kinematic viscosity, cohesion, and adhesion. ⚙️ Methods of Measuring Viscosity: Rotary Viscometer Vibrational Viscometer Falling Object Method Capillary Viscometer 🌡️ Factors Affecting Viscosity: Temperature, composition, flow rate. 🩺 Clinical Relevance: Impact of blood viscosity in cardiovascular health. 🌊 Fluid Dynamics: Laminar vs. turbulent flow, Reynolds number. 🔬 Extension Techniques: Chromatography (adsorption, partition, TLC, etc.) Electrophoresis (protein/DNA separation) Sedimentation and Centrifugation methods.

UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACYDR.PRISCILLA MARY J

Metamorphosis: Life's Transformative JourneyArshad Shaikh

Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Library Association of Ireland

Unit 6_Introduction_Phishing_Password Cracking.pdfKanchanPatil34

pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulsesushreesangita003

GDGLSPGCOER - Git and GitHub Workshop.pptxazeenhodekar

P-glycoprotein pamphlet: iteration 4 of 4 finalbs22n2s

Anti-Depressants pharmacology 1slide.pptxMayuri Chavan

Presentation on Tourism Product Development By Md Shaifullar RabbiMd Shaifullar Rabbi

apa-style-referencing-visual-guide-2025.pdfIshika Ghosh

Operations Management (Dr. Abdulfatah Salem).pdfArab Academy for Science, Technology and Maritime Transport

Presentation of the MIPLM subject matter expert Erdem KayaMIPLM

Apache Spark.

1. Apache Spark… By, Janani.J I.M.Sc Information technology

2. Spark 💜 Spark is used for data processing. 💜 Apache Spark is designed for fast computation on a cluster. 💜 Apache Spark is an open source cluster computing framework for data analytics. It supports in-memory cluster computing and parallel processing, and promises to be faster than Hadoop MapReduce. 💜 Spark supports various high-level tools for data analysis. 💜 Spark provides APIs for Scala, Java and Python.

3. Spark Unifies 🍒 Batch processing 🍒 Real-time processing 🍒 Stream Analytics 🍒 Machine Learning 🍒 Interactive SQL 🍒 Graph processing

4. How does Apache Spark work?

5. Spark Components..

6. Spark Core Spark Core provides distributed task dispatching, scheduling, and basic I/O. These functionalities are exposed through an application programming interface (for Java, Python, Scala, and R), which the driver program uses.

7. Spark Architecture….

8. Driver Program 🎈 The driver program is a process that runs the main() function of the application and creates the SparkContext object. The SparkContext coordinates Spark applications, which run as independent sets of processes on a cluster. 🎈 To run on a cluster, the SparkContext connects to one of several types of cluster manager and then performs the following tasks: 🎈 It acquires executors on nodes in the cluster. 🎈 It then sends your application code to the executors. The application code is defined by JAR or Python files passed to the SparkContext. 🎈 Finally, the SparkContext sends tasks to the executors to run.

9. Cluster Manager 💜 The role of the cluster manager is to allocate resources across applications. 💜 Spark can run on several types of cluster manager, such as Hadoop YARN, Apache Mesos and the Standalone Scheduler. 💜 The Standalone Scheduler is Spark's own cluster manager, which makes it possible to install Spark on an empty set of machines. Worker node 💜 The worker node is a slave node. 💜 Its role is to run the application code in the cluster.

10. Executor 👉 An executor is a process launched for an application on a worker node. 👉 It runs tasks and keeps data in memory or disk storage across them. 👉 It reads data from and writes data to external sources. 👉 Every application has its own executors. Task 👉 A unit of work that is sent to one executor.
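The flow described in slides 8 to 10 (a driver creates tasks, executors run them, the driver collects the results) can be modelled locally. The following is a minimal sketch in plain Python, not Spark code: a thread pool stands in for the executors, and the names `make_tasks`, `run_job`, and `num_executors` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(data, num_partitions):
    """Split the input into roughly equal partitions; each one becomes a task."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_job(data, func, num_executors=3):
    """Toy 'driver': create tasks, hand them to executors, collect the results."""
    tasks = make_tasks(data, num_executors)
    # The pool plays the role of executors acquired through a cluster manager.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # Each task (one partition plus the function to apply) runs on one executor.
        partial_results = list(
            executors.map(lambda part: [func(x) for x in part], tasks)
        )
    # The driver gathers the per-task results in partition order.
    return [y for part in partial_results for y in part]


squares = run_job(list(range(10)), lambda x: x * x)
print(squares)
```

In a real cluster the executors are separate processes on worker nodes and the application code is shipped to them; this sketch only mirrors the division of roles.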

11. Resilient Distributed Datasets… ⭐ Spark Core is built around RDDs (Resilient Distributed Datasets): immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel.

12. Let’s see some of the frequently used RDD Transformations. map(func) It returns a new distributed dataset formed by passing each element of the source through a function func. flatMap(func) Here, each input item can be mapped to zero or more output items, so func should return a sequence rather than a single item. groupByKey([numPartitions]) When called on a dataset of (K, V) pairs, it returns a dataset of (K, Iterable<V>) pairs. reduceByKey(func, [numPartitions]) When called on a dataset of (K, V) pairs, it returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
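The semantics of these four transformations can be illustrated on ordinary lists. This is a minimal sketch in plain Python, not the Spark API: the helper names (`rdd_map`, `rdd_flat_map`, `rdd_group_by_key`, `rdd_reduce_by_key`) are hypothetical, and everything runs in a single process rather than across partitions. Each helper returns a new list and leaves its input untouched, which mirrors the immutability of RDDs.

```python
from collections import defaultdict
from functools import reduce


def rdd_map(data, func):
    """map: exactly one output element per input element."""
    return [func(x) for x in data]


def rdd_flat_map(data, func):
    """flatMap: func returns a sequence; the sequences are flattened."""
    return [y for x in data for y in func(x)]


def rdd_group_by_key(pairs):
    """groupByKey: (K, V) pairs -> (K, list of V) pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def rdd_reduce_by_key(pairs, func):
    """reduceByKey: combine the values of each key with func: (V, V) -> V."""
    return [(key, reduce(func, values)) for key, values in rdd_group_by_key(pairs)]


lines = ["to be", "or not to be"]
words = rdd_flat_map(lines, str.split)          # one line -> many words
pairs = rdd_map(words, lambda w: (w, 1))        # one word -> one (word, 1) pair
grouped = rdd_group_by_key(pairs)
counts = rdd_reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

The word count shows the difference between the two key-based operations: groupByKey keeps every value per key, while reduceByKey collapses them to a single value.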

13. Let’s see some of the frequently used RDD Actions. reduce(func) It aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
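Why the function should be commutative and associative can be seen by reducing each partition separately and then combining the partial results, which is roughly what a parallel reduce does. This is a minimal sketch in plain Python under that assumption, not Spark code; `partitioned_reduce` and the two example splits are hypothetical.

```python
from functools import reduce


def partitioned_reduce(partitions, func):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


data = [1, 2, 3, 4, 5, 6]
split_a = [[1, 2, 3], [4, 5, 6]]      # the same data in two partitions
split_b = [[1, 2], [3, 4], [5, 6]]    # the same data in three partitions


def add(a, b):
    return a + b    # commutative and associative


def sub(a, b):
    return a - b    # neither commutative nor associative


# Addition gives the same answer however the data is partitioned.
sum_a = partitioned_reduce(split_a, add)
sum_b = partitioned_reduce(split_b, add)

# Subtraction depends on the partitioning, so the result is not well defined.
diff_sequential = reduce(sub, data)
diff_a = partitioned_reduce(split_a, sub)
diff_b = partitioned_reduce(split_b, sub)

print(sum_a, sum_b)
print(diff_sequential, diff_a, diff_b)
```

With addition both splits agree with the sequential sum. With subtraction the sequential result and the two partitioned results all differ, which is why such a function is unsuitable for reduce.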

14. Thank you..
