Apache Spark
Syed
Solutions Engineer - Big Data
mail.syed786@gmail.com
info.syedacademy@gmail.com
+91-9030477368
Spark_RDD_SyedAcademy
Writing my own RDD? What for?
● To write your own RDD, you need to understand, to some extent, the internal mechanics of Apache Spark
● Writing your own RDD demonstrates that you understand them well
● When connecting to external storage, it is reasonable to create your own RDD for it
RDD - the definition
RDD stands for resilient distributed dataset
Dataset - initial data comes from some distributed storage
Distributed - stored on nodes across the cluster
Resilient - if data is lost, it can be recreated
Quiz: what is an “RDD”?
A: distributed collection of objects on disk
B: distributed collection of objects in memory
C: distributed collection of objects in Cassandra
Answer: could be any of the above!
Scientific Answer: RDD is an Interface!
1. Set of partitions (“splits” in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s) for each partition
Items 1-3 make up the “lineage”; items 4-5 enable optimized execution.
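The five parts of the interface can be sketched in plain Python. This is a toy model, not Spark's actual API: the class names ToyRDD, RangeSource and Mapped, the collect helper, and the "host-N" location names are all hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, List, Optional


class ToyRDD(ABC):
    """Hypothetical stand-in for the five-part RDD interface (not Spark's API)."""

    @abstractmethod
    def partitions(self) -> List[int]:
        """1. The set of partitions, identified here by integer index."""

    def dependencies(self) -> List["ToyRDD"]:
        """2. Parent RDDs this one is derived from (none for a source)."""
        return []

    @abstractmethod
    def compute(self, split: int) -> Iterator:
        """3. Produce the records of one partition as an iterator."""

    def partitioner(self) -> Optional[Callable]:
        """4. Optional: how keys map to partitions."""
        return None

    def preferred_locations(self, split: int) -> List[str]:
        """5. Optional: hosts where this partition is cheap to compute."""
        return []


class RangeSource(ToyRDD):
    """Leaf RDD: the integers 0..n-1 cut into contiguous slices."""

    def __init__(self, n: int, num_partitions: int):
        self.n = n
        self.num_partitions = num_partitions

    def partitions(self):
        return list(range(self.num_partitions))

    def compute(self, split):
        start = split * self.n // self.num_partitions
        end = (split + 1) * self.n // self.num_partitions
        return iter(range(start, end))

    def preferred_locations(self, split):
        return [f"host-{split}"]  # made-up host names


class Mapped(ToyRDD):
    """Derived RDD: applies fn to every record of its single parent."""

    def __init__(self, parent: ToyRDD, fn: Callable):
        self.parent = parent
        self.fn = fn

    def partitions(self):
        return self.parent.partitions()

    def dependencies(self):
        return [self.parent]

    def compute(self, split):
        # Pull records lazily from the parent's partition.
        return (self.fn(x) for x in self.parent.compute(split))


def collect(rdd: ToyRDD) -> list:
    """Toy 'action': compute every partition and concatenate the results."""
    out = []
    for split in rdd.partitions():
        out.extend(rdd.compute(split))
    return out


source = RangeSource(10, 3)
squares = Mapped(source, lambda x: x * x)
result = collect(squares)
```

In this sketch, an RDD for an external store would be another subclass that overrides partitions, compute and, where the store exposes data placement, preferred_locations.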
RDD Persistence
•Each node stores in memory any partitions of the RDD that it computes and reuses them in other actions on that dataset.
•After an RDD is marked to be persisted, the first time it is computed in an action, it will be kept in memory on the nodes.
•Allows future actions to be much faster (often by more than 10x) since you’re not re-computing the same data every time you perform an action.
•If the data is too big to be cached in memory, then, depending on the storage level, the partitions that do not fit are either recomputed when needed (MEMORY_ONLY) or spilled to disk (MEMORY_AND_DISK), and performance degrades gradually
•Old partitions are evicted with a Least Recently Used (LRU) replacement policy
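The caching and LRU behaviour described above can be illustrated with a small pure-Python model. It is only a sketch under simplifying assumptions: the class LruPartitionCache and the function expensive are hypothetical, capacity is counted in partitions rather than bytes, and evicted partitions are simply recomputed on the next access (as with a memory-only storage level).

```python
from collections import OrderedDict


class LruPartitionCache:
    """Toy model of one node's in-memory partition store (not Spark's code)."""

    def __init__(self, capacity: int):
        self.capacity = capacity      # max number of partitions kept in memory
        self.blocks = OrderedDict()   # partition id -> records, oldest first
        self.computations = 0         # how many times a partition was computed

    def get_or_compute(self, partition_id, compute):
        if partition_id in self.blocks:
            # Cache hit: mark the partition as most recently used.
            self.blocks.move_to_end(partition_id)
            return self.blocks[partition_id]
        # Cache miss: compute the partition and keep it for later actions.
        self.computations += 1
        data = list(compute(partition_id))
        self.blocks[partition_id] = data
        if len(self.blocks) > self.capacity:
            # Over capacity: drop the least recently used partition.
            self.blocks.popitem(last=False)
        return data


def expensive(pid):
    """Stand-in for a costly partition computation."""
    return (pid * 10 + i for i in range(3))


cache = LruPartitionCache(capacity=2)
cache.get_or_compute(0, expensive)   # miss: computed and cached
cache.get_or_compute(1, expensive)   # miss: computed and cached
cache.get_or_compute(0, expensive)   # hit: no recomputation
cache.get_or_compute(2, expensive)   # miss: partition 1 is evicted
cached_ids = list(cache.blocks)
```

After these four calls only partitions 0 and 2 remain cached, and asking for partition 1 again triggers a fresh computation.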
RDD Persistence (Storage Levels)
RDD Persistence APIs
rdd.persist()
rdd.persist(StorageLevel)
•With no argument, persists this RDD with the default storage level (MEMORY_ONLY).
•Pass a StorageLevel to override the default for fine-grained control over persistence
rdd.cache()
•Persists the RDD with the default storage level (MEMORY_ONLY); equivalent to rdd.persist()
rdd.checkpoint()
•The RDD will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir("/path/to/dir")
•Used for RDDs with long lineage chains that contain wide dependencies, since these would be expensive to re-compute
rdd.unpersist()
•Marks the RDD as non-persistent and removes all of its blocks from memory and disk
Fault Tolerance
• RDDs contain lineage graphs (coarse-grained updates/transformations) that help them rebuild partitions that were lost
• Only the lost partitions of an RDD need to be recomputed upon failure.
• They can be recomputed in parallel on different nodes without having to roll back the entire app
• Lineage also lets the system tolerate slow nodes (stragglers) by running a backup copy of the troubled task.
• The original task on the straggling node will be killed when the backup copy completes
• Cached/checkpointed partitions are also used to re-compute lost partitions, if available
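Lineage-based recovery can be sketched in plain Python as follows. This is a toy model rather than Spark's implementation: the Lineage class, its computed_log field and the source function are hypothetical names, and the "node failure" is simulated by deleting one entry from a dictionary.

```python
class Lineage:
    """Toy lineage: a source plus an ordered chain of coarse-grained maps."""

    def __init__(self, source, num_partitions):
        self.source = source              # partition id -> list of records
        self.num_partitions = num_partitions
        self.steps = []                   # transformations, applied in order
        self.computed_log = []            # ids of partitions built from lineage

    def map(self, fn):
        self.steps.append(fn)
        return self

    def compute(self, pid):
        # Replay the whole chain of transformations for one partition.
        records = self.source(pid)
        for fn in self.steps:
            records = [fn(r) for r in records]
        self.computed_log.append(pid)
        return records


def source(pid):
    """Stand-in for reading one partition from stable storage."""
    return [pid * 3 + i for i in range(3)]


lineage = Lineage(source, num_partitions=4).map(lambda x: x + 1).map(lambda x: x * 2)

# Materialize every partition once, as a first action would.
materialized = {pid: lineage.compute(pid) for pid in range(4)}
lineage.computed_log.clear()

# Simulate losing the node that held partition 2.
expected = list(materialized[2])
del materialized[2]

# Recovery: rebuild only the partitions that are missing.
for pid in range(lineage.num_partitions):
    if pid not in materialized:
        materialized[pid] = lineage.compute(pid)
```

Only partition 2 is replayed from its lineage; the three surviving partitions are left untouched.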
Thank you!
www.syedacademy.com
