Apache Spark

May 10, 2017

This document provides an introduction to and overview of Apache Spark, including:
• Spark is a lightning-fast cluster computing framework designed for fast computation on large datasets.
• It features in-memory cluster computing to increase processing speed and is used for fast data analytics such as batch processing, iterative algorithms, and streaming.
• Spark evolved from a UC Berkeley research project and is now a top-level Apache project used by many large companies like IBM, Netflix, and Anthropic.

APACHE SPARK
Presented by
TEJPAL GAUTAM
1373513041
(I.T.) FINAL

AGENDA
• SPARK – INTRODUCTION
• EVOLUTION OF APACHE SPARK
• FEATURES OF APACHE SPARK
• COMPONENTS OF APACHE SPARK
• WHY SPARK?
• EXECUTION FLOW
• OPERATIONS ON MAPREDUCE AND SPARK
• WHO IS USING APACHE SPARK?

INTRODUCTION
• Apache Spark is a lightning-fast cluster computing
technology, designed for fast computation.
• The main concern is to maintain speed when processing large
datasets, both the waiting time between queries and the
waiting time to run a program.
• The main feature of Spark is its in-memory cluster
computing, which increases the processing speed of an
application.
• It is used for fast data analytics.
• Spark is designed to cover a wide range of workloads such
as batch applications, iterative algorithms, interactive
queries and streaming. By supporting all these workloads
in a single system, it also reduces the management burden
of maintaining separate tools.
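To make the in-memory idea concrete, here is a minimal sketch in plain Java. It is not the Spark API: the class `CachedDatasetSketch`, the helper `cached`, and the three sample lines are all hypothetical and exist only for illustration. It shows a dataset that is loaded once, kept in memory, and then reused by two queries, which is roughly the effect that caching is meant to have on the waiting time between queries.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Toy illustration (plain Java, not the Spark API) of keeping a dataset in memory. */
final class CachedDatasetSketch {

    /** Counts how often the "slow" source is actually read. */
    static final AtomicInteger loads = new AtomicInteger();

    /** Stand-in for reading a large file from disk; the data is made up. */
    static List<String> loadLines() {
        loads.incrementAndGet();
        return List.of("spark is fast", "hadoop writes to disk", "a b c");
    }

    /** Wraps a source so its value is computed on first use and then kept in memory. */
    static <T> Supplier<T> cached(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean loaded;

            @Override
            public synchronized T get() {
                if (!loaded) {
                    value = source.get();
                    loaded = true;
                }
                return value;
            }
        };
    }

    /** Counts the lines that contain the given text. */
    static long countContaining(Supplier<List<String>> data, String text) {
        return data.get().stream().filter(line -> line.contains(text)).count();
    }

    public static void main(String[] args) {
        Supplier<List<String>> data = cached(CachedDatasetSketch::loadLines);
        long withA = countContaining(data, "a"); // first query triggers the load
        long withB = countContaining(data, "b"); // second query reuses the in-memory copy
        System.out.println("Lines with a: " + withA + ", lines with b: " + withB
                + ", loads: " + loads.get());
    }
}
```

Both queries go through the same cached supplier, so the loader runs only once. Without the `cached` wrapper, each query would read the source again.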

EVOLUTION OF APACHE SPARK
Spark began in 2009 as a research project in UC
Berkeley's AMPLab, started by Matei Zaharia. It
was open sourced in 2010 under a BSD license. It
was donated to the Apache Software Foundation in
2013, and Apache Spark has been a top-level
Apache project since February 2014.

WHY SPARK?
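• Most machine learning algorithms are iterative, because
each iteration can improve the results.
• With a disk-based approach, each iteration's output is written
to disk, which makes the whole job slow.
• Spark keeps intermediate data in memory between iterations,
which avoids that disk round trip.
[Figure: Hadoop execution flow]
[Figure: Spark execution flow]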
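The difference between the two execution flows can be sketched in plain Java, without Spark or Hadoop. Everything here is an assumption made for illustration: the class `IterationSketch`, its method names, and the choice of Newton's method for a square root as the iterative algorithm. One variant passes the intermediate result between iterations through a temporary file; the other keeps it in a local variable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy comparison (plain Java, not Spark or Hadoop) of disk-based and in-memory iteration. */
final class IterationSketch {

    /** One refinement step of Newton's method for sqrt(target). */
    static double refine(double estimate, double target) {
        return 0.5 * (estimate + target / estimate);
    }

    /** In-memory style: the estimate stays in a local variable between iterations. */
    static double inMemory(double target, int iterations) {
        double estimate = target;
        for (int i = 0; i < iterations; i++) {
            estimate = refine(estimate, target);
        }
        return estimate;
    }

    /** Disk-based style: each iteration reads its input from a file and writes its output back. */
    static double viaDisk(double target, int iterations) {
        try {
            Path state = Files.createTempFile("iteration-state", ".txt");
            try {
                Files.writeString(state, Double.toString(target));
                for (int i = 0; i < iterations; i++) {
                    double estimate = Double.parseDouble(Files.readString(state));
                    Files.writeString(state, Double.toString(refine(estimate, target)));
                }
                return Double.parseDouble(Files.readString(state));
            } finally {
                Files.deleteIfExists(state);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("in memory: " + inMemory(2.0, 20));
        System.out.println("via disk:  " + viaDisk(2.0, 20));
    }
}
```

Both variants perform the same arithmetic and return the same value. The disk-based one adds a file read and a file write to every iteration, which is the overhead the slide attributes to the disk-based approach.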

STANDALONE (SCALA)
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Should be some file on your system
    val logFile = "YOUR_SPARK_HOME/README.md"
    // Run on a local master under the given application name
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the file into at least 2 partitions and keep it in memory for reuse
    val logData = sc.textFile(logFile, 2).cache()
    // Both counts below reuse the cached data
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    sc.stop()
  }
}

STANDALONE (JAVA)
/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Keep the file contents in memory so both counts reuse them
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    // Count the lines that contain "a"
    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    // Count the lines that contain "b"
    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    sc.stop();
  }
}

Apache spark

1. APACHE SPARK Presented by TEJPAL GAUTAM, 1373513041, (I.T.) FINAL

2. AGENDA • SPARK – INTRODUCTION • EVOLUTION OF APACHE SPARK • FEATURES OF APACHE SPARK • COMPONENTS OF APACHE SPARK • WHY SPARK? • EXECUTION FLOW • OPERATIONS ON MAPREDUCE AND SPARK • WHO IS USING APACHE SPARK?

3. INTRODUCTION • Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. • Its main concern is speed when processing large datasets, both the waiting time between queries and the waiting time to run a program. • The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. • It is used for fast data analytics. • Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming. By supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

4. EVOLUTION OF APACHE SPARK Spark began in 2009 as a research project in UC Berkeley’s AMPLab, started by Matei Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and it has been a top-level Apache project since February 2014.

5. FEATURES OF APACHE SPARK

6. COMPONENTS OF APACHE SPARK

7. SPARK STREAMING

9. SPARK SQL

10. SPARK MLlib

11. GraphX

12. WHY SPARK? • Most machine learning algorithms are iterative, because each iteration can improve the results. • With a disk-based approach, each iteration’s output is written to disk, which makes it slow. [Figures: Hadoop execution flow; Spark execution flow]
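The contrast on slide 12 can be sketched with the JDK alone, without Spark or Hadoop. This is only an analogy under stated assumptions: the class and method names are hypothetical, a Newton's-method step stands in for one iteration of a machine learning algorithm, and a local temp file stands in for distributed storage. It does not show how Hadoop or Spark are implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class IterationSketch {
    // One refinement step of Newton's method for sqrt(target).
    // Hypothetical stand-in for a single iteration of an ML algorithm.
    static double step(double estimate, double target) {
        return 0.5 * (estimate + target / estimate);
    }

    // Disk-based style: every iteration reads the previous output from a file
    // and writes its own output back before the next iteration starts.
    static double runViaDisk(double target, int iterations) {
        try {
            Path tmp = Files.createTempFile("iteration", ".txt");
            try {
                write(tmp, target);
                for (int i = 0; i < iterations; i++) {
                    write(tmp, step(read(tmp), target));
                }
                return read(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // In-memory style: the intermediate result stays in a variable between iterations.
    static double runInMemory(double target, int iterations) {
        double current = target;
        for (int i = 0; i < iterations; i++) {
            current = step(current, target);
        }
        return current;
    }

    private static void write(Path file, double value) throws IOException {
        Files.write(file, Double.toString(value).getBytes(StandardCharsets.UTF_8));
    }

    private static double read(Path file) throws IOException {
        return Double.parseDouble(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        double viaDisk = runViaDisk(2.0, 1000);
        long t1 = System.nanoTime();
        double inMemory = runInMemory(2.0, 1000);
        long t2 = System.nanoTime();
        System.out.println("via disk:  " + viaDisk + " in " + (t1 - t0) / 1000 + " us");
        System.out.println("in memory: " + inMemory + " in " + (t2 - t1) / 1000 + " us");
    }
}
```

Both methods compute the same value; only the place where intermediate results live differs. Timings depend on the machine and file system, so the sketch prints them rather than asserting any particular ratio.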

13. EXECUTION FLOW

14. CODE SIZE

15. Let’s try some examples…

16. STANDALONE (SCALA)

/* SimpleApp.scala */
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Should be some file on your system
    val logFile = "YOUR_SPARK_HOME/README.md"
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local")
    val sc = new SparkContext(conf)
    // Read the file into 2 partitions and cache it, since it is scanned twice
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    // Release the local Spark resources
    sc.stop()
  }
}

17. STANDALONE (JAVA)

/* SimpleApp.java */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    // Should be some file on your system
    String logFile = "YOUR_SPARK_HOME/README.md";
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Cache the file contents, since they are scanned twice
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    // Release the local Spark resources
    sc.stop();
  }
}

18. INTERACTIVE OPERATIONS ON MAPREDUCE

19. INTERACTIVE OPERATIONS ON SPARK

20. WHO IS USING APACHE SPARK
