Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Brandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Code examples available at https://github.com/br4nd0n/spark-streaming and https://github.com/br4nd0n/spark-viz
A demo and explanation of building a streaming application using Spark Streaming, Node.js and Redis with a real time visualization. Includes discussion of internals of Spark and Spark streaming including RDD partitioning and code and data distribution and cluster resource allocation.
Demystifying the Distributed Database LandscapeScyllaDB
What is the state of the art of high performance, distributed databases as we head into 2022, and which options are best suited for your own development projects?
The data-intensive applications leading this next tech cycle are typically powered by multiple types of databases and data stores — each satisfying specific needs and often interacting with a broader data ecosystem. Even the very notion of “a database” is evolving as new hardware architectures and methodologies allow for ever-greater capabilities and expectations for horizontal and vertical scalability, performance, and reliability.
In this webinar, ScyllaDB Director of Technology Advocacy Peter Corless will survey the current landscape of distributed database systems and highlight new directions in the industry.
This talk will cover different database and database-adjacent technologies as well as describe their appropriate use cases, patterns and antipatterns with a focus on:
- Distributed SQL, NewSQL and NoSQL
- In-memory datastores and caches
- Streaming technologies with persistent data storage
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...DataStax Academy
At Hulu, we deal with scaling our web services to meet the demands of an ever-growing number of users. During this talk, we will discuss our initial use case for Cassandra at Hulu: the video progress tracking service known as hugetop. While Cassandra provides a fantastic platform on which to build scalable applications, there are some dark corners of which to be cautious. We will provide a walkthrough of hugetop and some design decisions that went into the hugetop keyspace, our hardware choices, and our experiences operating Cassandra in a high-traffic environment.
EclairJS allows developers to use JavaScript and Node.js to interact with Apache Spark for large-scale data processing and analytics. It provides a Spark API for Node.js so that compute-intensive workloads can be handed off to Spark running in the backend. EclairJS also enables the use of JavaScript with Jupyter notebooks, so data engineers and web developers can experiment with Spark from within the browser using familiar JavaScript syntax.
Running Scylla on Kubernetes with Scylla OperatorScyllaDB
- The document discusses running Scylla, a NoSQL database, on Kubernetes using the Scylla Operator. The Operator lets Scylla leverage Kubernetes for workload management and provides a management layer for Scylla clusters.
- A demo shows deploying a Scylla cluster on Kubernetes with the Operator, stress testing the deployment, and performing common procedures like scaling up and upgrading Scylla versions.
- The Operator uses custom resources and controllers to map Scylla concepts like members, clusters, and datacenters to Kubernetes concepts like statefulsets and pods. This provides capabilities like topology changes and rolling upgrades.
This document summarizes a presentation given at Spark Summit 2016 about using Spark for real-time data processing and analytics at Uber and Marketplace Data. Some key points:
- Uber generates large amounts of data across its 70+ countries and 450+ cities that is used for real-time processing, analytics, and forecasting.
- Marketplace Data uses Spark for real-time data processing, analytics, and forecasting of Uber's data, which involves challenges like complex event processing, geo aggregation, and querying large and streaming datasets.
- Jupyter notebooks are used to empower users and data scientists to work with Spark in a flexible way, though challenges remain around reliability, freshness, and isolating queries.
A lot of data scientists use the Python library pandas for quick exploration of data. The most useful construct in pandas (based on R, I think) is the dataframe, which is a 2D array (aka matrix) with the option to “name” the columns (and rows). But pandas is not distributed, so there is a limit on the data size that can be explored.
Spark is a great map-reduce like framework that can handle very big data by using a shared nothing cluster of machines.
This work is an attempt to provide a pandas-like DSL on top of spark, so that data scientists familiar with pandas have a very gradual learning curve.
Getting started with SparkSQL - Desert Code Camp 2016clairvoyantllc
The document discusses Spark SQL, an Apache Spark module for structured data processing. It provides an agenda that covers Spark concepts, Spark SQL, the Catalyst optimizer, Project Tungsten, and a demo. Spark SQL allows users to perform SQL queries and use the DataFrame and Dataset APIs to interact with structured data in a Spark cluster.
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
Exploring a specific problem of ingesting petabytes of data in Uber and why they ended up building an analytical datastore from scratch using Spark. Then, discuss design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS.
https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/
Introduction to Streaming Distributed Processing with StormBrandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Introducing streaming data concepts, Storm cluster architecture, and Storm topology architecture, and demonstrating a working example of a WordCount topology for the SIGKDD Seattle chapter meetup.
Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/
This document discusses building domain-specific languages (DSLs) with Scala. It begins by introducing the speaker and their background. It then discusses why Scala is well-suited for DSLs, highlighting features like less red tape, static typing with type inference, and the ability to use the same language across different roles. The rest of the document covers considerations for good APIs, different types of DSLs (external vs internal), Scala language constructs useful for DSLs like apply and update methods, and lessons learned from building DSLs.
Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !Bruno Bonnin
This document discusses EclairJS, which allows using Apache Spark from Node.js. EclairJS implements Spark's core API and SQL API in JavaScript so that Spark code can be written and run from Node.js. It works by having the Node.js code execute JavaScript code on the JVM using Nashorn, which lets JavaScript developers leverage Spark. Examples show Spark operations like reading JSON data, transforming datasets, and running SQL queries from Node.js code. EclairJS can be deployed to run Spark jobs from various environments like Jupyter notebooks.
Reactive dashboard’s using apache sparkRahul Kumar
An Apache Spark tutorial talk. In this talk I explained how to start working with Apache Spark, the features of Apache Spark, and how to compose a data platform with Spark. The talk also covers reactive platforms, tools, and frameworks like Play and Akka.
Webinar how to build a highly available time series solution with kairos-db (1)Julia Angell
A highly available time-series solution requires an efficient tailored front-end framework and a backend database with a fast ingestion rate. In this webinar, you'll learn the steps for building an efficient TSDB solution with Scylla and KairosDB, get real-world use cases and metrics, plus considerations when choosing time series solutions.
The industry’s most performant NoSQL database just got better. Scylla Open Source 3.0 introduces much-anticipated new features for more efficient querying, reduced storage requirements, lower repair times, and better overall database performance. It includes production-ready capabilities beyond those available with Apache Cassandra or any other NoSQL database.
Join ScyllaDB CEO and co-founder Dor Laor and vice president of field engineering Glauber Costa for a technical overview of the new features and capabilities in Scylla Open Source 3.0, including:
- Materialized Views
- Global Secondary Indexes
- New storage format: SSTable 3.0
- Hinted Handoff
- Streaming Improvements
- Full scan improvements
Cassandra vs. ScyllaDB: Evolutionary DifferencesScyllaDB
Apache Cassandra and ScyllaDB are distributed databases capable of processing massive globally-distributed workloads. Both use the same CQL data query language. In this webinar you will learn:
- How are they architecturally similar and how are they different?
- What's the difference between them in performance and features?
- How do their software lifecycles and release cadences contrast?
How we can make use of Kubernetes as a resource manager for Spark. The pros and cons of Spark's resource managers are discussed in these slides and the associated tutorial.
Refer to this GitHub project for more details and code samples: https://github.com/haridas/hadoop-env
xPatterns is a big data analytics platform as a service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets, and a management console for building an ELT pipeline with data monitoring and quality gates; a data warehouse for ad-hoc and scheduled querying, analysis, model building, and experimentation; tools for exporting data to NoSQL and SolrCloud clusters for real-time access through low-latency/high-throughput APIs; as well as dashboard and visualization APIs/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications built on top of xPatterns for our largest customer, which runs xPatterns in production on a data warehouse of several hundred TB of medical, pharmacy, and lab data comprising tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS stack of Spark, Shark, Mesos, and Tachyon, with lessons learned and demos.
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Databricks
At Databricks, we have a unique view into hundreds of different companies using Apache Spark for development and production use cases, drawn from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common manageability, debugging, and visibility issues that our users run into. This talk will first show some representative examples of these common issues. Then, we will show what we have done and have been working on in Databricks to make Spark clusters easier to manage, monitor, and debug.
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit
This document provides lessons learned from using Apache Spark Streaming. It discusses key architecture decisions when using Spark Streaming vs Structured Streaming. It also outlines the top 5 support issues encountered, including type mismatches, errors finding leader offsets, issues with toDF functions, non-serializable tasks, and efficiently pushing JSON records. It provides solutions and references for each issue.
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaSpark Summit
Spark has deservedly been elected as the leading massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies, so their combination is one of the most common Big Data use cases. But what happens with security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which demand that several users interact with the same cluster concurrently, can we continue to ensure that our Big Data environments are still secure? In this lecture, Abel and Jorge will explain which adaptations of Spark's core they had to perform in order to guarantee the security of multiple concurrent users sharing a single Spark cluster, which can use any of its cluster managers, without degrading Spark's outstanding performance.
Apache spark on Hadoop Yarn Resource Managerharidasnss
How to configure Spark on an Apache Hadoop environment, and why we need that compared to the standalone cluster manager.
The slides also include a Docker-based demo to play with Hadoop and Spark on your laptop. See the demo code and other documentation here: https://github.com/haridas/hadoop-env
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformScyllaDB
Grab is one of the most frequently used mobile platforms in Southeast Asia, providing the everyday services that matter most to consumers. Its users commute, eat, arrange shopping deliveries, and pay with one e-wallet. Grab relies on the combination of Apache Kafka and Scylla for a very critical use case: instantaneously detecting fraudulent transactions that might occur across more than six million on-demand rides per day taking place in eight countries across Southeast Asia. Doing this successfully requires many things to happen in near-real time.
Join our webinar for this fascinating real-time big data use case, and learn the steps Grab took to optimize their fraud detection systems using the Scylla NoSQL database along with Apache Kafka.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
At the end of the day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data keep changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
This document provides an introduction to big data and Hadoop. It discusses how distributed systems can scale to handle large data volumes and discusses Hadoop's architecture. It also provides instructions on setting up a Hadoop cluster on a laptop and summarizes Hadoop's MapReduce programming model and YARN framework. Finally, it announces an upcoming workshop on Spark and Pyspark.
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLucidworks
Spark can be used to improve the performance of importing and searching large datasets in Solr. Data can be imported from HDFS files into Solr in parallel using Spark, speeding up the import process. Spark can also be used to stream data from Solr into RDDs for further processing, such as aggregation, filtering, and joining with other data. Techniques like column-based denormalization and compressed storage of event data in Solr documents can reduce data volume and improve import and query speeds by orders of magnitude.
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScyllaDB
This document summarizes the Scylla Operator for Kubernetes, including its developers, features, releases, and roadmap. Key points include:
- The Scylla Operator manages and automates tasks for Scylla clusters on Kubernetes.
- Features include seedless mode, security enhancements, performance tuning, and improved stability.
- It follows a rapid 6-week release cycle and supports the latest two releases.
- Future plans include additional performance optimizations, persistent storage support, TLS encryption, and multi-datacenter capabilities.
Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime
This document provides an introduction to Apache Spark presented by Maxime Dumas of Cloudera. It discusses:
1. What Cloudera does including distributing Hadoop components with enterprise tooling and support.
2. An overview of the Apache Hadoop ecosystem including why Hadoop is used for scalability, efficiency, and flexibility with large amounts of data.
3. An introduction to Apache Spark which improves on MapReduce by being faster, easier to use, and supporting more types of applications such as machine learning and graph processing. Spark can be 100x faster than MapReduce for certain applications.
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime
This document provides an introduction to Apache Spark, a general purpose cluster computing framework. It discusses how Spark improves upon MapReduce by offering better performance, support for iterative algorithms, and an easier developer experience. Spark retains MapReduce's advantages like scalability, fault tolerance, and data locality, but offers more by leveraging distributed memory and supporting directed acyclic graphs of tasks. Examples demonstrate how Spark can run programs up to 100x faster than Hadoop MapReduce and how it supports machine learning algorithms and streaming data analysis.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computing framework for several years, its second version takes it to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
Apache Spark is a fast and general engine for large-scale data processing. It was originally developed in 2009 and is now supported by Databricks. Spark provides APIs in Java, Scala, Python and can run on Hadoop, Mesos, standalone or in the cloud. It provides high-level APIs like Spark SQL, MLlib, GraphX and Spark Streaming for structured data processing, machine learning, graph analytics and stream processing.
This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.
Spark real world use cases and optimizationsGal Marder
This document provides an overview of Spark, its core abstraction of resilient distributed datasets (RDDs), and common transformations and actions. It discusses how Spark partitions and distributes data across a cluster, its lazy evaluation model, and the concept of dependencies between RDDs. Common use cases like word counting, bucketing user data, finding top results, and analytics reporting are demonstrated. Key topics covered include avoiding expensive shuffle operations, choosing optimal aggregation methods, and potentially caching data in memory.
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
This document summarizes Lightning-Fast Cluster Computing with Spark and Shark, a presentation about the Spark and Shark frameworks. Spark is an open-source cluster computing system that aims to provide fast, fault-tolerant processing of large datasets. It uses resilient distributed datasets (RDDs) and supports diverse workloads with sub-second latency. Shark is a system built on Spark that exposes the HiveQL query language and compiles queries down to Spark programs for faster, interactive analysis of large datasets.
Spark is an open-source cluster computing framework. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark supports batch, streaming, and interactive computations in a unified framework. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across a cluster for parallel processing. RDDs support transformations like map and filter that return new RDDs and actions that return values to the driver program.
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
Spark is a domain-specific language for working with collections that is implemented in Scala and runs on a cluster. While similar to Scala collections, Spark differs in that it is lazy and supports additional functionality for paired data. Scala can learn from Spark by adding views to make laziness clearer, caching for persistence, and pairwise operations. Types are important for Spark as they prevent logic errors and help with programming complex functional operations across a cluster.
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
Introduction to Spark - Phoenix Meetup 08-19-2014cdmaxime
This document provides an introduction to Apache Spark presented by Maxime Dumas. It discusses how Spark improves on MapReduce by offering better performance through leveraging distributed memory and supporting iterative algorithms. Spark retains MapReduce's advantages of scalability, fault-tolerance, and data locality while offering a more powerful and easier to use programming model. Examples demonstrate how tasks like word counting, logistic regression, and streaming data processing can be implemented on Spark. The document concludes by discussing Spark's integration with other Hadoop components and inviting attendees to try Spark.
Apache Spark is an open-source distributed processing engine that allows for iterative and interactive processing of big data. It provides a framework with a functional API to create distributed applications that run across a cluster. Spark contains various components, with the core providing the base functionality and other components adding features for specific purposes like SQL, streaming, and machine learning. The functional programming paradigm underlies Spark's API, with immutable data and functions without side effects. Spark uses the map-reduce model where transformations are lazy and actions trigger execution, similar to Hadoop but with improved performance through in-memory caching of data.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, with a few examples.
This document provides an overview of Apache Spark, including its architecture, usage model, and capabilities. The key points covered include Spark's use of resilient distributed datasets (RDDs) to perform parallel transformations efficiently across a cluster, its support for SQL, streaming, and machine learning workloads, and how it achieves faster performance than other frameworks like MapReduce through optimizations like caching data in memory. Examples of WordCount in Spark and MapReduce are also provided and compared.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
This document provides an agenda and summaries for a meetup on introducing DataFrames and R on Apache Spark. The agenda includes overviews of Apache Spark 1.3, DataFrames, R on Spark, and large scale machine learning on Spark. There will also be discussions on news items, contributions so far, what's new in Spark 1.3, more data source APIs, what DataFrames are, writing DataFrames, and DataFrames with RDDs and Parquet. Presentations will cover Spark components, an introduction to SparkR, and Spark machine learning experiences.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
Game Theory of Oligopolistic Pricing StrategiesHandaru Sakti
A way of finding a competitive price of a product or a service.
Reference: https://thismatter.com/economics/oligopoly-game-theory.htm
This document discusses innovation management and the importance of developing the right business model rather than prematurely focusing on solutions. It recommends taking a lean startup approach of quickly building prototypes, validating assumptions with customers, and iteratively improving the business model based on experimental feedback. The key steps are to fall in love with problems not solutions, identify the riskiest assumptions to test, and use traction not just build velocity to measure business model success.
The document discusses establishing a product design language system to improve collaboration, efficiency, and consistency across teams. The system aims to provide reusable components and a shared vocabulary both internally and externally through well-defined design patterns for research, UX, UI, testing, and roadmapping. Establishing such a system lays the foundation for building products in a repeatable, scalable way with minimum technical and design debt.
This document discusses big data and Hadoop. It defines big data as structured and unstructured data that is analyzed for predictive purposes. Hadoop is described as an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. Key components of Hadoop include HDFS for storage, MapReduce for processing, and YARN which allows multiple data processing engines like Spark to run on Hadoop clusters. The document also briefly outlines other big data tools that can be used with Hadoop like Flume, Sqoop, and Spark.
The document outlines the IES Triangle Principle for building an everlasting company. It discusses focusing efforts on insights, efficiency, and scalability (IES) as the three key pillars. Under each pillar, it lists related areas of focus such as design research, reallocation, and roadmap/architecture. Maintaining alignment between these pillars over time as the business evolves is presented as important for sustainable growth.
The document outlines the Business Model Canvas template, which is used to describe the various components of a business model. It provides Taobao as an example and walks through each element of the canvas: value propositions, customer segments, channels, customer relationships, revenue streams, key resources, key activities, key partnerships, cost structure. It then instructs the reader to fill out their own blank canvas using their own business information.
Transition management of product as platformHandaru Sakti
This document discusses transitioning a monolithic technology system to microservices as part of adopting a product-as-a-platform approach. It notes that as businesses and technology systems grow, scalability becomes more strategic. Adopting a product-as-a-platform brings pros like cross-selling and one-stop shopping but also cons like increased technical and design debt. It recommends having an organizational culture, product roadmap, product architecture, work breakdown structure, and using a strangler application approach to gradually transition functionality.
This document discusses the Content Thinking Principle which focuses on serious and thoughtful daily thinking. It encourages moving beyond superficial discussions to consider topics in more depth through careful analysis and consideration.
This document discusses how to be a productive content maker in the mobile era. It suggests focusing on genuine creation rather than stealing content and being a curator who builds upon others' work rather than solely an artist. It also recommends taking advantage of constant connectivity to work continuously and draw inspiration from abundant information. The key principles involve feeling others' vulnerabilities, empathizing with different perspectives, connecting ideas, and iterating content by prototyping and testing. The overall message is that one should tell stories by applying these principles.
In 2016, mobile app trends included apps serving as platforms for connecting supply and demand through dispatcher frameworks like Uber; cognitive design principles focused on human behavior and usability; and personalization driven by user context and IoT data from wearables and sensors. Backend trends emphasized scalability through concurrency, microservices, and fast/secure services compared to legacy systems. Overall, simplicity and usability remained important through following the KISS principle.
Android is an open-source operating system designed for touchscreen devices like smartphones and tablets. It is based on the Linux kernel and maintained by Google and the Android Open Source Project. Google purchased Android Inc. in 2005 and unveiled Android in 2007 with the Open Handset Alliance to advance open standards for mobile devices. The first Android phone, the HTC Dream, was introduced in 2008. Career opportunities in the growing Android market include positions as coders, developers, engineers, designers, marketers, and support staff working individually or for companies to develop Android operating systems, applications, devices, and services.
Loader allows loading of data asynchronously in an activity or fragment. The LoaderManager initializes and manages Loader objects to perform loading. Loaders automatically reconnect to previously loaded data after configuration changes. Common loaders include AsyncTaskLoader and CursorLoader. To implement, an activity or fragment gets the LoaderManager and initializes a loader, providing a LoaderCallbacks implementation to receive loading callbacks.
The Android Support Package provides backward compatibility for Android features by including support libraries for using newer APIs on older Android versions. It allows features like fragments, viewpagers, notifications, sharing, and loaders to work across Android versions back to API level 4. Developers can import the support library to gain access to newer features without requiring the original API level. The support package is updated regularly and is available to download through the Android SDK manager.
Mieke Jans is a Manager at Deloitte Analytics Belgium. She learned about process mining from her PhD supervisor while she was collaborating with a large SAP-using company for her dissertation.
Mieke extended her research topic to investigate the data availability of process mining data in SAP and the new analysis possibilities that emerge from it. It took her 8-9 months to find the right data and prepare it for her process mining analysis. She needed insights from both process owners and IT experts. For example, one person knew exactly how the procurement process took place at the front end of SAP, and another person helped her with the structure of the SAP-tables. She then combined the knowledge of these different persons.
Computer organization and assembly language: covers types of programming languages along with descriptions of variables and arrays. https://www.nfciet.edu.pk/
By James Francis, CEO of Paradigm Asset Management
In the landscape of urban safety innovation, Mt. Vernon is emerging as a compelling case study for neighboring Westchester County cities. The municipality’s recently launched Public Safety Camera Program not only represents a significant advancement in community protection but also offers valuable insights for New Rochelle and White Plains as they consider their own safety infrastructure enhancements.
This comprehensive Data Science course is designed to equip learners with the essential skills and knowledge required to analyze, interpret, and visualize complex data. Covering both theoretical concepts and practical applications, the course introduces tools and techniques used in the data science field, such as Python programming, data wrangling, statistical analysis, machine learning, and data visualization.
5. What We Need?
• Spark as the data processing engine in the cluster; originally written in Scala, which allows concise function syntax and interactive use
• Mesos as cluster manager
• ZooKeeper as highly reliable distributed coordinator
• HDFS as distributed storage
6. What We Need?
• Pure functions
• Atomic operations
• Parallel patterns or skeletons
• Lightweight algorithms
"The only thing that works for parallel programming is functional programming."
-- Carnegie Mellon Professor Bob Harper
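To make the quote concrete, here is a minimal sketch (an addition, not part of the original slide) contrasting a pure function, which can safely run in parallel, with a side-effecting one, which cannot:

// Pure: the result depends only on the input, so calls can run in any order or in parallel.
def square(x: Int): Int = x * x

// Impure: mutates shared state, so parallel calls race on `total`.
var total = 0
def addToTotal(x: Int): Int = { total += x; total }

// The pure version composes safely, e.g. with Scala's parallel collections:
(1 to 10).par.map(square).sum // always 385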
8. FP Quick Tour In Scala
• Basic transformations:
var array = new Array[Int](10)
var list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
• Indexing:
array(0) = 1
println(list(0))
• Anonymous functions:
val multiply = (x: Int, y: Int) => x * y
val procedure = { x: Int =>
  println("Hello, " + x)
  println(x * 10)
}
9. FP Quick Tour In Scala
• Scala closure syntax:
(x: Int) => x * 10 // full version
x => x * 10 // type inference
_ * 10 // underscore syntax
x => { // body is a block of code
  val y = 10
  x * y
}
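As a quick check (an illustrative addition, not from the original deck), all of these forms behave identically when passed to a higher-order function:

List(1, 2, 3).map((x: Int) => x * 10) // List(10, 20, 30)
List(1, 2, 3).map(x => x * 10) // List(10, 20, 30)
List(1, 2, 3).map(_ * 10) // List(10, 20, 30)
List(1, 2, 3).map { x =>
  val y = 10
  x * y
} // List(10, 20, 30)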
10. FP Quick Tour In Scala
• Processing collections:
var list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
list.foreach(x => println(x))
list.map(_ * 10)
list.filter(x => x % 2 == 0)
list.reduce((x, y) => x + y)
list.reduce(_ + _)
def f(x: Int) = List(x - 1, x, x + 1)
list.map(x => f(x))
list.map(f(_))
list.flatMap(x => f(x))
list.map(x => f(x)).reduce(_ ++ _)
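One point worth spelling out (added for clarity, not on the original slide): map with a list-producing function nests the results, while flatMap flattens them, which is why the last two lines above are equivalent:

def f(x: Int) = List(x - 1, x, x + 1)
List(1, 2).map(f) // List(List(0, 1, 2), List(1, 2, 3))
List(1, 2).flatMap(f) // List(0, 1, 2, 1, 2, 3)
List(1, 2).map(f).reduce(_ ++ _) // List(0, 1, 2, 1, 2, 3) -- same as flatMap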
11. Spark Quick Tour
• Spark context:
  • Entry point to Spark functionality
  • In spark-shell, created as sc
  • In a standalone Spark program, we must create it ourselves
• Resilient distributed datasets (RDDs):
  • A distributed memory abstraction
  • A logically centralized entity, but physically partitioned across multiple machines inside a cluster based on some notion of key
  • Immutable
  • Automatically rebuilt on failure
  • Evicted from memory based on an LRU (Least Recently Used) algorithm
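As a sketch of the "standalone program" case (an illustrative addition; the app name, master URL, and HDFS path are placeholder assumptions), creating the context yourself with the classic RDD API looks like this:

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")
      // "local[*]" for testing; on a cluster this could be a Mesos master,
      // e.g. "mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos"
      .setMaster("local[*]")
    val sc = new SparkContext(conf) // what spark-shell creates for you as sc
    val lines = sc.textFile("hdfs://localhost/test/tobe.txt")
    println(lines.count())
    sc.stop()
  }
}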
14. Spark Quick Tour
• Transformations:
  • Lazy operations that build RDDs from other RDDs
  • Narrow transformations (involve no data shuffling):
    • map
    • flatMap
    • filter
  • Wide transformations (involve data shuffling):
    • sortByKey
    • reduceByKey
    • groupByKey
• Actions:
  • Return a result or write it to storage
  • collect
  • count
  • take(n)
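A small illustration of the laziness point (added here; not part of the original slide): transformations only record lineage, and nothing executes until an action is called:

val data = sc.parallelize(1 to 1000000) // no job runs yet
val evens = data.filter(_ % 2 == 0) // narrow transformation, still lazy
val keyed = evens.map(x => (x % 10, x)) // still lazy
val sums = keyed.reduceByKey(_ + _) // wide transformation, will require a shuffle
sums.collect() // action: only now does Spark build and execute the DAG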
16. Spark Quick Tour
• Creating RDDs:
val numbers = sc.parallelize(List(1, 2, 3, 4, 5)) // turn a local collection into a (base) RDD
val textFile = sc.textFile("hdfs://localhost/test/tobe.txt")
val textFile = sc.textFile("hdfs://localhost/test/*.txt")
• Basic transformations:
val squares = numbers.map(x => x * x) // transformed RDD
val evens = squares.filter(_ < 9)
val mapto = numbers.flatMap(x => 1 to x)
val words = textFile.flatMap(_.split(" ")).cache()
17. Spark Quick Tour
• Basic actions:
words.collect()
words.take(5)
words.count()
words.reduce(_ + _)
words.filter(_ == "be").count() // note the influence of cache(): these repeated scans reuse the cached RDD
words.filter(_ == "or").count()
words.saveAsTextFile("hdfs://localhost/test/result")
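To make the effect of cache() visible (an illustrative addition; timings are machine-dependent), compare the first action, which reads from HDFS and populates the cache, with a second action served from memory:

def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

timed("first count (reads HDFS, fills cache)") { words.count() }
timed("second count (served from cache)") { words.count() }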
18. Spark Quick Tour
• Pair syntax:
val pair = (a, b)
• Accessing pair elements:
pair._1
pair._2
• Key-value operations:
val pets = sc.parallelize(List(("cat", 1), ("dog", 2), ("cat", 3)))
pets.reduceByKey(_ + _)
pets.groupByKey()
pets.sortByKey()
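As a follow-up sketch (not in the original deck), these key-value operations compose into the common "average per key" pattern; reduceByKey is generally preferred over groupByKey because it combines values within each partition before shuffling:

val avgByKey = pets
  .mapValues(v => (v, 1)) // pair each value with a count of 1
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // sum values and counts per key
  .mapValues { case (sum, count) => sum.toDouble / count }
avgByKey.collect() // Array((dog,2.0), (cat,2.0))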
19. Hello World
val logFile = "hdfs://localhost/test/tobe.txt"
val logData = sc.textFile(logFile).cache()
val wordCount = logData.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
wordCount.saveAsTextFile("hdfs://localhost/wordcount/result")
sc.stop()
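As a small extension of the word count (illustrative only, run before sc.stop() and using the same wordCount RDD as above), the most frequent words can be pulled out by swapping each pair and sorting by key:

val top10 = wordCount
  .map { case (word, count) => (count, word) } // swap so the count becomes the key
  .sortByKey(ascending = false)
  .take(10)
top10.foreach { case (count, word) => println(s"$word: $count") }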