Workshop on Parallel, Cluster and Cloud Computing on Multi-core & GPU
(PCCCMG - 2015)
Workshop Conducted by
Computer Society of India
In Association with
Dept. of CSE, VNIT and Persistent Systems Ltd, Nagpur
Workshop Dates: 4th to 6th September 2015
This presentation briefly discusses the following topics:
Data Analytics Lifecycle
Importance of Data Analytics Lifecycle
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Data Analytics Lifecycle Example
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
Cloud architectures can be thought of in layers, with each layer providing services to the next. There are three main layers: virtualization of resources, services layer, and server management processes. Virtualization abstracts hardware and provides flexibility. The services layer provides OS and application services. Management processes support service delivery through image management, deployment, scheduling, reporting, etc. When providing compute and storage services, considerations include hardware selection, virtualization, failover/redundancy, and reporting. Network services require capacity planning, redundancy, and reporting.
Relational databases vs Non-relational databases by James Serra
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Welcome to Supervised Machine Learning and Data Science.
Algorithms for building models: Support Vector Machines.
Classification algorithm explanation and code in Python (SVM).
Data Mining: Mining associations and correlations by Datamining Tools
Market basket analysis examines customer purchasing patterns to determine which items are commonly bought together. This can help retailers with marketing strategies like product bundling and complementary product placement. Association rule mining is a two-step process that first finds frequent item sets that occur together above a minimum support threshold, and then generates strong association rules from these frequent item sets based on minimum support and confidence. Various techniques can improve the efficiency of the Apriori algorithm for mining association rules, such as hashing, transaction reduction, partitioning, sampling, and dynamic item-set counting. Pruning strategies like item merging, sub-item-set pruning, and item skipping can also enhance efficiency. Constraint-based mining allows users to specify constraints on the type of
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
Big data is high-volume, high-velocity, and high-variety data that is difficult to process using traditional data management tools. It is characterized by 3Vs: volume of data is growing exponentially, velocity as data streams in real-time, and variety as data comes from many different sources and formats. The document discusses big data analytics techniques to gain insights from large and complex datasets and provides examples of big data sources and applications.
Mining Frequent Patterns, Association and Correlations by Justin Cletus
This document summarizes Chapter 6 of the book "Data Mining: Concepts and Techniques" which discusses frequent pattern mining. It introduces basic concepts like frequent itemsets and association rules. It then describes several scalable algorithms for mining frequent itemsets, including Apriori, FP-Growth, and ECLAT. It also discusses optimizations to Apriori like partitioning the database and techniques to reduce the number of candidates and database scans.
The document discusses frequent pattern mining and the Apriori algorithm. It introduces frequent patterns as frequently occurring sets of items in transaction data. The Apriori algorithm is described as a seminal method for mining frequent itemsets via multiple passes over the data, generating candidate itemsets and pruning those that are not frequent. Challenges with Apriori include multiple database scans and large number of candidate sets generated.
The document discusses different types of parallelism that can be utilized in parallel database systems: I/O parallelism to retrieve relations from multiple disks in parallel, interquery parallelism to run different queries simultaneously, intraquery parallelism to parallelize operations within a single query, and intraoperation parallelism to parallelize individual operations like sort and join. It also covers techniques for partitioning relations across disks and handling skew to balance the workload.
The document discusses various data reduction strategies including attribute subset selection, numerosity reduction, and dimensionality reduction. Attribute subset selection aims to select a minimal set of important attributes. Numerosity reduction techniques like regression, log-linear models, histograms, clustering, and sampling can reduce data volume by finding alternative representations like model parameters or cluster centroids. Dimensionality reduction techniques include discrete wavelet transformation and principal component analysis, which transform high-dimensional data into a lower-dimensional representation.
Having trouble distinguishing Big Data, Hadoop & NoSQL, or finding the connection among them? This slide deck from the Savvycom team can definitely help you.
Enjoy reading!
Overfitting and underfitting are modeling errors related to how well a model fits training data. Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new data. Underfitting occurs when a model is too simple and does not fit the training data well. The bias-variance tradeoff aims to balance these issues by finding a model complexity that minimizes total error.
MapReduce is a programming framework that allows for distributed and parallel processing of large datasets. It consists of a map step that processes key-value pairs in parallel, and a reduce step that aggregates the outputs of the map step. As an example, a word counting problem is presented where words are counted by mapping each word to a key-value pair of the word and 1, and then reducing by summing the counts of each unique word. MapReduce jobs are executed on a cluster in a reliable way using YARN to schedule tasks across nodes, restarting failed tasks when needed.
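As an illustration only (not part of the original summary), the same word-count flow can be sketched with Spark's Scala API, where flatMap/map play the role of the map step and reduceByKey plays the role of the reduce step; the input path and application name are made-up placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical sketch of the word-count example described above.
    val sc = new SparkContext(new SparkConf().setAppName("WordCountSketch").setMaster("local[*]"))
    val counts = sc.textFile("input.txt")            // assumed input file
      .flatMap(line => line.split("\\s+"))           // "map" step: split lines into words
      .map(word => (word, 1))                        // emit (word, 1) pairs
      .reduceByKey(_ + _)                            // "reduce" step: sum the counts per word
    counts.collect().foreach(println)
    sc.stop()
  }
}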
Hive is a data warehouse infrastructure tool that allows users to query and analyze large datasets stored in Hadoop. It uses a SQL-like language called HiveQL to process structured data stored in HDFS. Hive stores metadata about the schema in a database and processes data into HDFS. It provides a familiar interface for querying large datasets using SQL-like queries and scales easily to large datasets.
Support Vector Machine ppt presentation by AyanaRukasar
Support vector machines (SVM) is a supervised machine learning algorithm used for both classification and regression problems. However, it is primarily used for classification. The goal of SVM is to create the best decision boundary, known as a hyperplane, that separates clusters of data points. It chooses extreme data points as support vectors to define the hyperplane. SVM is effective for problems that are not linearly separable by transforming them into higher dimensional spaces. It works well when there is a clear margin of separation between classes and is effective for high dimensional data. An example use case in Python is presented.
This document discusses data cubes, which are multidimensional data structures used in online analytical processing (OLAP) to enable fast retrieval of data organized by dimensions and measures. Data cubes can have 2-3 dimensions or more and contain measures like costs or units. Key concepts are slicing to select a 2D page, dicing to define a subcube, and rotating to change dimensional orientation. Data cubes represent categories through dimensions and levels, and store facts as measures in cells. They can be pre-computed fully, not at all, or partially to balance query speed and memory usage. Totals can also be stored to improve performance of aggregate queries.
Introduction to Hadoop and Hadoop component by rebeccatho
This document provides an introduction to Apache Hadoop, which is an open-source software framework for distributed storage and processing of large datasets. It discusses Hadoop's main components of MapReduce and HDFS. MapReduce is a programming model for processing large datasets in a distributed manner, while HDFS provides distributed, fault-tolerant storage. Hadoop runs on commodity computer clusters and can scale to thousands of nodes.
Cloud computing allows users to access virtual hardware, software, platforms, and services on an as-needed basis without large upfront costs or commitments. This transforms computing into a utility that can be easily provisioned and composed. The long-term vision is for an open global marketplace where IT services are freely traded like utilities, lowering barriers and allowing flexible access to resources and software for all users.
This document provides an overview of YARN (Yet Another Resource Negotiator), the resource management system for Hadoop. It describes the key components of YARN including the Resource Manager, Node Manager, and Application Master. The Resource Manager tracks cluster resources and schedules applications, while Node Managers monitor nodes and containers. Application Masters communicate with the Resource Manager to manage applications. YARN allows Hadoop to run multiple applications like Spark and HBase, improves on MapReduce scheduling, and transforms Hadoop into a distributed operating system for big data processing.
The document discusses various clustering approaches including partitioning, hierarchical, density-based, grid-based, model-based, frequent pattern-based, and constraint-based methods. It focuses on partitioning methods such as k-means and k-medoids clustering. K-means clustering aims to partition objects into k clusters by minimizing total intra-cluster variance, representing each cluster by its centroid. K-medoids clustering is a more robust variant that represents each cluster by its medoid or most centrally located object. The document also covers algorithms for implementing k-means and k-medoids clustering.
This course is all about data mining and how we obtain optimized results; it covers all the types of data mining techniques and how they are used.
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
This document provides an agenda and summaries for a meetup on introducing DataFrames and R on Apache Spark. The agenda includes overviews of Apache Spark 1.3, DataFrames, R on Spark, and large scale machine learning on Spark. There will also be discussions on news items, contributions so far, what's new in Spark 1.3, more data source APIs, what DataFrames are, writing DataFrames, and DataFrames with RDDs and Parquet. Presentations will cover Spark components, an introduction to SparkR, and Spark machine learning experiences.
This document provides an overview of Apache Spark, including:
- Apache Spark is a next generation data processing engine for Hadoop that allows for fast in-memory processing of huge distributed and heterogeneous datasets.
- Spark offers tools for data science and components for data products and can be used for tasks like machine learning, graph processing, and streaming data analysis.
- Spark improves on MapReduce by being faster, allowing parallel processing, and supporting interactive queries. It works on both standalone clusters and Hadoop clusters.
The document provides an overview of Apache Spark, including what it is, its ecosystem, features, and architecture. Some key points:
- Apache Spark is an open-source cluster computing framework for large-scale data processing. It is up to 100x faster than Hadoop for iterative/interactive algorithms.
- Spark features include its RDD abstraction, lazy evaluation, and use of DAGs to optimize performance. It supports Scala, Java, Python, and R.
- The Spark ecosystem includes tools like Spark SQL, MLlib, GraphX, and Spark Streaming. It can run on Hadoop YARN, Mesos, or in standalone mode.
- Spark's architecture includes the SparkContext,
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014 by cdmaxime
This document provides an introduction to Apache Spark presented by Maxime Dumas of Cloudera. It discusses Spark's advantages over MapReduce like leveraging distributed memory for better performance and supporting iterative algorithms. Spark concepts like RDDs, transformations and actions are explained. Examples shown include word count, logistic regression, and Spark Streaming. The presentation concludes with a discussion of SQL on Spark and a demo.
Apache Spark is an open-source distributed processing engine that is up to 100 times faster than Hadoop for processing data stored in memory and 10 times faster for data stored on disk. It provides high-level APIs in Java, Scala, Python and SQL and supports batch processing, streaming, and machine learning. Spark runs on Hadoop, Mesos, Kubernetes or standalone and can access diverse data sources using its core abstraction called resilient distributed datasets (RDDs).
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python by Christian Perone
This document provides an introduction to Apache Spark and collaborative filtering. It discusses big data and the limitations of MapReduce, then introduces Apache Spark including Resilient Distributed Datasets (RDDs), transformations, actions, and DataFrames. It also covers Spark Machine Learning (ML) libraries and algorithms such as classification, regression, clustering, and collaborative filtering.
The Future of Hadoop: A deeper look at Apache Spark by Cloudera, Inc.
Jai Ranganathan, Senior Director of Product Management, discusses why Spark has experienced such wide adoption and provide a technical deep dive into the architecture. Additionally, he presents some use cases in production today. Finally, he shares our vision for the Hadoop ecosystem and why we believe Spark is the successor to MapReduce for Hadoop data processing.
Demi Ben Ari - Apache Spark 101 - First Steps into distributed computing:
The world has changed: having one huge server won't do the job; the ability to scale out will be your savior. Apache Spark is a fast and general engine for big data processing, with streaming, SQL, machine learning and graph processing. The talk shows the basics of Apache Spark and distributed computing.
Demi is a Software engineer, Entrepreneur and an International Tech Speaker.
Demi has over 10 years of experience in building various systems, both near-real-time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community and Google Developer Group Cloud.
Big Data Expert, but interested in all kinds of technologies, from front-end to backend, whatever moves data around.
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ... by Chetan Khatri
This document summarizes a presentation about scaling terabytes of data with Apache Spark and Scala. The key points are:
1) The presenter discusses how to use Apache Spark and Scala to process large scale data in a distributed manner across clusters. Spark operations like RDDs, DataFrames and Datasets are covered.
2) A case study is presented about reengineering a data processing platform for a retail business to improve performance. Changes included parallelizing jobs, tuning Spark hyperparameters, and building a fast data architecture using Spark, Kafka and data lakes.
3) Performance was improved through techniques like dynamic resource allocation in YARN, reducing memory and cores per executor to better utilize cluster resources, and processing data
Getting Started with Apache Spark (Scala) by Knoldus Inc.
In this session, we are going to cover Apache Spark, the architecture of Apache Spark, Data Lineage, Direct Acyclic Graph(DAG), and many more concepts. Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin... by DB Tsai
This document discusses machine learning techniques for large-scale datasets using Apache Spark. It provides an overview of Spark's machine learning library (MLlib), describing algorithms like logistic regression, linear regression, collaborative filtering, and clustering. It also compares Spark to traditional Hadoop MapReduce, highlighting how Spark leverages caching and iterative algorithms to enable faster machine learning model training.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
This document discusses efficient data mining solutions using Hadoop, Cassandra, and Spark. It describes Cassandra as a fast, robust, and efficient key-value database but notes it has limitations for certain queries. Spark is presented as an alternative to Hadoop MapReduce that can be 100 times faster for interactive algorithms and data mining. The document demonstrates how Spark can integrate with Cassandra to allow distributed data processing over Cassandra data without needing to clone the data or use other databases. Future extensions are proposed to directly access Cassandra's SSTable files from Spark and extend CQL3 to leverage Spark.
Unit II Real Time Data Processing tools.pptx by Rahul Borate
Apache Spark is a lightning-fast cluster computing framework designed for real-time processing. It overcomes limitations of Hadoop by running 100 times faster in memory and 10 times faster on disk. Spark uses resilient distributed datasets (RDDs) that allow data to be partitioned across clusters and cached in memory for faster processing.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Mieke Jans is a Manager at Deloitte Analytics Belgium. She learned about process mining from her PhD supervisor while she was collaborating with a large SAP-using company for her dissertation.
Mieke extended her research topic to investigate the data availability of process mining data in SAP and the new analysis possibilities that emerge from it. It took her 8-9 months to find the right data and prepare it for her process mining analysis. She needed insights from both process owners and IT experts. For example, one person knew exactly how the procurement process took place at the front end of SAP, and another person helped her with the structure of the SAP-tables. She then combined the knowledge of these different persons.
1. Workshop on Parallel, Cluster and
Cloud Computing on Multi-core & GPU
(PCCCMG - 2015)
Workshop Conducted by
Computer Society of India In Association with
Dept. of CSE, VNIT and
Persistent Systems Ltd, Nagpur
4th – 6th Sep’15
2. Big-Data Cluster Computing
Advanced tools & technologies
Jagadeesan A S
Software Engineer
Persistent Systems Limited
www.github.com/jagadeesanas2
www.linkedin.com/in/jagadeesanas2
3. Content
Overview of Big Data
• Data clustering concepts
• Clustering vs Classification
• Data Journey
Advanced tools and technologies
• Apache Hadoop
• Apache Spark
Future of analytics
• Demo - Spark RDD in IntelliJ IDEA
4. Big-Data is similar to Small-Data, but bigger in size and complexity.
What is Big-Data ?
Definition from Wikipedia:
Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools
or traditional data processing applications.
9. What is a Cluster ?
A group of the same or similar elements gathered or occurring closely together.
Clustering is key to the Big Data problem
• Not feasible to “label” a large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data
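As a hedged sketch (not part of the original slides), clustering a distributed data set can be done with Spark MLlib's K-Means; the toy 2-D points and the choice of 2 clusters below are invented purely for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ClusteringSketch").setMaster("local[*]"))
    // Toy 2-D points, invented for illustration
    val points = sc.parallelize(Seq(
      Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 1.8),
      Vectors.dense(8.0, 8.0), Vectors.dense(8.5, 9.0)))
    val model = KMeans.train(points, 2, 20)   // group the points into k = 2 clusters, 20 iterations max
    model.clusterCenters.foreach(println)     // print the learned cluster centroids
    sc.stop()
  }
}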
16. Large-Scale Data Analytics
MapReduce computing paradigm vs. traditional database systems
Many enterprises have turned to Hadoop, especially for applications generating big data: Web applications, social networks, scientific applications.
17. APACHE HADOOP (Disk Based Computing)
An open-source software framework, written in Java, for distributed storage and distributed processing.
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware: a large number of low-end, cheap machines working in parallel
to solve a computing problem, rather than a small number of high-end, expensive machines
18. Hadoop cluster architecture
A Hadoop cluster can be divided into two abstract entities:
MapReduce engine + distributed file system (HDFS) = Hadoop cluster
19. What is SPARK
Why SPARK
How to configure SPARK
APACHE SPARK
Open-source cluster computing framework
20. APACHE SPARK (Memory Based Computing)
An open-source cluster computing framework, written mainly in Scala, for large-scale distributed data processing.
• Fast cluster computing system for large-scale data processing
compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Up to 100× faster
Often 2-10× less code
21. Spark Overview
Spark Shell
• Interactive shell for learning or data exploration
• Python or Scala
• Provides a preconfigured Spark context called sc
Spark applications
• For large-scale data processing
• Python, Java, Scala and R
• Every Spark application requires a SparkContext; it is the main entry point to the Spark API
(Screenshots: Scala interactive shell and Python interactive shell)
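A minimal sketch, assuming a standalone application rather than the shell: the application builds its own SparkContext, whereas spark-shell already provides one as sc. The application name and master URL below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object MyFirstSparkApp {
  def main(args: Array[String]): Unit = {
    // In spark-shell the context 'sc' is preconfigured; an application creates it explicitly.
    val conf = new SparkConf().setAppName("MyFirstSparkApp").setMaster("local[*]")  // placeholder master
    val sc = new SparkContext(conf)
    println("Default parallelism: " + sc.defaultParallelism)
    sc.stop()
  }
}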
22. Spark Overview
Resilient distributed datasets (RDDs)
Immutable collections of objects spread across a cluster
Built through parallel transformations (map, filter, etc)
Automatically rebuilt on failure
Controllable persistence (e.g. caching in RAM) for reuse
Shared variables that can be used in parallel operations
Work with distributed collections as we would with local ones
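For illustration (a sketch, not from the slides, run inside spark-shell where sc is already defined), an RDD can be built from a local collection, transformed in parallel, and cached in memory for reuse:

// Assumes the preconfigured SparkContext 'sc' from spark-shell.
val numbers = sc.parallelize(1 to 1000000)       // distribute a local collection across the cluster
val squares = numbers.map(n => n.toLong * n)     // parallel transformation
squares.cache()                                  // controllable persistence: keep the result in RAM
println(squares.reduce(_ + _))                   // first action computes and caches the RDD
println(squares.filter(_ % 2 == 0).count())      // second action reuses the cached data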
23. Resilient Distributed Datasets (RDDs)
Two types of RDD operation
• Transformation – defines a new RDD based on the current one
Examples: map, filter, reduceByKey
• Action – returns a value to the driver
Examples: count, take(n)
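A small shell sketch (assuming the sc provided by spark-shell) showing the difference: transformations only define new RDDs, while actions return values to the driver.

val nums   = sc.parallelize(1 to 100)
val evens  = nums.filter(_ % 2 == 0)    // transformation: defines a new RDD, nothing runs yet
val scaled = evens.map(_ * 10)          // transformation
println(scaled.count())                 // action: triggers the computation, returns 50
println(scaled.take(3).mkString(", "))  // action: returns the first elements (20, 40, 60)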
24. Resilient Distributed Datasets (RDDs)
File: movie.txt
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
RDD: mydata (the same four lines, loaded with one line per RDD element)
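In the shell, loading the file into the RDD shown above is a single call (the path to movie.txt is a placeholder):

// Assumes spark-shell's sc.
val mydata = sc.textFile("movie.txt")   // each line of the file becomes one element of the RDD
mydata.collect().foreach(println)       // action: bring the lines back to the driver and print them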
25. Resilient Distributed Datasets (RDDs)
map and filter transformations
Original lines (mydata):
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
After map (upper-case every line):
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I HAD RATHER SEE THAN BE ONE.
After filter (keep only lines starting with 'I'):
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
I HAD RATHER SEE THAN BE ONE.
Python: mydata.map(lambda line: line.upper()).filter(lambda line: line.startswith('I'))
Scala: mydata.map(line => line.toUpperCase).filter(line => line.startsWith("I"))
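Put together as a runnable Scala sketch (assuming spark-shell and the movie.txt file from the previous slide), the two transformations produce exactly the upper-cased, 'I'-filtered lines shown above:

val mydata = sc.textFile("movie.txt")
val result = mydata
  .map(line => line.toUpperCase)           // upper-case every line
  .filter(line => line.startsWith("I"))    // keep only the lines beginning with "I"
result.collect().foreach(println)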
26. Spark Stack
• Spark SQL:
--- For SQL and structured data processing
• Spark Streaming:
--- Stream processing of live data streams
• MLlib:
--- For machine learning algorithms
• GraphX:
--- Graph processing
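As a small, hedged illustration of the Spark SQL module using the Spark 1.x SQLContext API current at the time of this workshop (the Person case class, names and ages are invented):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)        // sc from spark-shell
import sqlContext.implicits._
val people = sc.parallelize(Seq(Person("Ana", 34), Person("Raj", 28))).toDF()
people.registerTempTable("people")         // expose the DataFrame to SQL queries
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()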
27. Why Spark ?
Core engine with SQL, Streaming, machine learning and graph processing
modules.
Can run today’s most advanced algorithms.
An alternative to MapReduce for certain applications.
APIs in Java, Scala and Python
Interactive shells in Scala and Python
Runs on YARN, Mesos and in standalone mode.
28. Spark’s major use cases over Hadoop
• Iterative Algorithms in Machine Learning
• Interactive Data Mining and Data Processing
• Spark is a fully Apache Hive-compatible data warehousing system that
can run 100x faster than Hive.
• Stream processing: Log processing and Fraud detection in live streams
for alerts, aggregates and analysis
• Sensor data processing: Where data is fetched and joined from
multiple sources, in-memory dataset really helpful as they are easy and
fast to process.
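For the stream-processing use case, a hedged sketch of a Spark Streaming word count over a socket source (the host, port and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))        // 5-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)     // placeholder live stream source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                          // alert/aggregate logic would hook in here
    ssc.start()
    ssc.awaitTermination()
  }
}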
33. Example : Page Rank
A way of analyzing websites based on their link relationships
• Good example of a more complex algorithm
• Multiple stages of map & reduce
• Benefits from Spark’s in-memory caching
• Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
• Links from many pages → high rank
• Link from a high-rank page → high rank
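A hedged sketch of PageRank on Spark, close to the classic example in the Spark documentation; the links.txt input (one "page neighbour" pair per line) and the 10 iterations are assumptions for illustration.

// Assumes spark-shell's sc and an input file of "page neighbour" pairs.
val links = sc.textFile("links.txt")
  .map(_.split("\\s+")).map(parts => (parts(0), parts(1)))
  .groupByKey()
  .cache()                                          // reused every iteration: in-memory caching pays off here
var ranks = links.mapValues(_ => 1.0)               // start every page with rank 1.0
for (_ <- 1 to 10) {                                // multiple map & reduce stages over the same data
  val contribs = links.join(ranks).values.flatMap {
    case (neighbours, rank) => neighbours.map(n => (n, rank / neighbours.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.collect().foreach(println)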
35. Other Iterative Algorithms
Time per iteration (seconds), Hadoop vs. Spark:
Logistic Regression: Hadoop 110 s, Spark 0.96 s
K-Means Clustering: Hadoop 155 s, Spark 4.1 s
NOTE: a lower iteration time means higher performance
36. Spark Installation
(For end-user side)
Download Spark distribution from https://spark.apache.org/downloads.html
which is pre-built for Hadoop 2.4 or later.
38. Spark Installation (continue)
Build the source code using Maven with Hadoop support:
<SPARK_HOME>#build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
39. How to run Spark ?
(Standalone mode )
Once the build is complete, go to the bin directory inside the Spark home directory in a terminal and invoke the Spark shell:
<SPARK_HOME>/bin#./spark-shell
40. To start all Spark’s Master and slave nodes:
Run the following command from the sbin directory inside the Spark home directory:
<SPARK_HOME>/sbin#./start-all.sh
42. To stop all Spark’s Master and slave nodes:
Run the following command from the sbin directory inside the Spark home directory:
<SPARK_HOME>/sbin#./stop-all.sh