Workshop on Parallel, Cluster and
Cloud Computing on Multi-core & GPU
(PCCCMG - 2015)
Workshop Conducted by
Computer Society of India In Association with
Dept. of CSE, VNIT and
Persistent Systems Ltd, Nagpur
4th – 6th Sep’15
Big-Data Cluster Computing
Advanced tools & technologies
Jagadeesan A S
Software Engineer
Persistent Systems Limited
www.github.com/jagadeesanas2
www.linkedin.com/in/jagadeesanas2
Content
Overview of Big Data
• Data clustering concepts
• Clustering vs Classification
• Data Journey
Advanced tools and technologies
• Apache Hadoop
• Apache Spark
Future of analytics
• Demo - Spark RDD in IntelliJ IDEA
What is Big-Data ?
Big-Data is similar to Small-Data, but bigger in size and complexity.
Definition from Wikipedia:
Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools
or traditional data processing applications.
Characterization of Big Data: 4V’s
Veracity
Characterization of Big Data: 4V’s
Now the big questions:
Why do we need Big Data?
What do we do with that data?
And the answer is very clear…!!
What is a Cluster ?
A group of the same or similar elements gathered or occurring closely together.
Clustering is the key to the Big Data problem
• Not feasible to “label” a large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data
Difference between Clustering & classification
Clustering data on
Clustering videos on
Clustering Algorithms
Hundreds of Clustering algorithms are available.
• K-Means
• Kernel K-means
• Nearest neighbour
• Gaussian mixture
• Fuzzy Clustering
• OPTICS algorithm
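As an illustration of the first algorithm in the list, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm) on 2-D points. The function name `kmeans`, the sample points and the choice of starting centroids are assumptions made for this example; the starting centroids are passed in explicitly only to keep the sketch deterministic, whereas practical implementations usually pick them at random.

```python
def kmeans(points, centroids, iterations=10):
    """Toy K-Means on 2-D points; returns (centroids, assignments)."""
    centroids = list(centroids)
    k = len(centroids)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignments = kmeans(points, centroids=[points[0], points[3]])
print(assignments)
```

No labels are supplied anywhere: the groups emerge from the distances alone, which is the point made above about clustering not needing labelled data.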
Data Journey
Advanced tools & technologies
Large-Scale Data Analytics
MapReduce computing paradigm vs. Traditional database systems
Database
Many enterprises are turning to Hadoop
Especially for applications generating big data: Web applications, social networks, scientific applications
APACHE HADOOP (Disk Based Computing)
open-source software framework written in Java for distributed storage and distributed processing
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
• A large number of low-end, cheap machines working in parallel to solve a computing problem
• In contrast to a small number of high-end, expensive machines
Hadoop cluster architecture
A Hadoop cluster can be divided into two abstract entities: a MapReduce engine and a distributed file system.
What is SPARK
Why SPARK
How to configure SPARK
APACHE SPARK
Open-source cluster computing framework
APACHE SPARK (Memory Based Computing)
Open-source cluster computing framework, written mainly in Scala, for distributed data processing
• Fast cluster computing system for large-scale data processing, compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Up to 100× faster
Often 2-10× less code
Spark Overview
Spark Shell
• Interactive shell for learning or data exploration
• Python or Scala
• It provides a preconfigured Spark context called sc.
Spark applications
• For large scale data processing
• Python, Java, Scala and R
• Every Spark application requires a Spark context. It is the main entry point to the Spark API.
Screenshots: Scala interactive shell, Python interactive shell
Spark Overview
Resilient distributed datasets (RDDs)
• Immutable collections of objects spread across a cluster
• Built through parallel transformations (map, filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM) for reuse
• Shared variables that can be used in parallel operations
Work with distributed collections as we would with local ones
Resilient Distributed Datasets (RDDs)
Two types of RDD operation
• Transformation – defines a new RDD based on the current one
Example: filter, map
• Action – returns a value.
Example: count, take(n), reduce
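The split between the two kinds of operation can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this example, not part of the Spark API: its transformations only record what should be done, and work happens when an action is called.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection
        self._steps = tuple(steps)   # recorded transformations

    # Transformations: return a NEW dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _evaluate(self):
        # Replay the recorded steps over the source, one item at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger evaluation and return a value.
    def count(self):
        return sum(1 for _ in self._evaluate())

    def take(self, n):
        return list(islice(self._evaluate(), n))


calls = []

def square(x):
    calls.append(x)  # record every time the function really runs
    return x * x

numbers = LazyDataset(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has been computed

total = even_squares.count()       # action: evaluation happens here
first = even_squares.take(1)       # action: stops after the first match
```

Each action replays the recorded steps from the source, which is also why a dataset that is reused many times benefits from being kept in memory.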
Resilient Distributed Datasets (RDDs)
File: movie.txt
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
RDD: mydata
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
Resilient Distributed Datasets (RDDs)
map and filter Transformation
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
Python: mydata.map(lambda line: line.upper())
Scala: mydata.map(line => line.toUpperCase())
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I HAD RATHER SEE THAN BE ONE.
Python: .filter(lambda line: line.startswith('I'))
Scala: .filter(line => line.startsWith("I"))
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
I HAD RATHER SEE THAN BE ONE.
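The two transformations above can be tried on a local list before running them on a cluster. This sketch uses Python's built-in `map` and `filter` as a stand-in for the RDD methods; the list `lines` is an assumption standing in for the contents of movie.txt.

```python
# Local stand-in for the RDD built from movie.txt.
lines = [
    "I have never seen the horror movies.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I had rather see than be one.",
]

# Same lambdas as on the slide, applied with Python built-ins.
upper = list(map(lambda line: line.upper(), lines))
starts_with_i = list(filter(lambda line: line.startswith("I"), upper))

for line in starts_with_i:
    print(line)
```

The line beginning with "BUT" is dropped by the filter, leaving three of the four upper-cased lines.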
Spark Stack
• Spark SQL:
--- For SQL and structured data processing
• Spark Streaming:
--- Stream processing of live data streams
• MLlib:
--- For machine learning algorithms
• GraphX:
--- Graph processing
Why Spark ?
• Core engine with SQL, Streaming, machine learning and graph processing modules.
• Can run today’s most advanced algorithms.
• Alternative to MapReduce for certain applications.
• APIs in Java, Scala and Python
• Interactive shells in Scala and Python
• Runs on YARN, Mesos and Standalone.
Spark’s major use cases over Hadoop
• Iterative algorithms in machine learning
• Interactive data mining and data processing
• Data warehousing: Spark SQL is compatible with Apache Hive and can run some queries up to 100x faster than Hive.
• Stream processing: log processing and fraud detection in live streams for alerts, aggregates and analysis
• Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful because they are easy and fast to process.
MapReduce Example: Word Count
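The word-count slides were diagrams, so here is a minimal single-machine sketch of the same idea in plain Python. The function names `map_phase`, `shuffle` and `reduce_phase` and the two sample documents are assumptions made for illustration; a real MapReduce engine runs these phases in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


# Hypothetical input: each string plays the role of one input split.
documents = ["to be or not to be", "to see or not to see"]

mapped = [pair for line in documents for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(word_counts)
```

Map calls are independent of each other, and so are reduce calls for different words, which is what lets the framework spread both phases over a cluster.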
Example : Page Rank
A way of analyzing websites based on their link relationships
• Good example of a more complex algorithm
• Multiple stages of map & reduce
• Benefits from Spark’s in-memory caching
• Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
• Links from many pages → high rank
• Link from a high-rank page → high rank
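The basic idea can be sketched as a short iterative loop in plain Python. This is a simplified version under stated assumptions: the damping factor of 0.85, the function name `pagerank` and the four-page link graph are illustrative choices, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, iterations=20, damping=0.85):
    """Simplified PageRank; links maps each page to the pages it links to."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Each page splits its current rank evenly among its outgoing links.
        contribs = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            for target in outgoing:
                contribs[target] += ranks[page] / len(outgoing)
        # New rank: a small base score plus the damped incoming contributions.
        ranks = {
            page: (1 - damping) + damping * contribs[page] for page in links
        }
    return ranks


# Hypothetical link graph: three pages link to "c", nothing links to "d".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Every iteration reads the same `links` structure again, which is why keeping it in memory between iterations pays off for this kind of algorithm.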
PageRank Performance
[Figure: bar chart of iteration time (s) against number of machines (30 and 60) for Hadoop and Spark; data labels shown: 171, 80, 23, 14]
NOTE: Lower iteration time means higher performance
Other Iterative Algorithms
[Figure: bar charts of time per iteration (s) for Hadoop and Spark. Logistic Regression: data labels 0.96 and 110. K-Means Clustering: data labels 4.1 and 155]
NOTE: Lower iteration time means higher performance
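The size of the gap in these charts can be worked out from the data labels. The sketch below assumes that, within each pair, the larger value belongs to the slower system and the smaller one to the faster system; the extracted text does not tie each number to a bar, so that pairing is an assumption.

```python
# (slower time, faster time) in seconds per iteration, read from the chart labels.
timings = {
    "Logistic Regression": (110, 0.96),
    "K-Means Clustering": (155, 4.1),
}

# Speedup = slower time divided by faster time.
speedups = {name: slow / fast for name, (slow, fast) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.1f}x less time per iteration")
```

Under that assumption the ratio is a little over 100x for logistic regression and a little under 40x for K-Means.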
Spark Installation
(For end-user side)
Download a Spark distribution from https://spark.apache.org/downloads.html,
choosing a package pre-built for Hadoop 2.4 or later.
Spark Installation
(For developer side)
Clone the source from the Apache Spark GitHub repository: https://github.com/apache/spark
Spark Installation (continued)
Build the source code using Maven, targeting your Hadoop version
<SPARK_HOME># build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
How to run Spark ?
(Standalone mode)
Once the build is complete, open a terminal, go to the bin directory inside the Spark home
directory and invoke the Spark shell:
<SPARK_HOME>/bin# ./spark-shell
To start all of Spark’s master and slave nodes:
Run the following in a terminal from the sbin directory inside the Spark home directory.
<SPARK_HOME>/sbin# ./start-all.sh
Spark Master web UI (browser view):
localhost:8080
To stop all of Spark’s master and slave nodes:
Run the following in a terminal from the sbin directory inside the Spark home directory.
<SPARK_HOME>/sbin# ./stop-all.sh
Future of analytics
Analytics in the Cloud
https://www.youtube.com/watch?v=JfqJTQnVZvA
• IBM is making Spark available as a cloud service on its Bluemix cloud platform.
• 3,500 IBM researchers and developers will work on Spark-related projects at more than a dozen labs worldwide.
Demo - Spark RDD in IntelliJ IDEA
Big data clustering

  • 1. Workshop on Parallel, Cluster and Cloud Computing on Multi-core & GPU (PCCCMG - 2015) Workshop Conducted by Computer Society of India In Association with Dept. of CSE, VNIT and Persistence System Ltd, Nagpur 4th – 6th Sep’15
  • 2. Big-Data Cluster Computing Advance tools & technologies Jagadeesan A S Software Engineer Persistent Systems Limited www.github.com/jagadeesanas2 www.linkedin.com/in/jagadeesanas2
  • 3. ContentContent Overview of Big Data • Data clustering concepts • Clustering vs Classification • Data Journey Advance tools and technologies • Apache Hadoop • Apache Spark Future of analytics • Demo - Spark RDD in Intellij IDEA
  • 4. Big-Data is similar to Small-Data , but bigger in size and complexity. What is Big-Data ? Definition from Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • 5. Characterization of Big Data: 4V’s Veracity
  • 6. Characterization of Big Data: 4V’s
  • 7. Now big question ???? Why we need Big Data ? What to do with those Data ?
  • 8. And the answer is very clear…!!
  • 9. What is a Cluster ? A group of the same or similar elements gathered or occurring closely together. Clustering is the key to Big Data problem • Not feasible to “label” large collection of objects • No prior knowledge of the number and nature of groups (clusters) in data • Clusters may evolve over time • Clustering provides efficient browsing, search, recommendation and organization of data
  • 10. Difference between Clustering & classification
  • 13. Clustering Algorithms Hundreds of Clustering algorithms are available. • K-Means • Kernel K-means • Nearest neighbour • Gaussian mixture • Fuzzy Clustering • OPTICS algorithm
  • 16. Large-Scale Data Analytics MapReduce computing paradigm vs. Traditional database systems Database Many enterprises are turned to Hadoop Especially applications generating big data, Web applications, social networks, scientific applications
  • 17. APACHE HADOOP (Disk Based Computing): open-source software framework written in Java for distributed storage and distributed processing. Design Principles of Hadoop • Need to process big data • Need to parallelize computation across thousands of nodes • Commodity hardware • A large number of low-end cheap machines working in parallel to solve a computing problem, rather than a small number of high-end expensive machines
  • 18. Hadoop cluster architecture: A Hadoop cluster can be divided into two abstract entities: a MapReduce engine and a distributed file system (HDFS)
  • 19. APACHE SPARK: Open-source cluster computing framework • What is Spark • Why Spark • How to configure Spark
  • 20. APACHE SPARK (Memory Based Computing): open-source cluster computing framework, written mainly in Scala, for distributed processing • Fast cluster computing system for large-scale data processing, compatible with Apache Hadoop • Improves efficiency through: • In-memory computing primitives • General computation graphs • Improves usability through: • Rich APIs in Java, Scala, Python • Interactive shell. Up to 100× faster, often 2-10× less code
  • 21. Spark Overview. Spark Shell: • Interactive shell for learning or data exploration • Python or Scala • It provides a preconfigured Spark context called sc. Spark applications: • For large-scale data processing • Python, Java, Scala and R • Every Spark application requires a Spark context. It is the main entry point to the Spark API. (Screenshots: Scala interactive shell, Python interactive shell)
  • 22. Spark Overview: Resilient distributed datasets (RDDs) • Immutable collections of objects spread across a cluster • Built through parallel transformations (map, filter, etc.) • Automatically rebuilt on failure • Controllable persistence (e.g. caching in RAM) for reuse • Shared variables that can be used in parallel operations. Work with distributed collections as we would with local ones
  • 23. Resilient Distributed Datasets (RDDs): Two types of RDD operation • Transformation: defines a new RDD based on the current one and is evaluated lazily. Examples: filter, map • Action: triggers the computation and returns a value. Examples: count, take(n), reduce
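The split between transformations and actions can be imitated with a toy class. `LazyDataset` below is a hypothetical stand-in written for this sketch, not part of the Spark API: its `map` and `filter` only record what to do, and nothing runs until `count` or `take` is called.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _run(self):
        # Replay the recorded steps over the source, one element at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger evaluation and return a plain Python value.
    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []


def traced_double(x):
    calls.append(x)  # record every time the function really runs
    return x * 2


ds = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 2)
evaluated_before_action = len(calls)  # still 0: only a plan exists so far
total = ds.count()                    # the action runs the whole pipeline
first_two = ds.take(2)
```

Each transformation returns a new object and leaves the old one untouched, which loosely mirrors the immutability of RDDs.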
  • 24. Resilient Distributed Datasets (RDDs) I have never seen the horror movies. I never hope to see one; But I can tell you, anyhow, I had rather see than be one. File: movie.txt RDD: mydata I have never seen the horror movies. I never hope to see one; But I can tell you, anyhow, I had rather see than be one.
  • 25. Resilient Distributed Datasets (RDDs): map and filter transformations. Input: I have never seen the horror movies. I never hope to see one; But I can tell you, anyhow, I had rather see than be one. After map (upper-case every line): I HAVE NEVER SEEN THE HORROR MOVIES. I NEVER HOPE TO SEE ONE; BUT I CAN TELL YOU, ANYHOW, I HAD RATHER SEE THAN BE ONE. After filter (keep lines starting with "I"): I HAVE NEVER SEEN THE HORROR MOVIES. I NEVER HOPE TO SEE ONE; I HAD RATHER SEE THAN BE ONE. Python: map(lambda line: line.upper()), filter(lambda line: line.startswith('I')). Scala: map(line => line.toUpperCase()), filter(line => line.startsWith("I"))
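The same two steps can be tried without a cluster using Python's built-in `map` and `filter`. The split of movie.txt into four lines is an assumption inferred from the filter result on the slide; the commented PySpark line shows roughly how the chain would look on an RDD and is not executed here.

```python
# Assumed line layout of movie.txt (the slide shows the text run together).
lines = [
    "I have never seen the horror movies.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I had rather see than be one.",
]

# map: apply a function to every element.
upper = list(map(lambda line: line.upper(), lines))

# filter: keep only the elements for which the predicate is true.
starts_with_i = list(filter(lambda line: line.startswith("I"), upper))

# Roughly equivalent PySpark chain (needs a SparkContext named sc; not run here):
# sc.textFile("movie.txt").map(lambda line: line.upper()).filter(lambda line: line.startswith("I"))
```

Only the line beginning with "BUT" is dropped, so three of the four lines remain.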
  • 26. Spark Stack • Spark SQL: for SQL and structured data processing • Spark Streaming: stream processing of live data streams • MLlib: machine learning algorithms • GraphX: graph processing
  • 27. Why Spark? • Core engine with SQL, streaming, machine learning and graph processing modules. • Can run today’s most advanced algorithms. • Alternative to MapReduce for certain applications. • APIs in Java, Scala and Python • Interactive shells in Scala and Python • Runs on YARN, Mesos and in standalone mode.
  • 28. Spark’s major use cases over Hadoop • Iterative algorithms in machine learning • Interactive data mining and data processing • Data warehousing: Spark is compatible with Apache Hive and is claimed to run up to 100x faster than Hive. • Stream processing: log processing and fraud detection in live streams for alerts, aggregates and analysis • Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful because they are easy and fast to process.
  • 33. Example: PageRank. A way of analyzing websites based on their link relationships • Good example of a more complex algorithm • Multiple stages of map & reduce • Benefits from Spark’s in-memory caching • Multiple iterations over the same data. Basic idea: give pages ranks (scores) based on links to them • Links from many pages → high rank • Link from a high-rank page → high rank
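The basic idea above can be sketched as a small single-machine loop. This is a simplified PageRank in plain Python, not the Spark demo from the talk; the four-page link graph, the damping factor of 0.85 and the fixed 20 iterations are assumptions chosen for the example.

```python
def pagerank(links, iterations=20, damping=0.85):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {page: 1.0 for page in pages}
    for _ in range(iterations):
        contributions = {page: 0.0 for page in pages}
        # "Map" stage: each page splits its current rank evenly among its out-links.
        for page, targets in links.items():
            for target in targets:
                contributions[target] += ranks[page] / len(targets)
        # "Reduce" stage: combine the received contributions into the new rank.
        ranks = {
            page: (1 - damping) + damping * contributions[page]
            for page in pages
        }
    return ranks


# Hypothetical link graph: c is linked from a, b and d; nobody links to d.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Every iteration reads the same `links` structure again, which is why keeping it cached in memory pays off for this kind of algorithm. Page d, which receives no links, stays at the minimum rank, while c, linked from three pages, ends well above b.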
  • 34. PageRank Performance (chart: iteration time in seconds vs. number of machines) • 30 machines: Hadoop 171 s, Spark 23 s • 60 machines: Hadoop 80 s, Spark 14 s. NOTE: a lower iteration time means higher performance
  • 35. Other Iterative Algorithms (chart: time per iteration in seconds) • Logistic Regression: Hadoop 110 s, Spark 0.96 s • K-Means Clustering: Hadoop 155 s, Spark 4.1 s. NOTE: a lower iteration time means higher performance
  • 36. Spark Installation (for end users): Download a Spark distribution from https://spark.apache.org/downloads.html, choosing a package pre-built for Hadoop 2.4 or later.
  • 37. Spark Installation (for developers): Clone the Apache Spark GitHub repository from https://github.com/apache/spark
  • 38. Spark Installation (continued): Build the source code using Maven with the Hadoop profiles: <SPARK_HOME># build/mvn -Pyarn -Phadoop -Phadoop-2.4 -Dhadoop.version=2.6.0
  • 39. How to run Spark? (Standalone mode) Once the build is complete, open a terminal, go to the bin directory inside the Spark home directory and invoke the Spark shell: <SPARK_HOME>/bin# ./spark-shell
  • 40. To start all of Spark’s master and slave nodes, run the following command in a terminal from the sbin directory inside the Spark home directory: <SPARK_HOME>/sbin# ./start-all.sh
  • 41. Spark Master web UI (browser view): localhost:8080
  • 42. To stop all of Spark’s master and slave nodes, run the following command in a terminal from the sbin directory inside the Spark home directory: <SPARK_HOME>/sbin# ./stop-all.sh
  • 45. https://www.youtube.com/watch?v=JfqJTQnVZvA • IBM is making Spark available as a cloud service on its Bluemix cloud platform. • IBM is committing 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide.
  • 46. Demo - Spark RDD in IntelliJ IDEA