
Unit-5:
Case studies of Big Data analytics using Map-Reduce programming
1. K-Means clustering
2. Using Big Data analytics libraries: Mahout

INTRODUCTION

What is Big Data Analytics?

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets
that include different types (structured and unstructured, streaming and batch) and different sizes, from
terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of
traditional relational databases to capture, manage, and process with low latency. Big data has one
or more of the following characteristics: high volume, high velocity, or high variety. It comes
from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media -
much of it generated in real time and at a very large scale.

Analyzing big data allows analysts, researchers, and business users to make better and faster
decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques
such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language
processing, businesses can analyze previously untapped data sources independently of or together with their
existing enterprise data to gain new insights, resulting in significantly better and faster decisions.

What is Machine Learning?

Machine learning is a branch of science that deals with programming systems in such a way
that they automatically learn and improve with experience. Here, learning means recognizing and
understanding the input data and making wise decisions based on the supplied data.

It is very difficult to hard-code decisions for all possible inputs. To tackle this problem,
algorithms are developed that build knowledge from specific data and past experience, using
the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement
learning, and control theory.

The developed algorithms form the basis of various applications such as:

• Vision processing
• Language processing
• Forecasting (e.g., stock market trends)
• Pattern recognition
• Games
• Data mining
• Expert systems
• Robotics

Machine learning is a vast area, and it is beyond the scope of this unit to cover all of its features.
There are several ways to implement machine learning techniques; the most commonly used
ones are supervised and unsupervised learning.

Supervised Learning

Supervised learning deals with learning a function from available training data. A supervised learning
algorithm analyzes the training data and produces an inferred function, which can be used for mapping
new examples. Common examples of supervised learning include:

• classifying e-mails as spam,
• labeling webpages based on their content, and
• voice recognition.

There are many supervised learning algorithms such as neural networks, Support Vector Machines
(SVMs), and Naive Bayes classifiers. Mahout implements the Naive Bayes classifier.

Unsupervised Learning

Unsupervised learning makes sense of unlabeled data without any predefined dataset for its
training. It is an extremely powerful tool for analyzing available data and looking for
patterns and trends. It is most commonly used for clustering similar input into logical groups. Common
approaches to unsupervised learning include:

• k-means
• self-organizing maps, and
• hierarchical clustering

1. K-Means Clustering

Clustering is used to form groups or clusters of similar data based on common characteristics. Clustering
is a form of unsupervised learning.

• Search engines such as Google and Yahoo! use clustering techniques to group data with similar
characteristics.
• Newsgroups use clustering techniques to group various articles based on related topics.

The clustering engine goes through the input data completely and, based on the characteristics of the data,
decides under which cluster each item should be grouped.

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e.,
data without defined categories or groups). The goal of this algorithm is to find groups in the data, with
the number of groups represented by the variable K. The algorithm works iteratively to assign each data
point to one of K groups based on the features that are provided. Data points are clustered based on
feature similarity. The results of the K-means clustering algorithm are:

1. The centroids of the K clusters, which can be used to label new data

2. Labels for the training data (each data point is assigned to a single cluster)

Rather than defining groups before looking at the data, clustering allows you to find and analyze the
groups that have formed organically. How the number of groups K can be chosen is discussed later in this
section.

Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining
the centroid feature weights can be used to qualitatively interpret what kind of group each cluster
represents.

In general, we have n data points x_i, i = 1...n, that have to be partitioned into k clusters. The goal is to assign
a cluster to each data point. K-means is a clustering method that aims to find the positions c_i, i = 1...k, of
the cluster centers that minimize the distance from the data points to their assigned clusters. In other words,
K-means clustering solves

    arg min_c Σ_{i=1}^{k} Σ_{x ∈ cluster i} d(x, c_i)²

where d(x, c_i) is typically the Euclidean distance between a point x and the cluster center c_i.

K-means algorithm

1. The data is to be clustered into k groups, where k is predefined.
2. Select k points at random as the initial cluster centers.
3. Assign each object to its closest cluster center according to the Euclidean distance function.
4. Calculate the centroid, or mean, of all objects in each cluster; these become the new cluster centers.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.


K-means is a relatively efficient method. However, the number of clusters k must be specified in
advance, the final result is sensitive to the initialization, and the algorithm often terminates at a local optimum.
Unfortunately, there is no global theoretical method for finding the optimal number of clusters. A practical
approach is to compare the outcomes of multiple runs with different values of k and choose the best one based on
a predefined criterion. In general, a larger k probably decreases the error but increases the risk of
overfitting.

Example:
Suppose we want to group the visitors to a website using just their age (a one-dimensional space) as
follows:
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Initial clusters:
Centroid (C1) = 16, Centroid (C2) = 22
(For example, after the first assignment the new value of C1 is the mean of its members: (15 + 15 + 16)/3 = 46/3 = 15.33.)
Iteration 1:
C1 = 15.33 [15,15,16]
C2 = 36.25 [19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65]
Iteration 2:
C1 = 18.56 [15,15,16,19,19,20,20,21,22]
C2 = 45.90 [28,35,40,41,42,43,44,60,61,65]
Iteration 3:
C1 = 19.50 [15,15,16,19,19,20,20,21,22,28]
C2 = 47.89 [35,40,41,42,43,44,60,61,65]
Iteration 4:
C1 = 19.50 [15,15,16,19,19,20,20,21,22,28]
C2 = 47.89 [35,40,41,42,43,44,60,61,65]
No change occurs between iterations 3 and 4, so the algorithm stops. Clustering has identified two groups:
ages 15-28 and ages 35-65. The initial choice of centroids can affect the output clusters, so the algorithm is often
run multiple times with different starting conditions in order to get a fair view of what the clusters
should be.
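
The example above can be reproduced with a few lines of code. The following is a minimal in-memory sketch of the algorithm for one-dimensional data (illustrative only, not the MapReduce formulation discussed next; the class and method names are made up for this example):

import java.util.Arrays;

public class OneDimensionalKMeans {

    // Runs k-means on 1-D points until the assignments stop changing.
    static double[] cluster(double[] points, double[] centroids) {
        int[] assignment = new int[points.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Step 3: assign each point to its closest centroid (Euclidean distance)
            for (int p = 0; p < points.length; p++) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(points[p] - centroids[c]) < Math.abs(points[p] - centroids[nearest])) {
                        nearest = c;
                    }
                }
                if (assignment[p] != nearest) {
                    assignment[p] = nearest;
                    changed = true;
                }
            }
            // Step 4: recompute each centroid as the mean of its assigned points
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int p = 0; p < points.length; p++) {
                sum[assignment[p]] += points[p];
                count[assignment[p]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] ages = {15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65};
        double[] centroids = {16, 22};   // the same initial centroids as in the worked example
        System.out.println(Arrays.toString(cluster(ages, centroids)));
    }
}

Run with the initial centroids 16 and 22, this converges to the same final centroids as above, approximately 19.5 and 47.89 (the intermediate assignments may differ slightly depending on how ties are broken).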

MapReduce Approach

MapReduce works on keys and values and is based on data partitioning, so the assumption that all data
points are available in memory no longer holds. We have to design the algorithm in such a manner that
the task can be parallelized and no split depends on any other split for its computation (see the figure below).

Figure: A single pass of K-Means on MapReduce

The mappers do the distance computation and emit a key-value pair <centroid_id, datapoint>. This
step determines which cluster each data point is associated with.

Each reducer receives a specific centroid_id together with the list of data points assigned to it. The reducer
computes the new mean for that cluster and writes it to the new centroid file.

Termination is then driven by the user's choice: either run a fixed number of iterations, or compare the
new centroids with those of the previous iteration and stop once they no longer change.
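
As a concrete illustration, the sketch below shows what such a mapper and reducer might look like for one iteration. It assumes one-dimensional points (one value per input line) and the current centroids passed to the mappers through a hypothetical job-configuration property named kmeans.centroids; a production implementation (for example Mahout's) works with vectors and distributes centroids differently.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: computes the distance from each point to every centroid and
// emits <centroid_id, datapoint> for the closest one.
public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    private double[] centroids;

    @Override
    protected void setup(Context context) {
        // Hypothetical property: the current centroids as a comma-separated list, e.g. "16,22"
        String[] parts = context.getConfiguration().get("kmeans.centroids").split(",");
        centroids = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            centroids[i] = Double.parseDouble(parts[i].trim());
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        double point = Double.parseDouble(line.toString().trim());
        int nearest = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(point - centroids[c]) < Math.abs(point - centroids[nearest])) {
                nearest = c;
            }
        }
        context.write(new IntWritable(nearest), new DoubleWritable(point));
    }
}

// Reducer: receives one centroid_id and all points assigned to it, and writes the new
// centroid (the mean of those points). In practice the mapper and reducer would live in
// separate files or as nested classes of the driver.
class KMeansReducer extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable centroidId, Iterable<DoubleWritable> points, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        long count = 0;
        for (DoubleWritable p : points) {
            sum += p.get();
            count++;
        }
        context.write(centroidId, new DoubleWritable(sum / count));
    }
}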

Figure: The K-Means algorithm on MapReduce; the termination method is user-driven



2. APACHE MAHOUT

Mahout - Introduction

We are living in a day and age where information is available in abundance. The information
overload has scaled to such heights that sometimes it becomes difficult to manage our little
mailboxes! Imagine the volume of data and records some of the popular websites (the likes of
Facebook, Twitter, and Youtube) have to collect and manage on a daily basis. It is not uncommon
even for lesser known websites to receive huge amounts of information in bulk.

Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw
conclusions. However, no data mining algorithm can process very large datasets and provide
outcomes quickly unless the computational tasks are run on multiple machines distributed over
the cloud.

We now have new frameworks that allow us to break down a computation task into multiple
segments and run each segment on a different machine. Mahout is one such data mining framework;
it normally runs on top of the Hadoop infrastructure to manage huge volumes of data.

What is Apache Mahout?

A mahout is a person who drives an elephant as its master. The name comes from the project's close
association with Apache Hadoop, which uses an elephant as its logo.

Hadoop is an open-source framework from Apache that allows you to store and process big data in a
distributed environment across clusters of computers using simple programming models.

Apache Mahout is an open source project that is primarily used for creating scalable machine
learning algorithms. It implements popular machine learning techniques such as:

a. Recommendation
b. Classification
c. Clustering

Apache Mahout started as a sub-project of Apache’s Lucene in 2008. In 2010, Mahout became a
top level project of Apache.

Features of Mahout

The primitive features of Apache Mahout are listed below.

• The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed
environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.
• Mahout offers the coder a ready-to-use framework for doing data mining tasks on large
volumes of data.
• Mahout lets applications analyze large sets of data effectively and quickly.
• Includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-
means, Canopy, Dirichlet, and Mean-Shift.

• Supports Distributed Naive Bayes and Complementary Naive Bayes classification
implementations.
• Comes with distributed fitness function capabilities for evolutionary programming.
• Includes matrix and vector libraries.

Applications of Mahout

• Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout
internally.
• Foursquare helps you find places, food, and entertainment available in a particular
area; it uses the recommender engine of Mahout.
• Twitter uses Mahout for user interest modelling.
• Yahoo! uses Mahout for pattern mining.

Getting started with Mahout


Getting up and running with Mahout is relatively straightforward. To start, you need to install the
following prerequisites:

• JDK 1.6 or higher
• Ant 1.7 or higher
• Maven 2.0.9 or 2.0.10 (if you want to build the Mahout source)


You also need the sample code that accompanies the original article (see its Download section), which
includes a copy of Mahout and its dependencies. Follow these steps to install the sample code:

1. unzip sample.zip
2. cd apache-mahout-examples
3. ant install

Step 3 downloads the necessary Wikipedia files and compiles the code. The Wikipedia file used is
approximately 2.5 gigabytes, so download times will depend on your bandwidth.

a. Recommendation
Recommendation is a popular technique that provides recommendations to users based on
information such as their previous purchases, clicks, and ratings.

• Amazon uses this technique to display a list of recommended items that you might be
interested in, drawing information from your past actions. There are recommender engines
that work behind Amazon to capture user behavior and recommend selected items based on
your earlier actions.
• Facebook uses the recommender technique to identify and recommend the "people you
may know" list.


Building a recommendation engine:

Mahout currently provides tools for building a recommendation engine through the Taste library,
a fast and flexible engine for collaborative filtering (CF). Taste supports both user-based and item-based
recommendations and comes with many choices for making recommendations, as well as
interfaces for you to define your own. Taste consists of five primary components that work with
Users, Items, and Preferences:

Data Model: Storage for Users, Items, and Preferences

User Similarity: Interface defining the similarity between two users

Item Similarity: Interface defining the similarity between two items

Recommender: Interface for providing recommendations

User Neighborhood: Interface for computing a neighborhood of similar users that can then be used
by the Recommenders

These components and their implementations make it possible to build out complex
recommendation systems for either real-time-based recommendations or offline recommendations.
Real-time-based recommendations often can handle only a few thousand users, whereas offline
recommendations can scale much higher. Taste even comes with tools for leveraging Hadoop to
calculate recommendations offline. In many cases, this is a reasonable approach that allows you to
meet the demands of a large system with a lot of users, items, and preferences.
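
To make the five components concrete, here is a minimal user-based recommender wired together with Taste. This is a sketch only: it assumes Mahout's Taste Java API is on the classpath and uses a hypothetical ratings.csv file containing one userID,itemID,preference triple per line.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // DataModel: Users, Items, and Preferences loaded from a CSV file (hypothetical file name)
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // UserSimilarity: how alike two users' preferences are
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // UserNeighborhood: the 10 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommender: produces recommendations from the model, neighborhood, and similarity
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}

Because each component is an interface, swapping PearsonCorrelationSimilarity or NearestNUserNeighborhood for other implementations, or the user-based recommender for an item-based one, tailors the engine without changing the surrounding code.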

b. Classification
Classification, also known as categorization, is a machine learning technique that uses known
data to determine how the new data should be classified into a set of existing categories.
Classification is a form of supervised learning.

• Mail service providers such as Yahoo! and Gmail use this technique to decide whether a
new mail should be classified as spam. The categorization algorithm trains itself by
analyzing user habits of marking certain mails as spam. Based on that, the classifier
decides whether a future mail should be deposited in your inbox or in the spam folder.
• The iTunes application uses classification to prepare playlists.

How Classification Works

While classifying a given set of data, the classifier system performs the following actions:

• Initially, a data model is prepared (trained) using one of the learning algorithms and the available labeled data.
• The prepared data model is then tested.
• Thereafter, this data model is used to evaluate new data and to determine its class.

Applications of Classification

• Credit card fraud detection - The classification mechanism is used to detect credit card
fraud. Using historical information about previous frauds, the classifier can predict which
future transactions may be fraudulent.
• Spam e-mails - Depending on the characteristics of previous spam mails, the classifier
determines whether a newly encountered e-mail should be sent to the spam folder.

Naive Bayes Classifier

Mahout uses the Naive Bayes classifier algorithm in two implementations:

• Distributed Naive Bayes classification
• Complementary Naive Bayes classification

Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for
training such classifiers but a family of algorithms based on a common principle: the value of any
particular feature is assumed to be independent of the value of every other feature, given the class.
A Bayes classifier uses the available training data to construct a model that assigns class labels to
problem instances.

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the
parameters necessary for classification.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a
supervised learning setting.

Despite its oversimplified assumptions, naive Bayes classifiers have worked quite well in many
complex real-world situations.
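
For reference, the decision rule shared by these classifiers can be written as follows: given features x_1, ..., x_n, the predicted class is the one with the highest posterior probability under the naive (conditional-independence) assumption,

    \hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)

where P(c) is estimated from the class frequencies in the training data and each P(x_i | c) from per-class feature counts, which is why only a small amount of training data is needed to estimate these parameters.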

Procedure of Classification

The following steps are to be followed to implement classification:

• Generate example data
• Create sequence files from the data
• Convert the sequence files to vectors
• Train the classifier on the vectors
• Test the classifier on a held-out set of vectors

c. Clustering

Clustering is used to form groups or clusters of similar data based on common characteristics.
Clustering is a form of unsupervised learning.

• Search engines such as Google and Yahoo! use clustering techniques to group data with
similar characteristics.
• Newsgroups use clustering techniques to group various articles based on related topics.

The clustering engine goes through the input data completely and, based on the characteristics of
the data, decides under which cluster each item should be grouped.

Using Mahout, we can cluster a given set of data. The steps required are as follows:

• Algorithm: You need to select a suitable clustering algorithm to group the elements of a
cluster.
• Similarity and Dissimilarity: You need a rule in place to verify the similarity between
newly encountered elements and the elements already in the groups.
• Stopping Condition: A stopping condition is required to define the point at which no further
clustering is required.

Procedure of Clustering

To cluster the given data, you need to:

• Start the Hadoop server. Create the required directories for storing files in the Hadoop File System
(directories for the input file, the sequence file, and the clustered output in the case of Canopy).
• Copy the input file to the Hadoop File System from the Unix file system.
• Prepare the sequence file from the input data (see the sketch after this list).
• Run any of the available clustering algorithms.
• Get the clustered data.
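
As an illustration of the "prepare the sequence file" step referenced above, the sketch below writes a handful of points as Mahout vectors into a Hadoop SequenceFile. It assumes Hadoop and the Mahout math library are on the classpath; the output path input/points/part-00000 and the sample values are made up for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class WritePointsAsSequenceFile {
    public static void main(String[] args) throws Exception {
        double[][] points = { {15}, {16}, {22}, {28}, {44}, {65} }; // sample 1-D ages
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("input/points/part-00000");           // hypothetical output path
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
        try {
            for (int i = 0; i < points.length; i++) {
                // Each record: a text key and the point wrapped as a Mahout vector
                writer.append(new Text("point-" + i),
                              new VectorWritable(new DenseVector(points[i])));
            }
        } finally {
            writer.close();
        }
    }
}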

Mahout supports several clustering-algorithm implementations, all written in Map-Reduce, each with its
own set of goals and criteria:

Canopy: A fast clustering algorithm often used to create initial seeds for other clustering algorithms.

k-Means (and fuzzy k-Means): Clusters items into k clusters based on the distance the items are from the
centroid, or center, of the previous iteration.

Mean-Shift: Algorithm that does not require any a priori knowledge about the number of clusters and can
produce arbitrarily shaped clusters.

Dirichlet: Clusters based on the mixing of many probabilistic models, giving it the advantage that it
doesn't need to commit to a particular view of the clusters prematurely.
