
Unit-5:
Case studies of Big Data analytics using Map-Reduce programming
1. K-Means clustering
2. Using Big Data analytics libraries: Mahout

INTRODUCTION

What is Big Data Analytics?

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets
that include different types (structured and unstructured, streaming and batch) and different sizes, from
terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of
traditional relational databases to capture, manage, and process with low latency. Big data has one
or more of the following characteristics: high volume, high velocity, or high variety. It comes
from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media -
much of it generated in real time and at a very large scale.

Analyzing big data allows analysts, researchers, and business users to make better and faster
decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques
such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language
processing, businesses can analyze previously untapped data sources independently of or together with their
existing enterprise data to gain new insights, resulting in significantly better and faster decisions.

What is Machine Learning?

Machine learning is a branch of science that deals with programming systems in such a way
that they automatically learn and improve with experience. Here, learning means recognizing and
understanding the input data and making wise decisions based on the supplied data.

It is very difficult to hard-code decisions for all possible inputs. To tackle this problem,
algorithms are developed that build knowledge from specific data and past experience, using
the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement
learning, and control theory.

The developed algorithms form the basis of various applications such as:

• Vision processing
• Language processing
• Forecasting (e.g., stock market trends)
• Pattern recognition
• Games
• Data mining
• Expert systems
• Robotics

Machine learning is a vast area, and it is beyond the scope of this unit to cover all of its features.
There are several ways to implement machine learning techniques; the most commonly used
ones are supervised and unsupervised learning.

Supervised Learning

Supervised learning deals with learning a function from available training data. A supervised learning
algorithm analyzes the training data and produces an inferred function, which can be used for mapping
new examples. Common examples of supervised learning include:

• classifying e-mails as spam,
• labeling webpages based on their content, and
• voice recognition.

There are many supervised learning algorithms such as neural networks, Support Vector Machines
(SVMs), and Naive Bayes classifiers. Mahout implements the Naive Bayes classifier.

Unsupervised Learning

Unsupervised learning makes sense of unlabeled data without any predefined dataset for its
training. It is an extremely powerful tool for analyzing available data and looking for
patterns and trends. It is most commonly used for clustering similar input into logical groups. Common
approaches to unsupervised learning include:

• k-means
• self-organizing maps, and
• hierarchical clustering

1. K-Means Clustering

Clustering is used to form groups or clusters of similar data based on common characteristics. Clustering
is a form of unsupervised learning.

• Search engines such as Google and Yahoo! use clustering techniques to group data with similar
characteristics.
• Newsgroups use clustering techniques to group various articles based on related topics.

The clustering engine goes through the input data completely and, based on the characteristics of the data,
decides under which cluster each item should be grouped.

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e.,
data without defined categories or groups). The goal of this algorithm is to find groups in the data, with
the number of groups represented by the variable K. The algorithm works iteratively to assign each data
point to one of K groups based on the features that are provided. Data points are clustered based on
feature similarity. The results of the K-means clustering algorithm are:

1. The centroids of the K clusters, which can be used to label new data

2. Labels for the training data (each data point is assigned to a single cluster)

Rather than defining groups before looking at the data, clustering allows you to find and analyze the
groups that have formed organically. How the number of groups K can be chosen is discussed later in this
section.

Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining
the centroid feature weights can be used to qualitatively interpret what kind of group each cluster
represents.

In general, we have n data points x_i, i = 1...n, that have to be partitioned into k clusters. The goal is to assign
a cluster to each data point. K-means is a clustering method that aims to find the positions c_i, i = 1...k, of
the cluster centers that minimize the distance from the data points to their assigned clusters. In other words,
K-means clustering solves

    arg min_c Σ_{i=1}^{k} Σ_{x ∈ cluster i} d(x, c_i)²

where d(x, c_i) is typically the Euclidean distance between a point x and the cluster center c_i.

K-means algorithm

1. The data is to be clustered into k groups, where k is predefined.
2. Select k points at random as the initial cluster centers.
3. Assign each object to its closest cluster center according to the Euclidean distance function.
4. Calculate the centroid, or mean, of all objects in each cluster; these become the new cluster centers.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.


K-means is a relatively efficient method. However, the number of clusters k must be specified in
advance, the final result is sensitive to the initialization, and the algorithm often terminates at a local optimum.
Unfortunately, there is no global theoretical method for finding the optimal number of clusters. A practical
approach is to compare the outcomes of multiple runs with different values of k and choose the best one based on
a predefined criterion. In general, a larger k probably decreases the error but increases the risk of
overfitting.

Example:
Suppose we want to group the visitors to a website using just their age (a one-dimensional space) as
follows:
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Initial clusters:
Centroid (C1) = 16, Centroid (C2) = 22
(For example, after the first assignment the new value of C1 is the mean of its members: (15 + 15 + 16)/3 = 46/3 = 15.33.)
Iteration 1:
C1 = 15.33 [15,15,16]
C2 = 36.25 [19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65]
Iteration 2:
C1 = 18.56 [15,15,16,19,19,20,20,21,22]
C2 = 45.90 [28,35,40,41,42,43,44,60,61,65]
Iteration 3:
C1 = 19.50 [15,15,16,19,19,20,20,21,22,28]
C2 = 47.89 [35,40,41,42,43,44,60,61,65]
Iteration 4:
C1 = 19.50 [15,15,16,19,19,20,20,21,22,28]
C2 = 47.89 [35,40,41,42,43,44,60,61,65]
No change occurs between iterations 3 and 4, so the algorithm stops. Clustering has identified two groups:
ages 15-28 and ages 35-65. The initial choice of centroids can affect the output clusters, so the algorithm is often
run multiple times with different starting conditions in order to get a fair view of what the clusters
should be.
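
The example above can be reproduced with a few lines of code. The following is a minimal in-memory sketch of the algorithm for one-dimensional data (illustrative only, not the MapReduce formulation discussed next; the class and method names are made up for this example):

import java.util.Arrays;

public class OneDimensionalKMeans {

    // Runs k-means on 1-D points until the assignments stop changing.
    static double[] cluster(double[] points, double[] centroids) {
        int[] assignment = new int[points.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Step 3: assign each point to its closest centroid (Euclidean distance)
            for (int p = 0; p < points.length; p++) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(points[p] - centroids[c]) < Math.abs(points[p] - centroids[nearest])) {
                        nearest = c;
                    }
                }
                if (assignment[p] != nearest) {
                    assignment[p] = nearest;
                    changed = true;
                }
            }
            // Step 4: recompute each centroid as the mean of its assigned points
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int p = 0; p < points.length; p++) {
                sum[assignment[p]] += points[p];
                count[assignment[p]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] ages = {15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65};
        double[] centroids = {16, 22};   // the same initial centroids as in the worked example
        System.out.println(Arrays.toString(cluster(ages, centroids)));
    }
}

Run with the initial centroids 16 and 22, this converges to the same final centroids as above, approximately 19.5 and 47.89 (the intermediate assignments may differ slightly depending on how ties are broken).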

MapReduce Approach

MapReduce works on keys and values and is based on data partitioning, so the assumption that all data
points are available in memory no longer holds. We have to design the algorithm in such a manner that
the task can be parallelized and no split depends on any other split for its computation (see the figure below).

Figure: A single pass of K-Means on MapReduce

The mappers do the distance computation and emit a key-value pair <centroid_id, datapoint>. This
step determines which cluster each data point is associated with.

Each reducer receives a specific centroid_id together with the list of data points assigned to it. The reducer
computes the new mean for that cluster and writes it to the new centroid file.

Termination is then driven by the user's choice: either run a fixed number of iterations, or compare the
new centroids with those of the previous iteration and stop once they no longer change.
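
As a concrete illustration, the sketch below shows what such a mapper and reducer might look like for one iteration. It assumes one-dimensional points (one value per input line) and the current centroids passed to the mappers through a hypothetical job-configuration property named kmeans.centroids; a production implementation (for example Mahout's) works with vectors and distributes centroids differently.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: computes the distance from each point to every centroid and
// emits <centroid_id, datapoint> for the closest one.
public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    private double[] centroids;

    @Override
    protected void setup(Context context) {
        // Hypothetical property: the current centroids as a comma-separated list, e.g. "16,22"
        String[] parts = context.getConfiguration().get("kmeans.centroids").split(",");
        centroids = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            centroids[i] = Double.parseDouble(parts[i].trim());
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        double point = Double.parseDouble(line.toString().trim());
        int nearest = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(point - centroids[c]) < Math.abs(point - centroids[nearest])) {
                nearest = c;
            }
        }
        context.write(new IntWritable(nearest), new DoubleWritable(point));
    }
}

// Reducer: receives one centroid_id and all points assigned to it, and writes the new
// centroid (the mean of those points). In practice the mapper and reducer would live in
// separate files or as nested classes of the driver.
class KMeansReducer extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable centroidId, Iterable<DoubleWritable> points, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        long count = 0;
        for (DoubleWritable p : points) {
            sum += p.get();
            count++;
        }
        context.write(centroidId, new DoubleWritable(sum / count));
    }
}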

Figure: The K-Means algorithm on MapReduce; the termination method is user-driven



2. APACHE MAHOUT

Mahout - Introduction

We are living in a day and age where information is available in abundance. The information
overload has scaled to such heights that sometimes it becomes difficult to manage our little
mailboxes! Imagine the volume of data and records some of the popular websites (the likes of
Facebook, Twitter, and Youtube) have to collect and manage on a daily basis. It is not uncommon
even for lesser known websites to receive huge amounts of information in bulk.

Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw
conclusions. However, no data mining algorithm can process very large datasets and provide
outcomes quickly unless the computational tasks are run on multiple machines distributed over
the cloud.

We now have new frameworks that allow us to break down a computation task into multiple
segments and run each segment on a different machine. Mahout is one such data mining framework;
it normally runs on top of the Hadoop infrastructure to manage huge volumes of data.

What is Apache Mahout?

A mahout is a person who drives an elephant as its master. The name comes from the project's close
association with Apache Hadoop, which uses an elephant as its logo.

Hadoop is an open-source framework from Apache that allows you to store and process big data in a
distributed environment across clusters of computers using simple programming models.

Apache Mahout is an open source project that is primarily used for creating scalable machine
learning algorithms. It implements popular machine learning techniques such as:

a. Recommendation
b. Classification
c. Clustering

Apache Mahout started as a sub-project of Apache’s Lucene in 2008. In 2010, Mahout became a
top level project of Apache.

Features of Mahout

The primitive features of Apache Mahout are listed below.

• The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed
environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.
• Mahout offers the coder a ready-to-use framework for doing data mining tasks on large
volumes of data.
• Mahout lets applications analyze large sets of data effectively and quickly.
• Includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-
means, Canopy, Dirichlet, and Mean-Shift.

• Supports Distributed Naive Bayes and Complementary Naive Bayes classification
implementations.
• Comes with distributed fitness function capabilities for evolutionary programming.
• Includes matrix and vector libraries.

Applications of Mahout

• Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout
internally.
• Foursquare helps you find places, food, and entertainment available in a particular
area; it uses the recommender engine of Mahout.
• Twitter uses Mahout for user interest modelling.
• Yahoo! uses Mahout for pattern mining.

Getting started with Mahout


Getting up and running with Mahout is relatively straightforward. To start, you need to install the
following prerequisites:

• JDK 1.6 or higher
• Ant 1.7 or higher
• Maven 2.0.9 or 2.0.10 (if you want to build the Mahout source)


You also need the sample code that accompanies the original article (see its Download section), which
includes a copy of Mahout and its dependencies. Follow these steps to install the sample code:

1. unzip sample.zip
2. cd apache-mahout-examples
3. ant install

Step 3 downloads the necessary Wikipedia files and compiles the code. The Wikipedia file used is
approximately 2.5 gigabytes, so download times will depend on your bandwidth.

a. Recommendation
Recommendation is a popular technique that provides recommendations to users based on
information such as their previous purchases, clicks, and ratings.

• Amazon uses this technique to display a list of recommended items that you might be
interested in, drawing information from your past actions. There are recommender engines
that work behind Amazon to capture user behavior and recommend selected items based on
your earlier actions.
• Facebook uses the recommender technique to identify and recommend the "people you
may know" list.


Building a recommendation engine:

Mahout currently provides tools for building a recommendation engine through the Taste library,
a fast and flexible engine for collaborative filtering (CF). Taste supports both user-based and item-based
recommendations and comes with many choices for making recommendations, as well as
interfaces for you to define your own. Taste consists of five primary components that work with
Users, Items, and Preferences:

Data Model: Storage for Users, Items, and Preferences

User Similarity: Interface defining the similarity between two users

Item Similarity: Interface defining the similarity between two items

Recommender: Interface for providing recommendations

User Neighborhood: Interface for computing a neighborhood of similar users that can then be used
by the Recommenders

These components and their implementations make it possible to build out complex
recommendation systems for either real-time-based recommendations or offline recommendations.
Real-time-based recommendations often can handle only a few thousand users, whereas offline
recommendations can scale much higher. Taste even comes with tools for leveraging Hadoop to
calculate recommendations offline. In many cases, this is a reasonable approach that allows you to
meet the demands of a large system with a lot of users, items, and preferences.
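
To make the five components concrete, here is a minimal user-based recommender wired together with Taste. This is a sketch only: it assumes Mahout's Taste Java API is on the classpath and uses a hypothetical ratings.csv file containing one userID,itemID,preference triple per line.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // DataModel: Users, Items, and Preferences loaded from a CSV file (hypothetical file name)
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // UserSimilarity: how alike two users' preferences are
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // UserNeighborhood: the 10 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommender: produces recommendations from the model, neighborhood, and similarity
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}

Because each component is an interface, swapping PearsonCorrelationSimilarity or NearestNUserNeighborhood for other implementations, or the user-based recommender for an item-based one, tailors the engine without changing the surrounding code.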

b. Classification
Classification, also known as categorization, is a machine learning technique that uses known
data to determine how the new data should be classified into a set of existing categories.
Classification is a form of supervised learning.

• Mail service providers such as Yahoo! and Gmail use this technique to decide whether a
new mail should be classified as spam. The categorization algorithm trains itself by
analyzing user habits of marking certain mails as spam. Based on that, the classifier
decides whether a future mail should be deposited in your inbox or in the spam folder.
• The iTunes application uses classification to prepare playlists.

How Classification Works

While classifying a given set of data, the classifier system performs the following actions:

• Initially, a data model is prepared (trained) using one of the learning algorithms and the available labeled data.
• The prepared data model is then tested.
• Thereafter, this data model is used to evaluate new data and to determine its class.

Applications of Classification

• Credit card fraud detection - The classification mechanism is used to detect credit card
fraud. Using historical information about previous frauds, the classifier can predict which
future transactions may be fraudulent.
• Spam e-mails - Depending on the characteristics of previous spam mails, the classifier
determines whether a newly encountered e-mail should be sent to the spam folder.

Naive Bayes Classifier

Mahout uses the Naive Bayes classifier algorithm in two implementations:

• Distributed Naive Bayes classification
• Complementary Naive Bayes classification

Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for
training such classifiers but a family of algorithms based on a common principle: the value of any
particular feature is assumed to be independent of the value of every other feature, given the class.
A Bayes classifier uses the available training data to construct a model that assigns class labels to
problem instances.

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the
parameters necessary for classification.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a
supervised learning setting.

Despite its oversimplified assumptions, naive Bayes classifiers have worked quite well in many
complex real-world situations.
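
For reference, the decision rule shared by these classifiers can be written as follows: given features x_1, ..., x_n, the predicted class is the one with the highest posterior probability under the naive (conditional-independence) assumption,

    \hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)

where P(c) is estimated from the class frequencies in the training data and each P(x_i | c) from per-class feature counts, which is why only a small amount of training data is needed to estimate these parameters.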

Procedure of Classification

The following steps are to be followed to implement classification:

• Generate example data
• Create sequence files from the data
• Convert the sequence files to vectors
• Train the classifier on the vectors
• Test the classifier on a held-out set of vectors

c. Clustering

Clustering is used to form groups or clusters of similar data based on common characteristics.
Clustering is a form of unsupervised learning.

• Search engines such as Google and Yahoo! use clustering techniques to group data with
similar characteristics.
• Newsgroups use clustering techniques to group various articles based on related topics.

The clustering engine goes through the input data completely and, based on the characteristics of
the data, decides under which cluster each item should be grouped.

Using Mahout, we can cluster a given set of data. The steps required are as follows:

• Algorithm: You need to select a suitable clustering algorithm to group the elements of a
cluster.
• Similarity and Dissimilarity: You need a rule in place to verify the similarity between
newly encountered elements and the elements already in the groups.
• Stopping Condition: A stopping condition is required to define the point at which no further
clustering is required.

Procedure of Clustering

To cluster the given data, you need to:

• Start the Hadoop server. Create the required directories for storing files in the Hadoop File System
(directories for the input file, the sequence file, and the clustered output in the case of Canopy).
• Copy the input file to the Hadoop File System from the Unix file system.
• Prepare the sequence file from the input data (see the sketch after this list).
• Run any of the available clustering algorithms.
• Get the clustered data.
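
As an illustration of the "prepare the sequence file" step referenced above, the sketch below writes a handful of points as Mahout vectors into a Hadoop SequenceFile. It assumes Hadoop and the Mahout math library are on the classpath; the output path input/points/part-00000 and the sample values are made up for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class WritePointsAsSequenceFile {
    public static void main(String[] args) throws Exception {
        double[][] points = { {15}, {16}, {22}, {28}, {44}, {65} }; // sample 1-D ages
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("input/points/part-00000");           // hypothetical output path
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
        try {
            for (int i = 0; i < points.length; i++) {
                // Each record: a text key and the point wrapped as a Mahout vector
                writer.append(new Text("point-" + i),
                              new VectorWritable(new DenseVector(points[i])));
            }
        } finally {
            writer.close();
        }
    }
}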

Mahout supports several clustering-algorithm implementations, all written in Map-Reduce, each with its
own set of goals and criteria:

Canopy: A fast clustering algorithm often used to create initial seeds for other clustering algorithms.

k-Means (and fuzzy k-Means): Clusters items into k clusters based on the distance the items are from the
centroid, or center, of the previous iteration.

Mean-Shift: Algorithm that does not require any a priori knowledge about the number of clusters and can
produce arbitrarily shaped clusters.

Dirichlet: Clusters based on the mixing of many probabilistic models, giving it the advantage that it
doesn't need to commit to a particular view of the clusters prematurely.
