BDA Notes Unit-5
Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.
With the help of sample historical data, which is known as training data, machine learning
algorithms build a mathematical model that helps in making predictions or decisions without
being explicitly programmed. Machine learning brings computer science and statistics together
for creating predictive models. Machine learning constructs or uses algorithms that learn
from historical data: the more information we provide, the better the performance.
A machine has the ability to learn if it can improve its performance by gaining more data.
Suppose we have a complex problem where we need to make some predictions. Instead
of writing code for it, we just need to feed the data to generic algorithms; with the help
of these algorithms, the machine builds the logic as per the data and predicts the output.
Machine learning has changed our way of thinking about such problems. The block diagram
below explains the working of a machine learning algorithm:
We can train machine learning algorithms by providing them with huge amounts of data and
letting them explore the data, construct models, and predict the required output automatically. The
performance of the machine learning algorithm depends on the amount of data, and it can be
determined by the cost function. With the help of machine learning, we can save both time and
money.
The importance of machine learning can be easily understood by its use cases. Currently,
machine learning is used in self-driving cars, cyber fraud detection, face recognition,
friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon
have built machine learning models that use vast amounts of data to analyze user
interests and recommend products accordingly.
Machine learning algorithms are broadly classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
What is R Analytics?
R has become increasingly popular over many years and remains a top analytics
language for many universities and colleges. It is well established today within
academia as well as among corporations around the world for delivering robust,
reliable, and accurate analytics. While R programming was originally seen as
difficult for non-statisticians to learn, the user interface has become more user-
friendly in recent years. It also now supports extensions and tools such as
RStudio and RExcel, making the learning process easier and faster for new
business analysts and other users. It has become the industry standard for
statistical analysis and data mining projects and is due to grow in use as more
graduates enter the workforce as R-trained analysts.
Leveraging Big Data: R can help with querying big data and is used by many
industry leaders to leverage big data across the business. With R analytics,
organizations can surface new insights in their large data sets and make sense of
their data. R can handle these big datasets and is arguably as easy to use as, if not
easier than, any of the other analytics tools available today.
Common applications of R analytics include:
Statistical testing
Prescriptive analytics
Predictive analytics
Time-series analysis
What-if analysis
Regression models
Data exploration
Forecasting
Text mining
Data mining
Visual analytics
Web analytics
Social media analytics
Sentiment analysis
It provides good explanatory code. For example, if you are at the early stage
of a machine learning project and need to explain the work you do, it is
easier to do so in R than in Python, as R provides proper statistical methods
for working with data in fewer lines of code.
The R language is well suited to data visualization and provides good
facilities for prototyping machine learning models.
The R language has strong tools and library packages for machine learning
projects. Developers can use these packages in the pre-modelling, modelling,
and post-modelling stages of a project. Many statistical packages for R are
more advanced and extensive than their Python counterparts, which makes R
a natural choice for such projects. Two examples:
lattice: The lattice package supports the creation of graphs displaying one
variable, or the relationship between multiple variables, conditioned on other
variables.
DataExplorer: This R package focuses on automating data visualization and
data handling so that the user can concentrate on the data insights of the
project.
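A minimal sketch of these two packages in use, assuming the built-in mtcars dataset (DataExplorer must be installed separately):

library(lattice)
# Scatter plots of fuel efficiency vs. weight, conditioned on cylinder count
xyplot(mpg ~ wt | factor(cyl), data = mtcars,
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

library(DataExplorer)
plot_missing(mtcars)    # visualize the missing-value profile
plot_histogram(mtcars)  # histograms of all continuous variables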
Many top companies such as Google, Facebook, and Uber use the R language for
machine learning. The applications include:
Social Network Analytics
To analyze trends and patterns
Getting insights into the behaviour of users
To find the relationships between the users
Developing analytical solutions
Accessing charting components
Embedding interactive visual graphics
Web search and voice assistants like Siri, Alexa, Google, Cortana: Recognize
the user’s voice and fulfill the request made
Social Media Service: Help people to connect all over the world and also
show recommendations of people we may know
Online Customer Support: Provide greater convenience for customers and
efficiency for support agents
Intelligent Gaming: Use highly responsive and adaptive non-player
characters with human-like intelligence
Product Recommendation: A software tool used to recommend the
product that you might like to purchase or engage with
Virtual Personal Assistants: Software that can perform tasks according to
the instructions provided
Traffic Alerts: Provide traffic alerts according to the current situation
Online Fraud Detection: Check for unusual actions performed by the user
and detect fraud
Healthcare: Machine learning can manage amounts of data beyond the
capacity of a normal human being and helps to identify a patient's illness
according to symptoms
Real-world example: When you search for some kind of cooking recipe on
YouTube, you will see recommendations below with the title “You May
Also Like This”. This is a common use of machine learning.
Packages
We will be using, directly or indirectly, the following packages through the chapters:
caret
ggplot2
mlbench
class
caTools
randomForest
impute
ranger
kernlab
glmnet
naivebayes
rpart
rpart.plot
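As a minimal sketch of how some of these packages fit together, here is a caret workflow with the randomForest backend on the built-in iris dataset (the model choice and tuning settings are illustrative assumptions):

library(caret)
set.seed(42)

# Split the labelled data into training and test sets
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# Train a random forest classifier with 5-fold cross-validation
fit <- train(Species ~ ., data = train_set, method = "rf",
             trControl = trainControl(method = "cv", number = 5))

# Evaluate on the held-out test set
pred <- predict(fit, newdata = test_set)
confusionMatrix(pred, test_set$Species)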
Supervised Learning:
Supervised learning is a type of machine learning in which machines are trained using
well-labelled training data, and on the basis of that data, machines predict the output.
Labelled data means input data that is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly. It applies the
same concept as a student learning under the supervision of a teacher.
In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.
The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of its number of sides and predicts the output.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the
correct output, it means our model is accurate.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
Regression:
• Dependent Variable: This is the variable that we are trying to understand or forecast.
• Independent Variable: These are factors that influence the analysis or target variable
and provide us with information regarding the relationship of the variables with the
target variable.
Regression analysis is used for prediction and forecasting. This statistical method is
used across different industries such as,
• Financial Industry- Understand the trend in the stock prices, forecast the prices, and
evaluate risks in the insurance domain
• Marketing- Understand the effectiveness of market campaigns, and forecast pricing
and sales of the product.
• Manufacturing- Evaluate the relationships among variables that determine engine
design, to deliver better performance
• Medicine- Forecast the different combinations of medicines to prepare generic
medicines for diseases.
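A minimal sketch of regression in R using the built-in cars dataset, with stopping distance as the dependent variable and speed as the independent variable:

# Fit a simple linear regression: dist (dependent) ~ speed (independent)
model <- lm(dist ~ speed, data = cars)
summary(model)  # coefficients, R-squared, significance tests

# Forecast stopping distances for new speeds
predict(model, newdata = data.frame(speed = c(10, 20, 25)))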
Logistic Regression
Logistic regression is used when the dependent variable is categorical (for example,
yes/no or 0/1); it models the probability of a class using the logistic function.
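A minimal sketch with base R's glm(), modelling transmission type (a 0/1 variable) from car weight in the built-in mtcars dataset:

# Logistic regression: probability of a manual transmission (am = 1) given weight
logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for a 3000 lb car (wt is in 1000 lbs)
predict(logit, newdata = data.frame(wt = 3.0), type = "response")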
Classification
In Regression algorithms, we have predicted the output for
continuous values, but to predict the categorical values, we
need Classification algorithms.
What is the Classification Algorithm?
The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data. In classification, a
program learns from the given dataset or observations and then classifies new
observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or
Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.
Unlike regression, the output variable of classification is a category, not a value, such
as "green or blue" or "fruit or animal". Since the classification algorithm is a
supervised learning technique, it takes labeled input data, which means the input
comes with the corresponding output.
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised learning algorithm used for
classification as well as regression problems, though primarily for classification.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine. Consider the below diagram, in which two different categories are
classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs. If we want
a model that can accurately identify whether it is a cat or a dog, such a model can be
created using the SVM algorithm. We will first train our model with lots of images of
cats and dogs so that it can learn their different features, and then we test it with this
strange creature. The SVM creates a decision boundary between the two classes (cat
and dog) and chooses the extreme cases (support vectors) of each. On the basis of the
support vectors, it will classify the new creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a
dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the
dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight
line, and if there are 3 features, the hyperplane will be a two-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum
distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.
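A minimal sketch using the kernlab package from the list above, fitting a non-linear SVM with a radial-basis (RBF) kernel on the built-in iris data (the kernel and cost settings are illustrative assumptions):

library(kernlab)
set.seed(1)

# Train a non-linear SVM classifier; support vectors define the hyperplane
svm_fit <- ksvm(Species ~ ., data = iris, kernel = "rbfdot", C = 1)

# Predict the class of a few observations
predict(svm_fit, iris[c(1, 60, 120), ])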
Unsupervised learning
It uses machine learning algorithms to analyze and cluster
unlabeled datasets.
These algorithms discover hidden patterns or data groupings
without the need for human intervention.
• Unsupervised machine learning finds all kinds of unknown
patterns in data.
• Unsupervised methods help you to find features which can
be useful for categorization.
• It takes place in real time, so all the input data is analyzed
and labeled in the presence of learners.
• It is easier to get unlabeled data from a computer than
labeled data, which needs manual intervention.
Working of Unsupervised Learning
Clustering is one of the most common unsupervised learning problems:
data points are grouped by similarity. Clustering techniques can be of
the following kinds.
Exclusive (partitioning)
In this clustering technique, data are grouped in such a way that one
data point can belong to one cluster only.
Example: K-means
Agglomerative
In this clustering technique, every data point starts as its own cluster.
Iterative unions between the two nearest clusters reduce the number of
clusters.
Example: Hierarchical clustering
Overlapping
In this technique, fuzzy sets are used to cluster data. Each point
may belong to two or more clusters with separate degrees of
membership.
Here, data will be associated with an appropriate membership
value. Example: Fuzzy C-Means
Probabilistic
This technique uses probability distribution to create the
clusters.
Clustering Types
Following are the clustering types of Machine Learning:
• Hierarchical clustering
• K-means clustering
• K-NN (k nearest neighbors)
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
Hierarchical Clustering
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar,
but they differ in how they work. Moreover, there is no requirement to predetermine the
number of clusters, as there is in the K-means algorithm.
o Step-1: Create each data point as a single cluster. Let's say there are N data points,
so the number of clusters will also be N.
o Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form one
cluster. There will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left. We will then get the following
clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
The distance between two clusters can be measured using the following linkage methods:
1. Single Linkage: It is the shortest distance between the closest points of the two clusters.
Consider the below image:
2. Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than single-
linkage.
3. Average Linkage: It is the linkage method in which the distance between each pair of
data points across the two clusters is added up and then divided by the total number of
pairs, giving the average distance between the two clusters. It is also one of the most
popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroids
of the clusters is calculated. Consider the below image:
From the above-given approaches, we can apply any of them according to the type of
problem or business requirement.
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and
so on.
It allows us to cluster the data into different groups and is a convenient way to discover
the categories of groups in an unlabeled dataset on its own, without the need for any training.
The k-means algorithm mainly performs two tasks:
o Determines the best value for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near a
particular k-center create a cluster.
Hence each cluster has datapoints with some commonalities and is distinct from the
other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the
input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step: reassign each datapoint to the new closest centroid of each
cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables
is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them
into different clusters. It means here we will try to group these datasets into two
different clusters.
o We need to choose some random k points or centroids to form the clusters. These
points can be either points from the dataset or any other points. So, here we are
selecting the below two points as k points, which are not part of our dataset.
o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have studied
to calculate the distance between two points. So, we will draw a median between
both the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near the K1 or
blue centroid, and points on the right of the line are close to the yellow centroid. Let's
color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new
centroids. To choose the new centroids, we will compute the center of gravity of
each cluster and place the new centroids there.
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat
the same process of finding a median line. The median will be like the below image:
From the above image, we can see that one yellow point is on the left side of the line, and
two blue points are to the right of the line. So, these three points will be assigned to new
centroids.
Since reassignment has taken place, we again go to Step-4, which is finding new
centroids or K-points.
o We will repeat the process by finding the center of gravity of each cluster, so the new
centroids will be as shown in the below image:
o As we have the new centroids, we will again draw the median line and reassign the
data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side
of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
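A minimal sketch of this procedure with base R's kmeans(), using simulated values for the two variables M1 and M2 (the simulated data are an illustrative assumption):

set.seed(7)

# Simulated dataset with two variables, M1 and M2
data <- data.frame(M1 = c(rnorm(25, 5), rnorm(25, 15)),
                   M2 = c(rnorm(25, 5), rnorm(25, 15)))

# K = 2: group the points into two clusters
km <- kmeans(data, centers = 2)

km$centers                            # the final centroids
plot(data, col = km$cluster)          # points coloured by assigned cluster
points(km$centers, pch = 8, cex = 2)  # mark the centroids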
Agglomerative clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering
(HAC). It produces a structure that is more informative than the unstructured set of
clusters returned by flat clustering. This clustering algorithm does not require us to
prespecify the number of clusters. Bottom-up algorithms treat each data point as a
singleton cluster at the outset and then successively agglomerate pairs of clusters until
all clusters have been merged into a single cluster that contains all the data.
Steps:
Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.
In the second step, comparable clusters are merged together to form a single
cluster. Let's say cluster (B) and cluster (C) are very similar to each other,
so we merge them in the second step, and similarly clusters (D) and (E).
At last, we get the clusters [(A), (BC), (DE), (F)].
We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC),
(DEF)]
Repeating the same process, the clusters DEF and BC are comparable and
merged together to form a new cluster. We're now left with clusters [(A),
(BCDEF)].
At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].
Dendrogram
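A minimal sketch with base R's hclust(), clustering six points labelled A-F and plotting the resulting dendrogram (the coordinates are an illustrative assumption):

set.seed(3)

# Six points labelled A-F (coordinates made up for illustration)
pts <- matrix(rnorm(12), nrow = 6,
              dimnames = list(LETTERS[1:6], c("x", "y")))

# Agglomerative clustering with single linkage on Euclidean distances;
# other linkage options: "complete", "average", "centroid"
hc <- hclust(dist(pts, method = "euclidean"), method = "single")

plot(hc, main = "Dendrogram")  # visualize the merge hierarchy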
K- Nearest neighbors
K-Nearest Neighbours is one of the most basic yet essential classification algorithms
in Machine Learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining, and intrusion detection.
It is widely used in real-life scenarios since it is non-parametric, meaning it
does not make any underlying assumptions about the distribution of the data (as
opposed to other algorithms such as GMM, which assume a Gaussian distribution of
the given data). We are given some prior data (also called training data), which
classifies coordinates into groups identified by an attribute.
As an example, consider a table of data points containing two features, where each
point is assigned to a group.
Euclidean Distance
This is nothing but the Cartesian distance between two points in the plane/hyperplane,
which can be visualized as the length of the straight line joining them. For points x and
y in n dimensions:
distance(x, y) = sqrt(sum_i (x_i - y_i)^2)
This metric helps us calculate the net displacement between two states of an object.
Manhattan Distance
This distance metric is generally used when we are interested in the total distance
traveled by an object rather than its displacement. It is calculated by summing the
absolute differences between the coordinates of the points in n dimensions:
distance(x, y) = sum_i |x_i - y_i|
Minkowski Distance
Euclidean distance and Manhattan distance are both special cases of the Minkowski
distance:
distance(x, y) = (sum_i |x_i - y_i|^p)^(1/p)
From this formula, when p = 2 it is the same as the formula for Euclidean distance,
and when p = 1 we obtain the formula for Manhattan distance.
The above-discussed metrics are the most common when dealing with a machine
learning problem, but there are other distance metrics as well, such as the Hamming
distance, which comes in handy for problems requiring element-wise comparisons
between two vectors whose contents can be boolean or string values.
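A minimal sketch computing these metrics with base R's dist() for two points (the coordinates are illustrative):

# Two points in 3-dimensional space; coordinate differences are (3, 4, 5)
pts <- rbind(a = c(1, 2, 3), b = c(4, 6, 8))

dist(pts, method = "euclidean")         # sqrt(3^2 + 4^2 + 5^2) ~= 7.07
dist(pts, method = "manhattan")         # |3| + |4| + |5| = 12
dist(pts, method = "minkowski", p = 3)  # (3^3 + 4^3 + 5^3)^(1/3) = 6
# With p = 2, Minkowski equals Euclidean; with p = 1, it equals Manhattan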
How to choose the value of k for KNN Algorithm?
The value of k defines the number of neighbours considered and is crucial in the KNN
algorithm. It should be chosen based on the input data: if the data contains many
outliers or much noise, a higher value of k is usually better. It is recommended to
choose an odd value for k to avoid ties in classification. Cross-validation methods can
help in selecting the best k value for the given dataset.
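A minimal sketch of selecting k by cross-validation with the caret package (the candidate k values are an illustrative assumption):

library(caret)
set.seed(9)

# 10-fold cross-validation over odd values of k (odd values avoid ties)
fit <- train(Species ~ ., data = iris, method = "knn",
             trControl = trainControl(method = "cv", number = 10),
             tuneGrid = data.frame(k = seq(1, 21, by = 2)))

fit$bestTune  # the k with the best cross-validated accuracy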
Applications of the KNN Algorithm
Data Preprocessing – While dealing with any machine learning problem,
we first perform the EDA part; if we find that the data contains missing
values, multiple imputation methods are available. One such method is the
KNN imputer, which is quite effective and generally used for sophisticated
imputation methodologies (see the sketch after this list).
Pattern Recognition – KNN works very well for pattern recognition: if you
train a KNN classifier on the MNIST handwritten-digit dataset and then
evaluate it, you will find that the accuracy is remarkably high.
Recommendation Engines – The main task performed by a KNN algorithm
is to assign a new query point to a pre-existing group that has been created
using a huge corpus of data. This is exactly what is required in recommender
systems: assign each user to a particular group and then provide
recommendations based on that group’s preferences.
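A minimal sketch of KNN imputation with the impute package from the list above (a Bioconductor package; the toy matrix and the value of k are illustrative assumptions):

library(impute)  # Bioconductor package listed earlier

# Toy numeric matrix with a couple of missing values
m <- matrix(c(1, 2, NA, 4,
              2, NA, 6, 8,
              1, 3, 5, 7), nrow = 3, byrow = TRUE)

# Fill each NA using its k nearest rows (by Euclidean distance)
imputed <- impute.knn(m, k = 2)
imputed$data  # the completed matrix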
Advantages of the KNN Algorithm
Easy to implement as the complexity of the algorithm is not that high.
Adapts Easily – Because KNN stores all the training data in memory,
whenever a new example or data point is added, the algorithm adjusts to
that new example and lets it contribute to future predictions as well.
Few Hyperparameters – The only parameters required in training a KNN
algorithm are the value of k and the choice of distance metric.
Disadvantages of the KNN Algorithm
Does not scale – The KNN algorithm is considered a lazy algorithm: it
defers all computation to prediction time, which requires a lot of computing
power as well as data storage. This makes the algorithm both
time-consuming and resource-exhausting.
Curse of Dimensionality – Owing to what is known as the peaking
phenomenon, the KNN algorithm is affected by the curse of dimensionality:
it has a hard time classifying data points properly when the dimensionality
is too high.
Prone to Overfitting – As the algorithm is affected by the curse of
dimensionality, it is prone to overfitting as well. Hence, feature selection
and dimensionality reduction techniques are generally applied to deal with
this problem.
Principal Component Analysis (PCA)
PCA generally tries to find a lower-dimensional surface onto which to project the
high-dimensional data.
PCA works by considering the variance of each attribute, because high variance
indicates a good split between the classes; in this way it reduces the dimensionality.
Some real-world applications of PCA are image processing, movie recommendation
systems, and optimizing power allocation in various communication channels. It is a
feature extraction technique, so it retains the important variables and drops the least
important ones.
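A minimal sketch with base R's prcomp() on the four numeric measurements in the built-in iris dataset:

# PCA on the numeric iris measurements, scaled to unit variance
pca <- prcomp(iris[, 1:4], scale. = TRUE)

summary(pca)        # proportion of variance explained per component
head(pca$x[, 1:2])  # data projected onto the first two principal components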
Association
Association rules allow you to establish associations amongst
data objects inside large databases. This unsupervised
technique is about discovering interesting relationships
between variables in large databases.
For example, people who buy a new home are most likely to buy new furniture.
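A minimal sketch of association-rule mining with the arules package (arules is not in the package list above, so its use here is an assumption; Groceries is a transactions dataset shipped with arules):

library(arules)
data(Groceries)

# Mine rules meeting minimum support and confidence thresholds
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

inspect(head(sort(rules, by = "lift"), 3))  # the top rules by lift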
Comparing the two approaches on computational complexity: supervised learning is a
simpler method, while unsupervised learning is computationally complex.
Collaborative filtering
Collaborative filtering is a technique that can filter out items
that a user might like on the basis of reactions by similar users.
It works by searching a large group of people and finding a
smaller set of users with tastes similar to a particular user.
What is a Recommendation system?
There are a lot of applications where websites collect data from
their users and use that data to predict their users' likes and
dislikes. This allows them to recommend content that their users
will like. Recommender systems are a way of suggesting similar
items and ideas suited to a user’s specific way of thinking.
There are basically two types of recommender Systems:
• Model-based
Model-based CF uses machine learning algorithms to
predict users’ rating of unrated items.
There are many model-based CF algorithms; the most
commonly used are matrix factorization models, such as
applying SVD to reconstruct the rating matrix, latent
Dirichlet allocation, or Markov decision process based
models.
• Hybrid
These aim to combine the memory-based and the model-based
approaches. One of the main drawbacks of the above methods
is that you’ll find yourself having to choose between historical
user rating data and user or item attributes.
Hybrid methods enable us to leverage both, and hence tend to
perform better in most cases. The most widely used methods
nowadays are factorization machines.
Memory-based CF
There are 2 main types of memory-based collaborative filtering
algorithms: User-Based and Item-Based. While their difference
is subtle, in practice they lead to very different approaches, so
it is crucial to know which is the most convenient for each case.
Let’s go through a quick overview of these methods:
• User-Based
Starting from a given user, we find other users with similar
rating histories and recommend items that those similar users
rated highly.
• Item-Based
The idea is similar, but instead, starting from a given movie (or
set of movies), we find similar movies based on other users’
preferences.
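A minimal sketch of item-based memory-based CF in base R, using cosine similarity on a toy user-item rating matrix (the ratings are an illustrative assumption):

# Toy rating matrix: rows = users, columns = movies (NA = unrated)
ratings <- rbind(u1 = c(5, 4, NA, 1),
                 u2 = c(4, 5, 1, NA),
                 u3 = c(1, NA, 5, 4))
colnames(ratings) <- paste0("movie", 1:4)

# Cosine similarity between two item columns, ignoring missing ratings
cosine <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  sum(a[ok] * b[ok]) / (sqrt(sum(a[ok]^2)) * sqrt(sum(b[ok]^2)))
}

# Items most similar to movie1: candidates to recommend to its fans
sapply(colnames(ratings), function(j) cosine(ratings[, "movie1"], ratings[, j]))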
Mobile Analytics
Mobile analytics involves measuring and analysing data generated by mobile
platforms and properties, such as mobile websites and mobile applications.
The actual installation of mobile analytics involves adding tracking code to the sites and
SDKs to the mobile applications teams want to track. Most mobile analytics platforms will
be set up to automatically track website visits.
Platforms with codeless mobile features will be able to automatically track certain basic
features of apps such as crashes, errors, and clicks, but you’ll want to expand that by
manually tagging additional actions for tracking. With mobile analytics in place, you’ll
have deeper insights into your mobile web and app users which you can use to create
competitive, world-class products and experiences.
Collecting the data necessary for successful mobile analytics is often the
greatest challenge organizations face when attempting to understand
consumer behavior on mobile devices. Many devices do not allow cookies
to track actions, or do not use JavaScript, which can also help with website
data tracking.