Unit 5 (DS)
UNIT V
5.1 Data Mining V/S Data Science
5.2 Experimentation, Evaluation and Project Deployment Tools
5.3 Predictive Analytics and Segmentation using Clustering
5.4 Applied Mathematics and Informatics, Exploratory Data Analysis
Data Mining: Data Mining is a technique to extract important and vital information and
knowledge from huge sets or libraries of data. It derives insight by carefully extracting,
reviewing, and processing the huge volume of data to find patterns and correlations that can be
important for the business. It is analogous to gold mining, where gold is extracted
from rocks and sand.
Below is a table of differences between Data Science and Data Mining:
S.No.  Data Science                                              Data Mining
5.     It deals with all types of data, i.e. structured,         It mainly deals with the structured
       unstructured, or semi-structured.                         forms of the data.
Experimentation:
1. A/B Testing
A/B testing is a way of measuring the impact of changes we make to our product. It could be as simple as changing the layout on a web
page or the color of a button and measuring the effect this change has on a key metric such as
click-through rates. In general, we can take two different approaches to A/B testing: the
frequentist approach and the Bayesian approach, each of which has its own advantages and
disadvantages.
Frequentist
I would say frequentist AB testing is by far the most common type of AB testing done and
follows directly from the principles of frequentist statistics. The goal here is to measure the causal
effect of our treatment by seeing if the difference between our metric in the A and B groups is
statistically significant at some significance level; 5 or 1 per cent is typically chosen. More
specifically, we will need to define a null and alternate hypothesis and determine if we can or
cannot reject the null. Depending on the type of metric we choose we might use a different
statistical test but chi-square and t-tests are commonly used in practice. A key point about the
frequentist approach is that the parameter or metric we compute is treated as a constant. Therefore, there is
no probability distribution associated with it: we can only make statements about how likely the
observed data are under the null hypothesis, not probabilistic statements about the metric itself.
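Below is a minimal sketch of a frequentist A/B test on click-through rates, using a chi-square test of independence from scipy; the click and impression counts are purely illustrative.

# A minimal frequentist A/B test on click-through rates (illustrative numbers).
import numpy as np
from scipy import stats

# Hypothetical counts: clicks and impressions for variants A and B.
clicks = np.array([200, 240])
impressions = np.array([10000, 10000])
no_clicks = impressions - clicks

# 2x2 contingency table: rows = variants, columns = click / no click.
table = np.column_stack([clicks, no_clicks])

# Chi-square test of independence: the null hypothesis is that the
# click-through rate is the same in both groups.
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(f"CTR A = {clicks[0] / impressions[0]:.4f}, CTR B = {clicks[1] / impressions[1]:.4f}")
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:          # 5% significance level, as in the text
    print("Reject the null hypothesis: the difference is statistically significant.")
else:
    print("Cannot reject the null hypothesis.")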
Bayesian
The key difference in the Bayesian approach is that our metric is a random variable and therefore
has a probability distribution. This is quite useful as we can now incorporate uncertainty about our
estimates and make probabilistic statements which are often much more intuitive to people than
the frequentist interpretation. Another advantage of using a Bayesian approach is that we may
reach a solution faster compared to frequentist A/B testing, as we do not necessarily need to assign equal
numbers of observations to each variant. This means that a Bayesian approach may converge to a solution with less data.
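As a rough illustration of the Bayesian approach, the sketch below uses a simple Beta-Binomial model: each variant's click-through rate gets a Beta posterior, and we sample from the posteriors to make probabilistic statements such as "the probability that B beats A". The counts are again illustrative.

# A minimal Bayesian A/B sketch using a Beta-Binomial model (illustrative numbers).
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: clicks out of impressions for each variant.
clicks_a, n_a = 200, 10000
clicks_b, n_b = 240, 10000

# Beta(1, 1) prior (uniform); the posterior is Beta(1 + clicks, 1 + non-clicks).
post_a = rng.beta(1 + clicks_a, 1 + n_a - clicks_a, size=100_000)
post_b = rng.beta(1 + clicks_b, 1 + n_b - clicks_b, size=100_000)

# Probabilistic statements that are hard to make in the frequentist setting:
prob_b_better = np.mean(post_b > post_a)
expected_lift = np.mean(post_b - post_a)

print(f"P(B has a higher click-through rate than A) = {prob_b_better:.3f}")
print(f"Expected lift of B over A = {expected_lift:.5f}")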
2. Regression Discontinuity Design (RDD)
Regression discontinuity design (RDD) is another technique, taken from economics, and is particularly suited when we
have a continuous natural cut-off point. Those just below the cut-off do not get the treatment and
those just above the cut-off do get the treatment. The idea is that these two groups of people are
very similar so the only systematic difference between them is whether they were treated or not.
Because the groups are considered very similar, this assignment essentially approximates
random selection. An example is that certificates of merit were only given to individuals who
scored above a certain threshold on a test. Using this approach we can now just compare the
average outcome between the two groups at the threshold to see if there is a statistically
significant effect.
It may help to visualise RDD with a graph that shows the average outcome below and
above the threshold. Essentially, all we are doing is measuring the difference between the two
fitted outcome lines at the threshold.
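Below is a minimal sketch of the RDD idea on simulated data: we compare average outcomes just below and just above a hypothetical cut-off score of 80, within a narrow bandwidth around the threshold. The cut-off, bandwidth, and simulated effect size are all assumptions made for illustration.

# A minimal regression discontinuity sketch on simulated data.
# The cut-off (80) and the bandwidth are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

scores = rng.uniform(0, 100, size=5000)            # running variable (test score)
treated = scores >= 80                             # certificate given above the cut-off
# Simulated outcome: depends smoothly on the score, plus a treatment effect of 2.
outcome = 0.05 * scores + 2.0 * treated + rng.normal(0, 1, size=5000)

bandwidth = 5                                      # only look close to the cut-off
below = outcome[(scores >= 80 - bandwidth) & (scores < 80)]
above = outcome[(scores >= 80) & (scores < 80 + bandwidth)]

t_stat, p_value = stats.ttest_ind(above, below, equal_var=False)
print(f"Estimated jump at the cut-off: {above.mean() - below.mean():.3f}")
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")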
3. Difference in Differences (Diff in Diff)
I will go into a bit more detail on Diff in Diff as I recently used this technique on a project. The
problem was one that many data scientists might come across in daily work and was related to
preventing churn. It is very common for companies to try and identify customers who are likely to
churn and then to design interventions to prevent it. Now identifying churn is a problem for
machine learning and I won't go into that here. What we care about is whether we can come up
with a way to measure the effectiveness of our churn intervention. Being able to empirically
measure the effectiveness of our decisions is incredibly important as we want to quantify the
effects of our features (usually in monetary terms) and it is a vital part of making informed
decisions for the business. One such intervention would be to send an email to those at risk of
churning reminding them of their account and in effect to try and make them more engaged with
our product. This is the basis of the problem we will be looking at here.
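Below is a minimal sketch of how the Diff in Diff estimate itself can be computed with pandas for the churn-email example; the group, period, and engagement columns and their values are hypothetical.

# A minimal difference-in-differences sketch for the churn-email example.
# The DataFrame columns (group, period, engagement) are hypothetical names.
import pandas as pd

# Hypothetical average engagement per customer, before/after the email campaign,
# for the treated group (received the email) and a comparable control group.
df = pd.DataFrame({
    "group":      ["treated", "treated", "control", "control"],
    "period":     ["before",  "after",   "before",  "after"],
    "engagement": [ 4.1,       5.0,       4.0,       4.3],
})

means = df.pivot(index="group", columns="period", values="engagement")

change_treated = means.loc["treated", "after"] - means.loc["treated", "before"]
change_control = means.loc["control", "after"] - means.loc["control", "before"]

# The diff-in-diff estimate: the extra change in the treated group,
# over and above the change the control group experienced anyway.
did = change_treated - change_control
print(f"Difference-in-differences estimate of the email's effect: {did:.2f}")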
Evaluation:
1. Hold-Out
In this method, a (usually large) dataset is randomly divided into three subsets: a training set used to fit the models, a validation set used to tune parameters and select the best model, and a test set used to assess the likely performance of the chosen model on new data.
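A minimal sketch of the hold-out method using scikit-learn's train_test_split is shown below; the 60/20/20 proportions and the random data are illustrative choices.

# A minimal hold-out split into training, validation, and test sets
# (the 60/20/20 proportions are just an illustrative choice).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)          # hypothetical feature matrix
y = np.random.randint(0, 2, 1000)    # hypothetical binary target

# First carve out the test set, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600, 200, 200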
2. Cross-Validation
When only a limited amount of data is available, to achieve an unbiased estimate of the model
performance we use k-fold cross-validation. In k-fold cross-validation, we divide the data into k
subsets of equal size. We build models k times, each time leaving out one of the subsets from training
and using it as the test set. If k equals the sample size, this is called "leave-one-out" cross-validation.
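Below is a minimal sketch of k-fold cross-validation (and leave-one-out) using scikit-learn on a toy dataset; the choice of model and of k = 5 is illustrative.

# A minimal k-fold cross-validation sketch (k = 5) on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)          # 5 folds
print("5-fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))

# If k equals the sample size, this becomes leave-one-out cross-validation.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("Leave-one-out mean accuracy:", loo_scores.mean().round(3))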
Model Deployment
The concept of deployment in data science refers to the application of a model for
prediction on new data. Building a model is generally not the end of the project. Even
if the purpose of the model is to increase knowledge of the data, the knowledge gained
will need to be organized and presented in a way that the customer can use it. Depending
on the requirements, the deployment phase can be as simple as generating a report or as
complex as implementing a repeatable data science process. In many cases, it will be the
customer, not the data analyst, who will carry out the deployment steps. For example, a
credit card company may want to deploy a trained model or set of models (e.g., neural
networks, meta-learner) to quickly identify transactions that have a high probability of
being fraudulent. However, even if the analyst will not carry out the deployment effort, it is
important for the customer to understand up front what actions will need to be carried out
in order to actually make use of the created models.
Project Deployment Tools:
MATLAB provides solutions for analyzing data, developing algorithms, and
creating models. It can be used for data analytics and wireless communications.
Java is an object-oriented programming language. The compiled Java code can be run on
any Java supported platform without recompiling it. Java is simple, object-oriented,
architecture-neutral, platform-independent, portable, multi-threaded, and secure.
Python is a high-level programming language and provides a large standard library. It supports
object-oriented, functional, and procedural programming styles, dynamic typing, and automatic
memory management.
Predictive Analytics:
Predictive analytics tools give users deep, real-time insights into an almost endless array
of business activities. Tools can be used to predict various types of behaviour and patterns,
such as how to allocate resources at particular times, when to replenish stock or the best
moment to launch a marketing campaign, basing predictions on an analysis of data
collected over a period of time.
Virtually all predictive analytics adopters use tools provided by one or more external
developers. Many such tools are tailored to meet the needs of specific enterprises and
departments. Major predictive analytics software and service providers include:
• Acxiom
• IBM
• Information Builders
• Microsoft
• SAP
• SAS Institute
• Tableau Software
• Teradata
• TIBCO Software
Segmentation using Clustering:
Cluster analysis is used in a variety of applications. For example, it can be used to identify
consumer segments, or competitive sets of products, or groups of assets whose prices co-move,
or for geo-demographic segmentation, etc. In general it is often necessary to split our data into
segments and perform any subsequent analysis within each segment in order to develop
(potentially more refined) segment-specific insights. This may be the case even if there are no
intuitively “natural” segments in our data.
There is not one single process for clustering and segmentation. However, we have to start somewhere,
so we will use the following process:
1. Confirm that the data are metric.
2. Scale the data (optional).
3. Select the segmentation variables.
4. Define similarity (distance) measures between observations.
5. Visualize the pairwise distances.
6. Apply a clustering method (e.g. K-means or Hierarchical Clustering) and decide how many segments to keep.
7. Profile and interpret the segments.
8. Iterate to assess the robustness of the solution.
While one can cluster data even if they are not metric, many of the statistical methods available
for clustering require that the data be metric: this means not only that all data are numbers, but also
that the numbers have an actual numerical meaning, that is, 1 is less than 2, which is less than 3,
etc. The main reason for this is that one needs to define distances between observations (see step
4 below), and often ("black box" mathematical) distances (e.g. the Euclidean distance) are
defined only for metric data.
However, one could potentially define distances also for non-metric data. In general, a “best
practice” for segmentation is to creatively define distance metrics between our observations.
This is an optional step. Note that, for the example data, while six of the "survey" variables are on a similar
scale, namely 1-7, there is one variable that is about two orders of magnitude larger: the Income
variable.
Having some variables with a very different range/scale can often create problems: most of the
"results" may be driven by a few large values, more so than we would like. To avoid such
issues, one has to consider whether or not to standardize the data by making some of the initial
raw attributes have, for example, mean 0 and standard deviation 1.
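Below is a minimal sketch of such standardization using scikit-learn's StandardScaler on a few hypothetical survey ratings and Income values.

# A minimal standardization sketch: survey ratings on a 1-7 scale next to an
# Income variable that is orders of magnitude larger (hypothetical values).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Q1": [5, 3, 6, 2], "Q2": [4, 7, 5, 1],
    "Income": [45000, 120000, 80000, 30000],
})

# After scaling, every column has mean 0 and standard deviation 1,
# so Income no longer dominates the distance calculations.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.round(2))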
The decision about which variables to use for clustering is a critically important decision that
will have a big impact on the clustering solution. So we need to think carefully about the
variables we will choose for clustering. Good exploratory research that gives us a good sense of
what variables may distinguish people or products or assets or regions is critical. Clearly this is a
step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
Moreover, we often use only a few of the data attributes for segmentation (the segmentation
attributes) and use some of the remaining ones (the profiling attributes) to profile the clusters found. For example, in
market research and market segmentation, one may use attitudinal data for segmentation (to
segment the customers based on their needs and attitudes towards the products/services) and then
demographic and behavioural data for profiling the segments found.
Remember that the goal of clustering and segmentation is to group observations based on how
similar they are. It is therefore crucial that we have a good understanding of what makes two
observations (e.g. customers, products, companies, assets, investments, etc) “similar”.
If the user does not have a good understanding of what makes two observations (e.g. customers,
products, companies, assets, investments, etc) “similar”, no statistical method will be able to
discover the answer to this question.
Most statistical methods for clustering and segmentation use common mathematical measures of
distance. Typical measures are, for example, the Euclidean distance or the Manhattan distance.
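Below is a minimal sketch computing pairwise Euclidean and Manhattan distances between a few (already standardized) observations with scipy; the numbers are illustrative.

# A minimal sketch of pairwise Euclidean and Manhattan distances
# between a few (already standardized) observations.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.2, -1.0, 0.5],
              [0.1, -0.8, 0.4],
              [2.0,  1.5, -1.0]])

euclidean = squareform(pdist(X, metric="euclidean"))
manhattan = squareform(pdist(X, metric="cityblock"))

print("Euclidean distances:\n", euclidean.round(2))
print("Manhattan distances:\n", manhattan.round(2))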
Having defined what we mean by "two observations are similar", the next step is to get a first
understanding of the data through visualizing for example individual attributes as well as the
pairwise distances (using various distance metrics) between the observations. If there are indeed
multiple segments in our data, some of these plots should show “mountains and valleys”, with
the mountains being potential segments.
Visualization is very important for data analytics, as it can provide a first understanding of the
data.
Two widely used methods are the K-means Clustering Method and the Hierarchical Clustering
Method. Like all clustering methods, these two also require that we have decided how to
measure the distance/similarity between our observations. Explaining how these methods work is
beyond our scope. The only difference to highlight is that K-means requires the user to define
how many segments to create, while Hierarchical Clustering does not.
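Below is a minimal sketch of both methods on toy data, using scikit-learn's KMeans and scipy's hierarchical clustering; the choice of two segments and the Ward linkage are illustrative assumptions.

# A minimal sketch of K-means (k chosen up front) and hierarchical clustering
# (number of segments decided after inspecting the linkage tree).
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Toy data: two obvious groups of observations.
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# K-means: the number of segments (k = 2) must be specified by the user.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: build the full tree, then cut it into 2 clusters.
tree = linkage(X, method="ward")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

print("K-means labels:      ", kmeans_labels)
print("Hierarchical labels: ", hier_labels)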
Having decided (for now) how many clusters to use, we would like to get a better understanding
of who the customers in those clusters are and interpret the segments.
Data analytics is used to eventually make decisions, and that is feasible only when we are
comfortable (enough) with our understanding of the analytics results, including our ability to
clearly interpret them.
To this purpose, one needs to spend time visualizing and understanding the data within each of
the selected segments. For example, one can see how the summary statistics (e.g. averages,
standard deviations, etc) of the profiling attributes differ across the segments.
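Below is a minimal sketch of such profiling with pandas, comparing the mean and standard deviation of hypothetical profiling attributes (age, income) across segments.

# A minimal profiling sketch: compare summary statistics of (hypothetical)
# profiling attributes across the segments found by clustering.
import pandas as pd

profiles = pd.DataFrame({
    "segment": [1, 1, 1, 2, 2, 2],
    "age":     [24, 29, 27, 52, 48, 55],
    "income":  [32000, 41000, 38000, 90000, 85000, 99000],
})

# Mean and standard deviation of each profiling attribute per segment.
summary = profiles.groupby("segment").agg(["mean", "std"])
print(summary.round(1))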
The segmentation process outlined so far can be followed with many different approaches: for
example, using different segmentation attributes, different distance metrics, different clustering
methods, or a different number of clusters.
Much like any data analysis, segmentation is an iterative process with many variations of data,
methods, number of clusters, and profiles generated until a satisfying solution is reached.
Data Analytics is an iterative process, therefore we may need to return to our original raw data at
any point and select new raw attributes as well as new clusters.
Applied Mathematics:
Machine learning is powered by four critical mathematical concepts: Statistics, Linear Algebra,
Probability, and Calculus. While statistical concepts are the core part of every model,
calculus helps us learn and optimize a model. Linear algebra comes in exceptionally handy
when we are dealing with a huge dataset, and probability helps in predicting the likelihood
of events occurring. These are the mathematical concepts that you will
encounter quite frequently in your data science and machine learning career.
1. Linear Algebra
2. Calculus
Some of the necessary topics to ace the calculus part of data science are differential and integral
calculus. Gradient, divergence, curvature, and quadratic approximations are all important concepts you can learn and
implement.
3. Probability Theory
In machine learning, there are three major sources of uncertainty: noisy data, limited
coverage of the problem area, and of course imperfect models. However, with the help of
the right probability tools, we can estimate the solution to the problem.
Probability is essential for hypothesis testing and for distributions like the Gaussian (normal) distribution.
4. Descriptive statistics
Descriptive statistics is a critical concept that every aspiring data scientist needs to learn in order to
understand machine learning when working with classification methods such as logistic regression.
5. Discrete Maths
Many applications necessitate the use of discrete numbers. The fundamentals of discrete
math are usually sufficient for machine learning, unless you wish to work with relational domains, graphical
models, or combinatorial problems.
Informatics is: A collaborative activity that involves people, processes, and technologies
to apply trusted data in a useful and understandable way.
The relationship between informatics and data science depends on your perspective. Individuals
who started their work in clinical care or informatics see the powerful tools and analytic
techniques of data science as part of informatics that help researchers and informatics
practitioners understand the data more effectively and leverage that for patient care. Individuals
who started their careers in the computational work of statistics or computer science look to
informatics as a subset of data science that helps them focus attention on collecting data and
understanding how to apply knowledge gained from data science techniques to improve care.
Informatics is:
• Less technical and less theoretical than data analytics
• Much less math based
• More focused on end users and tailoring systems to satisfy the needs of end users within a
specific discipline, such as health
• Focused on design thinking skills that encourage a bias toward action and the notion that
it is acceptable to make changes or corrections as new ideas and approaches come to
fruition
• A collaborative field where informatics specialists work with peers to identify, frame and
solve human-computer interaction issues within the framework of a content discipline,
such as health
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods. The main
purpose of EDA is to help look at data before making any assumptions. It can help identify
obvious errors, better understand patterns within the data, detect outliers or anomalous
events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals. There are four primary types of EDA:
Univariate non-graphical. This is the simplest form of data analysis, where the data being
analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with
causes or relationships. The main purpose of univariate analysis is to describe the data
and find patterns that exist within it.
Univariate graphical. Non-graphical methods don’t provide a full picture of the data.
Graphical methods are therefore required. Common types of univariate graphics
include:
o Stem-and-leaf plots, which show all data values and the shape of the
distribution.
o Histograms, a bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.
o Box plots, which graphically depict the five-number summary of minimum, first
quartile, median, third quartile, and maximum.
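Below is a minimal sketch of univariate graphical EDA, drawing a histogram and a box plot of one simulated variable with pandas and matplotlib.

# A minimal univariate graphical EDA sketch: a histogram and a box plot
# of one (simulated) variable.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series(np.random.default_rng(0).normal(50, 10, 500), name="score")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
values.plot.hist(bins=30, ax=axes[0], title="Histogram")   # frequency per range of values
values.plot.box(ax=axes[1], title="Box plot")              # five-number summary
plt.tight_layout()
plt.show()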
Multivariate non-graphical: Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship between
two or more variables of the data through cross-tabulation or statistics.
Multivariate graphical: Multivariate data uses graphics to display relationships
between two or more sets of data. The most used graphic is a grouped bar plot or bar
chart with each group representing one level of one of the variables and each bar within
a group representing the levels of the other variable.
Other common types of multivariate graphics include:
Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show
how much one variable is affected by another.
Multivariate chart, which is a graphical representation of the relationships between
factors and a response.
Run chart, which is a line graph of data plotted over time.
Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.
Heat map, which is a graphical representation of data where values are depicted by
color.
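Below is a minimal sketch of multivariate graphical EDA, drawing a scatter plot of two simulated variables and a heat map of their correlation matrix with pandas and matplotlib.

# A minimal multivariate graphical EDA sketch: a scatter plot of two variables
# and a heat map of the correlation matrix (simulated data).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 0.8 * df["x"] + rng.normal(scale=0.5, size=200)
df["z"] = rng.normal(size=200)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: how much y is affected by x.
axes[0].scatter(df["x"], df["y"], s=10)
axes[0].set(title="Scatter plot", xlabel="x", ylabel="y")

# Heat map: correlation values depicted by colour.
corr = df.corr()
im = axes[1].imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
axes[1].set_xticks(range(len(corr)))
axes[1].set_xticklabels(corr.columns)
axes[1].set_yticks(range(len(corr)))
axes[1].set_yticklabels(corr.columns)
axes[1].set_title("Correlation heat map")
fig.colorbar(im, ax=axes[1])

plt.tight_layout()
plt.show()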