
Royal Education Society’s

College of Computer Science and Information Technology, Latur

Department of Computer Science


Academic Year (2021-22)
Choice Based Credit System (CBCS Pattern)

Class / Semester: BSC CS TY/ V Name of Paper: Data Science (BCS-503)


Faculty Name : Shazia Farheen

UNIT V
5.1 Data Mining V/S Data Science
5.2 Experimentation, Evaluation and Project Deployment Tools
5.3 Predictive Analytics and Segmentation using Clustering
5.4 Applied Mathematics and Informatics, Exploratory Data Analysis

5.1 Data Mining V/S Data Science

Difference between Data Science and Data Mining


Data Science: Data Science is a field or domain that involves working with huge amounts of data and using it to build descriptive, predictive and prescriptive analytical models. It is about digging into and capturing data (building the model), analyzing it (validating the model) and utilizing it (deploying the best model).
It sits at the intersection of data and computing, blending Computer Science, Business and Statistics.

Data Mining: Data Mining is a technique for extracting important and vital information and knowledge from huge sets or libraries of data. It derives insights by carefully extracting, reviewing and processing large volumes of data to find patterns and correlations that can be important for the business. It is analogous to gold mining, where gold is extracted from rocks and sand.
Below are the main differences between Data Science and Data Mining:

1. Data Science is an area; Data Mining is a technique.
2. Data Science is about the collection, processing, analysis and use of data in various operations, and is more conceptual; Data Mining is about extracting vital and valuable information from data.
3. Data Science is a field of study, like Computer Science, Applied Statistics or Applied Mathematics; Data Mining is a technique that forms part of the Knowledge Discovery in Databases (KDD) process.
4. The goal of Data Science is to build data-dominant products for a venture; the goal of Data Mining is to make data more vital and usable, i.e. by extracting only the important information.
5. Data Science deals with all types of data, i.e. structured, unstructured or semi-structured; Data Mining mainly deals with structured data.
6. Data Science is a superset of Data Mining, since it covers data scraping, cleaning, visualization, statistics and many more techniques; Data Mining is a subset of Data Science, one stage in the Data Science pipeline.
7. Data Science is mainly used for scientific purposes; Data Mining is mainly used for business purposes.
8. Data Science broadly focuses on the science of the data; Data Mining is more involved with the process.

5.2 Experimentation, Evaluation and Project Deployment Tools

Experimentation:
1. A/B Testing

AB testing is an extremely common method of experimentation used in industry to understand the impact of changes we make to our product. It could be as simple as changing the layout of a web page or the color of a button and measuring the effect this change has on a key metric such as click-through rate. In general, we can take two different approaches to AB testing: the frequentist approach and the Bayesian approach, each of which has its own advantages and disadvantages.

Frequentist

Frequentist AB testing is by far the most common type of AB testing done and follows directly from the principles of frequentist statistics. The goal is to measure the causal effect of our treatment by seeing whether the difference in our metric between the A and B groups is statistically significant at some significance level; 5 or 1 per cent is typically chosen. More specifically, we need to define a null and an alternative hypothesis and determine whether or not we can reject the null. Depending on the type of metric we choose we might use a different statistical test, but chi-square and t-tests are commonly used in practice. A key point about the frequentist approach is that the parameter or metric we compute is treated as a constant; therefore, there is no probability distribution associated with it.
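Below is a minimal sketch (with made-up click counts) of how such a frequentist test on click-through rates might look in Python using a chi-square test; the numbers are purely illustrative.

import numpy as np
from scipy import stats

# Hypothetical results: clicks and non-clicks for variants A and B
contingency = np.array([[120, 880],    # A: 120 clicks out of 1000 impressions
                        [150, 850]])   # B: 150 clicks out of 1000 impressions

chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")

# Reject the null hypothesis of "no difference" at the 5% significance level
if p_value < 0.05:
    print("The difference between A and B is statistically significant.")
else:
    print("We cannot reject the null hypothesis.")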

Bayesian

The key difference in the Bayesian approach is that our metric is a random variable and therefore has a probability distribution. This is quite useful, as we can now incorporate uncertainty about our estimates and make probabilistic statements, which are often much more intuitive to people than the frequentist interpretation. Another advantage of the Bayesian approach is that we may reach a solution faster than with frequentist AB testing, as we do not necessarily need to assign equal numbers of data points to each variant. This means that a Bayesian approach may converge to a solution faster using fewer resources.
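As a hedged illustration (reusing the same made-up counts as above), a simple Beta-Binomial model lets us state the probability that one variant beats the other directly:

import numpy as np

rng = np.random.default_rng(42)

# Uniform Beta(1, 1) priors; the posterior is Beta(1 + clicks, 1 + non-clicks)
post_a = rng.beta(1 + 120, 1 + 880, size=100_000)   # posterior CTR samples for A
post_b = rng.beta(1 + 150, 1 + 850, size=100_000)   # posterior CTR samples for B

# A probabilistic statement that the frequentist approach cannot give directly
print("P(B has a higher click-through rate than A):", (post_b > post_a).mean())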

2. Regression Discontinuity Design (RDD)

RDD is another technique available taken from economics and is particularly suited when we

have a continuous natural cut-off point. Those just below the cut-off do not get the treatment and
those just above the cut-off do get the treatment. The idea is that these two groups of people are

very similar so the only systematic difference between them is whether they were treated or not.

Because the groups are considered very similar these assignments essentially approximates

randomized selection. For example is that certificates of merit were only given to individuals who

scored above a certain threshold on a test. Using this approach we can now just compare the

average outcome between the two groups at the threshold to see if there is a statistically

significant effect.

It may help to visualise RDD. Below is a graph which shows the average outcome below and

above the threshold. Essentially all we are doing is measuring the difference between the two blue

lines beside the cutoff point.

[Figure: RDD example with the cut-off threshold at 50 on the x-axis]
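A minimal sketch of this idea in Python is shown below; the data are synthetic and the +2 jump at the threshold is built in purely for illustration.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: test scores and a later outcome with a jump of +2 at the threshold of 50
rng = np.random.default_rng(0)
score = rng.uniform(0, 100, 2000)
outcome = 0.1 * score + 2 * (score >= 50) + rng.normal(0, 1, 2000)
df = pd.DataFrame({"score": score, "outcome": outcome})

# Keep only observations close to the cut-off, where the two groups are comparable
cutoff, bandwidth = 50, 5
window = df[df["score"].between(cutoff - bandwidth, cutoff + bandwidth)].copy()
window["treated"] = (window["score"] >= cutoff).astype(int)
window["centered"] = window["score"] - cutoff

# Local linear fit: the `treated` coefficient estimates the jump at the cut-off
model = smf.ols("outcome ~ treated + centered + treated:centered", data=window).fit()
print(model.params["treated"], model.pvalues["treated"])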

3. Difference in Differences (Diff in Diff)

I will go into a bit more detail on Diff in Diff as I recently used this technique on a project. The problem is one that many data scientists might come across in their daily work and relates to preventing churn. It is very common for companies to try to identify customers who are likely to churn and then design interventions to prevent it. Identifying churn is a problem for machine learning and I won't go into that here. What we care about is whether we can come up with a way to measure the effectiveness of our churn intervention. Being able to empirically measure the effectiveness of our decisions is incredibly important, as we want to quantify the effects of our features (usually in monetary terms), and it is a vital part of making informed decisions for the business. One such intervention would be to send an email to those at risk of churning, reminding them of their account and in effect trying to make them more engaged with our product. This is the basis of the problem we will be looking at here.
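As a hedged sketch of the calculation itself (the group labels and engagement numbers below are invented for illustration), the diff-in-diff estimate is simply the change in the treated group minus the change in the control group:

import pandas as pd

# Hypothetical average engagement before and after the churn-prevention email
data = pd.DataFrame({
    "group":      ["treated", "treated", "control", "control"],
    "period":     ["before",  "after",   "before",  "after"],
    "engagement": [4.0,       5.5,       4.1,       4.6],
})

means = data.set_index(["group", "period"])["engagement"]
treated_change = means.loc[("treated", "after")] - means.loc[("treated", "before")]
control_change = means.loc[("control", "after")] - means.loc[("control", "before")]

# Netting out the control group's trend isolates the effect of the email itself
print("Diff-in-diff estimate of the email's effect:", treated_change - control_change)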

Evaluation and Project Deployment Tools


Model Evaluation
Model Evaluation is an integral part of the model development process. It helps to find the best
model that represents our data and how well the chosen model will work in the future. Evaluating
model performance with the data used for training is not acceptable in data science because it can
easily generate overoptimistic and overfitted models. There are two methods of evaluating models
in data science, Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set
(not seen by the model) to evaluate model performance.

1. Hold-Out

In this method, the (usually large) dataset is randomly divided into three subsets:

1. Training set is a subset of the dataset used to build predictive models.


2. Validation set is a subset of the dataset used to assess the performance of the model built in
the training phase. It provides a test platform for fine-tuning the model's parameters and
selecting the best-performing model. Not all modeling algorithms need a validation set.
3. Test set, or unseen examples, is a subset of the dataset used to assess the likely future performance
of a model. If a model fits the training set much better than it fits the test set, overfitting is
probably the cause.
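A minimal scikit-learn sketch of such a hold-out split is shown below; the synthetic data and the 60/20/20 proportions are illustrative assumptions, not part of the notes above.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 60% training, then split the remaining 40% evenly into validation and test sets
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Fit on the training set, tune parameters on the validation set,
# and report final performance on the untouched test set exactly once.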

2. Cross-Validation

When only a limited amount of data is available, we use k-fold cross-validation to achieve an unbiased
estimate of model performance. In k-fold cross-validation, we divide the data into k subsets of equal
size. We build models k times, each time leaving out one of the subsets from training and using it as
the test set. If k equals the sample size, this is called "leave-one-out".
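A short sketch of 5-fold cross-validation with scikit-learn follows; the logistic regression model and synthetic data are assumptions chosen only to make the example runnable.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the 5 folds is held out once as the test set while the model
# is trained on the remaining 4 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Accuracy per fold:", scores, "mean:", scores.mean())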
Model Deployment

The concept of deployment in data science refers to the application of a model for
prediction on new data. Building a model is generally not the end of the project. Even
if the purpose of the model is to increase knowledge of the data, the knowledge gained
will need to be organized and presented in a way that the customer can use. Depending
on the requirements, the deployment phase can be as simple as generating a report or as
complex as implementing a repeatable data science process. In many cases, it will be the
customer, not the data analyst, who carries out the deployment steps. For example, a
credit card company may want to deploy a trained model or set of models (e.g., neural
networks, meta-learner) to quickly identify transactions which have a high probability of
being fraudulent. However, even if the analyst will not carry out the deployment effort, it is
important for the customer to understand up front what actions will need to be carried out
in order to actually make use of the created models.

Model deployment methods:


In general, there are four ways of deploying models in data science.
1. Data science tools (or cloud)
2. Programming language (Java, C, VB)
3. Database and SQL script (T-SQL, PL/SQL)
4. PMML (Predictive Model Markup Language)
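As a hedged sketch of the first option (deployment through data science tools), a trained model can be serialized at training time and loaded later inside the application that scores new records; the file name and model choice below are illustrative assumptions.

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train on synthetic data standing in for the real training set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "fraud_model.joblib")      # done once, at training time

loaded = joblib.load("fraud_model.joblib")    # done inside the deployed application
print(loaded.predict(X[:5]))                  # score new records as they arrive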

Tools for Data Science

Following are some tools required for data science:


◦ Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner.
◦ Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
◦ Data Visualization tools: R, Jupyter, Tableau, Cognos.
◦ Machine learning tools: Spark, Mahout, Azure ML Studio.

MATLAB provides solutions for analyzing data, developing algorithms, and creating models. It can be used for data analytics and wireless communications.

Java is an object-oriented programming language. The compiled Java code can be run on
any Java supported platform without recompiling it. Java is simple, object-oriented,
architecture-neutral, platform-independent, portable, multi-threaded, and secure.
Python is a high-level programming language that provides a large standard library. It has
object-oriented, functional and procedural features, dynamic typing, and automatic
memory management.

5.3 Predictive Analytics and Segmentation using Clustering


Predictive analytics is a category of data analytics aimed at making predictions about
future outcomes based on historical data and analytics techniques such as statistical
modelling and machine learning. The science of predictive analytics can generate future
insights with a significant degree of precision. With the help of sophisticated predictive
analytics tools and models, any organization can now use past and current data to reliably
forecast trends and behaviors milliseconds, days, or years into the future.

Predictive analytics tools give users deep, real-time insights into an almost endless array
of business activities. Tools can be used to predict various types of behaviour and patterns,
such as how to allocate resources at particular times, when to replenish stock or the best
moment to launch a marketing campaign, basing predictions on an analysis of data
collected over a period of time.

Virtually all predictive analytics adopters use tools provided by one or more external
developers. Many such tools are tailored to meet the needs of specific enterprises and
departments. Major predictive analytics software and service providers include:

•Acxiom
•IBM
•Information Builders
•Microsoft
•SAP
•SAS Institute
•Tableau Software
•Teradata
•TIBCO Software

Cluster Analysis and Segmentation


Clustering techniques are used to group data/observations in a few segments so that data within
any segment are similar while data across segments are different. Defining what we mean when
we say “similar” or “different” observations is a key part of cluster analysis which often requires
a lot of contextual knowledge and creativity beyond what statistical tools can provide.

Cluster analysis is used in a variety of applications. For example it can be used to identify
consumer segments, or competitive sets of products, or groups of assets whose prices co-move,
or for geo-demographic segmentation, etc. In general it is often necessary to split our data into
segments and perform any subsequent analysis within each segment in order to develop
(potentially more refined) segment-specific insights. This may be the case even if there are no
intuitively “natural” segments in our data.

Clustering and Segmentation

There is not one process for clustering and segmentation. However, we have to start somewhere,
so we will use the following process:

Clustering and Segmentation in 8 steps

1. Confirm data is metric


2. Scale the data
3. Select Segmentation Variables
4. Define similarity measure
5. Visualize Pair-wise Distances
6. Method and Number of Segments
7. Profile and interpret the segments
8. Robustness Analysis

Step 1: Confirm data is metric

While one can cluster data even if they are not metric, many of the statistical methods available
for clustering require that the data be metric: this means not only that all data are numbers, but also
that the numbers have an actual numerical meaning, that is, 1 is less than 2, which is less than 3,
etc. The main reason for this is that one needs to define distances between observations (see Step
4 below), and often ("black box" mathematical) distances (e.g. the Euclidean distance) are
defined only for metric data.

However, one could potentially define distances also for non-metric data. In general, a “best
practice” for segmentation is to creatively define distance metrics between our observations.

Step 2: Scale the data

This is an optional step. For example, in a typical survey dataset most of the "survey" variables may
be on a similar scale, namely 1-7, while one variable, such as Income, is about two orders of
magnitude larger.
Having some variables with a very different range/scale can often create problems: most of the
"results" may be driven by a few large values, more so than we would like. To avoid such
issues, one has to consider whether or not to standardize the data by making some of the initial
raw attributes have, for example, mean 0 and standard deviation 1.
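A short sketch of this standardization with scikit-learn is shown below; the column names and values are invented purely to illustrate the scaling.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "satisfaction": [3, 6, 5, 2, 7],                          # 1-7 survey scale
    "income":       [42000, 95000, 61000, 38000, 120000],     # much larger scale
})

# After scaling, every column has mean 0 and standard deviation 1,
# so income no longer dominates the distance calculations
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.round(2))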

Step 3: Select Segmentation Variables

The decision about which variables to use for clustering is a critically important decision that
will have a big impact on the clustering solution. So we need to think carefully about the
variables we will choose for clustering. Good exploratory research that gives us a good sense of
what variables may distinguish people or products or assets or regions is critical. Clearly this is a
step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.

Moreover, we often use only a few of the data attributes for segmentation (the segmentation
attributes) and use some of the remaining ones (the profiling attributes) to profile and interpret
the segments found. For example, in market research and market segmentation, one may use
attitudinal data for segmentation (to segment the customers based on their needs and attitudes
towards the products/services) and then demographic and behavioural data for profiling the
segments found.

Step 4: Define similarity measure

Remember that the goal of clustering and segmentation is to group observations based on how
similar they are. It is therefore crucial that we have a good understanding of what makes two
observations (e.g. customers, products, companies, assets, investments, etc) “similar”.

If the user does not have a good understanding of what makes two observations (e.g. customers,
products, companies, assets, investments, etc) “similar”, no statistical method will be able to
discover the answer to this question.

Most statistical methods for clustering and segmentation use common mathematical measures of
distance. Typical measures are, for example, the Euclidean distance or the Manhattan distance.
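A tiny sketch of both distance measures with SciPy (made-up observations) follows:

import numpy as np
from scipy.spatial.distance import pdist, squareform

observations = np.array([[1.0, 2.0],
                         [2.0, 4.0],
                         [8.0, 8.0]])

# Pair-wise distance matrices between the three observations
print(squareform(pdist(observations, metric="euclidean")))
print(squareform(pdist(observations, metric="cityblock")))   # Manhattan distance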

Step 5: Visualize Pair-wise Distances

Having defined what we mean by "two observations are similar", the next step is to get a first
understanding of the data by visualizing, for example, individual attributes as well as the
pairwise distances (using various distance metrics) between the observations. If there are indeed
multiple segments in our data, some of these plots should show "mountains and valleys", with
the mountains being potential segments.

Visualization is very important for data analytics, as it can provide a first understanding of the
data.
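One way to look for these "mountains and valleys" is to plot the pair-wise distance matrix as a heat map; the sketch below uses synthetic data with two built-in groups and assumes seaborn and matplotlib are available.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
# Two made-up groups: points around (0, 0) and points around (5, 5)
data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Blocks of small distances along the diagonal hint at potential segments
sns.heatmap(squareform(pdist(data)))
plt.show()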

Step 6: Method and Number of Segments


There are many statistical methods for clustering and segmentation. In practice one may use
various approaches and then eventually select the solution that is statistically robust,
interpretable and actionable, among other criteria.

Two widely used methods are the K-means clustering method and the hierarchical clustering
method. Like all clustering methods, these two also require that we have decided how to
measure the distance/similarity between our observations. Explaining how these methods work is
beyond our scope. The key difference to highlight is that K-means requires the user to define
how many segments to create, while hierarchical clustering does not.
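A compact sketch of the two methods on the same synthetic data is shown below; the two-cluster data and the parameter choices are assumptions made only for illustration.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

# K-means needs the number of segments up front ...
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# ... while hierarchical (agglomerative) clustering builds a full tree
# that can be cut into any number of segments afterwards
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(data)

print(kmeans_labels[:10], hier_labels[:10])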

Step 7: Profile and interpret the segments

Having decided (for now) how many clusters to use, we would like to get a better understanding
of who the customers in those clusters are and interpret the segments.

Data analytics is used to eventually make decisions, and that is feasible only when we are
comfortable (enough) with our understanding of the analytics results, including our ability to
clearly interpret them.

To this purpose, one needs to spend time visualizing and understanding the data within each of
the selected segments. For example, one can see how the summary statistics (e.g. averages,
standard deviations, etc) of the profiling attributes differ across the segments.
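A minimal profiling sketch with pandas (hypothetical segment labels and profiling attributes) might look like this:

import pandas as pd

profile = pd.DataFrame({
    "segment": [0, 0, 1, 1, 1],
    "age":     [24, 29, 51, 47, 55],
    "income":  [35000, 41000, 88000, 92000, 79000],
})

# Average and spread of each profiling attribute within each segment
print(profile.groupby("segment").agg(["mean", "std"]))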

Step 8: Robustness Analysis

The segmentation process outlined so far can be followed with many different approaches, for
example:

• using different subsets of the original data
• using variations of the original segmentation attributes
• using different distance metrics
• using different segmentation methods
• using different numbers of clusters

Much like any data analysis, segmentation is an iterative process with many variations of data,
methods, number of clusters, and profiles generated until a satisfying solution is reached.

Data Analytics is an iterative process, therefore we may need to return to our original raw data at
any point and select new raw attributes as well as new clusters.

5.4 Applied Mathematics and Informatics, Exploratory Data Analysis

Applied mathematics is the application of mathematical methods by different fields such


as physics, engineering, medicine, biology, finance, business, computer science, and
industry. Thus, applied mathematics is a combination of mathematical science and
specialized knowledge. The term "applied mathematics" also describes the professional
specialty in which mathematicians work on practical problems by formulating and
studying mathematical models.

Machine learning is powered by four critical mathematical concepts: Statistics, Linear Algebra,
Probability, and Calculus. While statistical concepts are the core part of every model,
calculus helps us learn and optimize a model. Linear algebra comes in exceptionally handy
when we are dealing with a huge dataset, and probability helps in predicting the likelihood
of events occurring. These are the mathematical concepts that you will encounter quite
frequently in your data science and machine learning career.

1. Linear Algebra

Linear algebra is applied in machine learning algorithms in loss functions, regularisation,
covariance matrices, Singular Value Decomposition (SVD), matrix operations, and
support vector machine classification. It is also applied in machine learning algorithms
like linear regression.
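As one concrete illustration of these operations, the short NumPy sketch below computes a Singular Value Decomposition of a small, arbitrary matrix:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Decompose A into U, the singular values s, and V transpose
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", s)
print("reconstruction matches A:", np.allclose(U @ np.diag(s) @ Vt, A))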

2. Calculus

Some of the necessary topics to master the calculus part of data science are differential and
integral calculus, partial derivatives, vector-valued functions, and directional gradients.

Multivariate calculus is utilized in algorithm training as well as in gradient descent. Derivatives,
divergence, curvature, and quadratic approximations are all important concepts you can learn and
implement.
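A tiny gradient-descent sketch (an illustrative example, not taken from the notes) shows how a derivative is used to minimize f(x) = (x - 3)^2:

# Minimize f(x) = (x - 3)^2 using its derivative f'(x) = 2(x - 3)
x, learning_rate = 0.0, 0.1
for _ in range(100):
    gradient = 2 * (x - 3)         # slope of f at the current x
    x -= learning_rate * gradient  # step downhill
print(x)   # converges towards the minimum at x = 3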

3. Probability Theory
In machine learning, there are three major sources of uncertainty: noisy data, limited
coverage of the problem area, and of course imperfect models. However, with the help of
the right probability tools, we can estimate the solution to the problem.

Probability is essential for hypothesis testing and for distributions like the Gaussian
distribution and its probability density function.

4. Descriptive statistics

Descriptive statistics is a critical concept that every aspiring data scientist needs to learn in
order to understand machine learning when working with classifications like logistic regression,
distributions, discriminant analysis, and hypothesis testing.

5. Discrete Maths

Discrete mathematics is concerned with non-continuous numbers, most often integers.
Many applications necessitate the use of discrete numbers. The fundamentals of discrete
math are usually enough for machine learning, unless you wish to work with relational
domains, graphical models, combinatorial problems, structured prediction, and so on.


Informatics

Informatics is: A collaborative activity that involves people, processes, and technologies
to apply trusted data in a useful and understandable way.

The relationship between informatics and data science depends on your perspective. Individuals
who started their work in clinical care or informatics see the powerful tools and analytic
techniques of data science as part of informatics that help researchers and informatics
practitioners understand the data more effectively and leverage that for patient care. Individuals
who started their careers in the computational work of statistics or computer science look to
informatics as a subset of data science that helps them focus attention on collecting data and
understanding how to apply knowledge gained from data science techniques to improve care.
Informatics is:
•Less technical and less theoretical than data analytics
•Much less math based
•More focused on end users and tailoring systems to satisfy the needs of end users within a
specific discipline, such as health
•Focused on design thinking skills that encourage a bias toward action and the notion that
it is acceptable to make changes or corrections as new ideas and approaches come to
fruition
•A collaborative field where informatics specialists work with peers to identify, frame and
solve human computer interaction issues within the framework of a content discipline,
such as health

Exploratory data analysis:

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods. The main
purpose of EDA is to help look at data before making any assumptions. It can help identify
obvious errors, better understand patterns within the data, detect outliers or anomalous
events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals.

Types of exploratory data analysis

There are four primary types of EDA:

• Univariate non-graphical. This is the simplest form of data analysis, where the data being
analyzed consists of just one variable. Since it's a single variable, it doesn't deal with
causes or relationships. The main purpose of univariate analysis is to describe the data
and find patterns that exist within it.
• Univariate graphical. Non-graphical methods don't provide a full picture of the data.
Graphical methods are therefore required. Common types of univariate graphics
include:
o Stem-and-leaf plots, which show all data values and the shape of the
distribution.
o Histograms, bar plots in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.
o Box plots, which graphically depict the five-number summary of minimum, first
quartile, median, third quartile, and maximum.
• Multivariate non-graphical. Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship between
two or more variables of the data through cross-tabulation or statistics.
• Multivariate graphical. Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data. The most used graphic is a grouped bar plot or bar
chart, with each group representing one level of one of the variables and each bar within
a group representing the levels of the other variable.
Other common types of multivariate graphics include:

• Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show
how much one variable is affected by another.
• Multivariate chart, which is a graphical representation of the relationships between
factors and a response.
• Run chart, which is a line graph of data plotted over time.
• Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.
• Heat map, which is a graphical representation of data where values are depicted by
color.
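The short sketch below touches on several of these graphic types using a synthetic dataset; the column names, group labels and libraries (pandas, matplotlib, seaborn) are assumptions made purely for illustration.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "x": rng.normal(50, 10, 200),
    "group": rng.choice(["A", "B"], 200),
})
df["y"] = 2 * df["x"] + rng.normal(0, 10, 200)

df["x"].hist()                                    # univariate graphical: histogram
plt.show()
df.boxplot(column="x", by="group")                # box plot of x for each group
plt.show()
df.plot.scatter(x="x", y="y")                     # multivariate graphical: scatter plot
plt.show()
sns.heatmap(df[["x", "y"]].corr(), annot=True)    # heat map of the correlations
plt.show()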
