
Chapter 1

Data Analytics
What Is Data Analytics?
Data analytics is the science of analyzing raw data to make conclusions about that information.
Many of the techniques and processes of data analytics have been automated into mechanical
processes and algorithms that work over raw data for human consumption.
KEY POINTS
 Data analytics is the science of analyzing raw data to make conclusions about that information.
 The techniques and processes of data analytics have been automated into mechanical processes
and algorithms that work over raw data for human consumption.
 Data analytics help a business optimize its performance.
Understanding Data Analytics
Data analytics is a broad term that encompasses many diverse types of data analysis. Any type of
information can be subjected to data analytics techniques to get insight that can be used to
improve things. Data analytics techniques can reveal trends and metrics that would otherwise be
lost in the mass of information. This information can then be used to optimize processes to
increase the overall efficiency of a business or system.
For example, manufacturing companies often record the runtime, downtime, and work queue for
various machines and then analyze the data to better plan the workloads so the machines operate
closer to peak capacity.
Data analytics can do much more than point out bottlenecks in production. Gaming companies
use data analytics to set reward schedules for players that keep the majority of players active in
the game. Content companies use many of the same data analytics to keep you clicking,
watching, or re-organizing content to get another view or another click.
Data analytics is important because it helps businesses optimize their performances.
Implementing it into the business model means companies can help reduce costs by identifying
more efficient ways of doing business and by storing large amounts of data. A company can also
use data analytics to make better business decisions and help analyze customer trends and
satisfaction, which can lead to new—and better—products and services.
Data Analysis Steps
The data analysis process involves several steps:
1. The first step is to determine the data requirements or how the data is grouped. Data may be
separated by age, demographic, income, or gender. Data values may be numerical or be divided
by category.
2. The second step in data analytics is the process of collecting it. This can be done through a
variety of sources such as computers, online sources, cameras, environmental sources, or
through personnel.
3. Once the data is collected, it must be organized so it can be analyzed. This may take place on a
spreadsheet or other form of software that can take statistical data.
4. The data is then cleaned up before analysis. This means it is scrubbed and checked to ensure
there is no duplication or error, and that it is not incomplete. This step helps correct any errors
before it goes on to a data analyst to be analyzed.
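A minimal sketch of steps 3 and 4 in pandas, assuming the collected data has already been exported to a CSV file; the file name and column names here are hypothetical:

```python
import pandas as pd

# Step 3: organize the collected data into a tabular structure
df = pd.read_csv("survey_data.csv")   # hypothetical file

# Step 4: clean the data before analysis
df = df.drop_duplicates()                             # remove duplicated records
df = df.dropna(subset=["age", "income"])              # drop rows missing key fields
df["gender"] = df["gender"].str.strip().str.lower()   # normalize a categorical column

print(df.describe())   # quick summary the analyst can start from
```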

Types of Data Analytics


Data analytics is broken down into basic types.
1. Descriptive analytics: This describes what has happened over a given period of time. Have
the number of views gone up? Are sales stronger this month than last?
2. Diagnostic analytics: This focuses more on why something happened. This involves more
diverse data inputs and a bit of hypothesizing. Did the weather affect beer sales? Did that latest
marketing campaign impact sales?
3. Predictive analytics: This moves to what is likely going to happen in the near term. What
happened to sales the last time we had a hot summer? How many weather models predict a hot
summer this year?
4. Prescriptive analytics: This suggests a course of action. If the likelihood of a hot summer,
measured as the average of these five weather models, is above 58%, we should add an evening
shift to the brewery and rent an additional tank to increase output.
5. Mechanistic analytics (requires the most effort): This seeks to understand the exact changes in
variables that lead to changes in other variables for individual objects.
6. Exploratory Data Analysis (EDA): This is an approach to analyzing data using visual techniques. It is
used to discover trends and patterns, or to check assumptions, with the help of statistical summaries
and graphical representations.
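As a rough EDA sketch (the dataset and column names are hypothetical), statistical summaries and simple plots can be produced in a few lines:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_data.csv")   # hypothetical dataset

print(df.describe())                 # statistical summary of numeric columns

df["monthly_sales"].hist(bins=30)    # distribution of one variable
plt.title("Monthly sales distribution")
plt.show()

df.plot.scatter(x="temperature", y="beer_sales")   # relationship between two variables
plt.title("Temperature vs. beer sales")
plt.show()
```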
Data analytics underpins many quality control systems in the financial world, including
the ever-popular Six Sigma program. If you aren’t properly measuring something—
whether it's your weight or the number of defects per million in a production line—it is
nearly impossible to optimize it.
Benefits of Data Analytics

1. Decision making improves


Companies may use the information they obtain from data analytics to guide their decisions,
leading to improved results. Data analytics removes a lot of guesswork from preparing
marketing plans, deciding what material to make, creating goods, and more. With advanced
data analytics technologies, new data can be constantly gathered and analyzed to enhance your
understanding of changing circumstances.

2. Marketing becomes more effective


When businesses understand their customers better, they will be able to sell to them more
efficiently. Data analytics also gives businesses invaluable insights into how their marketing
campaigns work so that they can fine-tune them for better results.

3. Customer service improves


Data analytics provides businesses with deeper insight into their clients, helping them to
customize customer experience to their needs, offer more customization, and create better
relationships with them.

4. The efficiency of operations increases


Data analytics will help businesses streamline their operations, save resources, and improve the
bottom line. When businesses obtain a better idea of what the audience needs, they spend less
time producing advertisements that do not meet the desires of the audience.

Who Is Using Data Analytics?


Data analytics has been adopted by several sectors, such as the travel and hospitality industry,
where turnarounds can be quick. This industry can collect customer data and figure out where the
problems, if any, lie and how to fix them. Healthcare is another sector that combines the use of
high volumes of structured and unstructured data and data analytics can help in making quick
decisions. Similarly, the retail industry uses copious amounts of data to meet the ever-changing
demands of shoppers. The information retailers collect and analyze can help them identify trends,
recommend products, and increase profits.
Data Analytics vs Data Analysis
Data analysis, data analytics. Two terms for the same concept? Or different, but related,
terms?

It’s a common misconception that data analysis and data analytics are the same thing.
The generally accepted distinction is:

 Data analytics is the broad field of using data and tools to make business
decisions.
 Data analysis, a subset of data analytics, refers to specific actions.
To explain this confusion—and attempt to clear it up—we’ll look at both terms,
examples, and tools.

What is data analysis?


Consider data analysis one slice of the data analytics pie. Data analysis consists of
cleaning, transforming, modeling, and questioning data to find useful information. (It’s
generally agreed that other slices are other activities, from collection to storage to
visualization.)

The act of data analysis is usually limited to a single, already prepared dataset. You’ll
inspect, arrange, and question the data. Today, in the 2020s, software or a “machine”
usually does a first round of analysis, often directly in one of your databases or tools.
This is then augmented by a human who investigates and interrogates the data with
more context.

When you’re done analyzing a dataset, you’ll turn to other data analytics activities to:

 Give others access to the data


 Present the data (ideally with data visualization or storytelling)
 Suggest actions to take based on the data

Which is better?
Brack Nelson, Marketing Manager at Incrementors SEO Services, suggests that the
outcome of data analytics is more encompassing and beneficial than the output of data
analysis alone.

Mathematical model
A mathematical model is an abstract model that uses mathematical language
to describe the behavior of a system.
Mathematical modelling is the process of describing a real-world problem in
mathematical terms, usually in the form of equations, and then using these equations
both to help understand the original problem and to discover new features of the
problem.

Suppose you are building a rectangular sandbox for your neighbor's toddler to play in,
and you have two options available based on the building materials you have. The
sandbox can have a length of 8 feet and a width of 5 feet, or a length of 7 feet and a
width of 6 feet, and you want the sandbox to have as large an area as possible. In other
words, you want to determine which dimensions will result in the larger area of a
rectangle. Thankfully, in mathematics, we have a formula for the area (A) of a rectangle
based on its length (l) and width (w).

 A=l×w
Awesome! We can use this formula to figure out which dimensions will make a bigger
sandbox!
We can calculate the two areas by plugging in our lengths and widths for each choice:

 A1 = 8 × 5 = 40 square feet
 A2 = 7 × 6 = 42 square feet
We see that a length of 7 feet and a width of 6 feet will result in the larger area of 42
square feet. Problem solved!

Mathematical models are of different types

Linear Models
A linear model is an equation that describes a relationship between two quantities that
show a constant rate of change.

Example
The table below shows the cost of an ice cream cone y with x toppings. Write an
equation that models the relationship between x and y.

Toppings (x)    Cost (y)
2               $3.50
3               $3.75
5               $4.25

Answer
Adding one additional topping costs $0.25. If an ice cream cone with two toppings costs $3.50
and each topping costs $0.25, then a cone without any toppings must cost $3.00. Therefore, the
rate of change is 0.25, the initial value is 3, and y = 0.25x + 3, where x is the number of
toppings.
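The same line can be recovered from the table with a simple least-squares fit; this is only a sketch using the three data points above:

```python
import numpy as np

# Toppings (x) and cost (y) from the table above
x = np.array([2, 3, 5])
y = np.array([3.50, 3.75, 4.25])

# Fit a degree-1 (linear) model: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)   # approximately 0.25 and 3.0, i.e. y = 0.25x + 3
```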

Non Linear Models


If a regression equation doesn’t follow the rules for a linear model, then it must be a nonlinear
model. It’s that simple! A nonlinear model is literally not linear.

A nonlinear model describes nonlinear relationships in experimental data.


Nonlinear regression models are generally assumed to be parametric, where the
model is described as a nonlinear equation.
Example
The nonlinear regression example below models the relationship between density and electron
mobility.

The equation for the nonlinear regression analysis is too long for the fitted line plot:

Electron Mobility = (1288.14 + 1491.08 * Density Ln + 583.238 * Density Ln^2 + 75.4167 * Density Ln^3) /
(1 + 0.966295 * Density Ln + 0.397973 * Density Ln^2 + 0.0497273 * Density Ln^3)
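A sketch of how such a model could be fit with SciPy's curve_fit; the data arrays below are placeholders standing in for the experimental density/mobility measurements, which are not reproduced here:

```python
import numpy as np
from scipy.optimize import curve_fit

# Rational-polynomial form of the electron-mobility model, with x = ln(density)
def mobility_model(x, b1, b2, b3, b4, b5, b6, b7):
    numerator = b1 + b2 * x + b3 * x**2 + b4 * x**3
    denominator = 1 + b5 * x + b6 * x**2 + b7 * x**3
    return numerator / denominator

# Placeholder data so the sketch runs; real values would come from the experiment
density_ln = np.linspace(0.5, 4.0, 50)
mobility = mobility_model(density_ln, 1288, 1491, 583, 75, 0.97, 0.4, 0.05)

params, _ = curve_fit(mobility_model, density_ln, mobility,
                      p0=[1000, 1000, 500, 50, 1, 0.5, 0.05])
print(params)   # fitted coefficients b1..b7
```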

Empirical models
Empirical models are supported only by experimental data.
Empirical modelling is a generic term for activities that create models
by observation and experiment. It relies on observation rather than theory.
Example: fitting a curve directly to observed sales figures, with no underlying
theory of why sales behave that way, produces an empirical model.
Mechanistic Model
A mechanistic model uses a theory to predict what will happen in the real
world. The alternative approach, empirical modeling, studies real-world
events to develop a theory.
Mechanistic models are useful if you have good data for making predictions.
For example, if you're designing a new plane, there's lots of information
on how plane design affects the plane's interaction with air pressure, wind
speed and gravity. You'd want to make some empirical tests before taking
passengers aboard, but mechanistic models can give you a good start.
Deterministic models
Deterministic models assume there is no variation in results: they do not
have any probabilistic (random) elements. The output is fully determined once the set of
inputs and the relationships in the model have been specified.
A deterministic model assumes certainty in all aspects.
Example, the conversion between Celsius and Kelvin is deterministic,
because the formula is not random…it is an exact formula that will always
give you the correct answer (assuming you perform the calculations
correctly):
Kelvin = Celsius + 273.15.
Stochastic Models
For a model to be stochastic, it must have a random variable where a level of
uncertainty exists. Due to the uncertainty present in a stochastic model, the results
provide an estimate of the probability of various outcomes.

Example
Stochastic investment models attempt to forecast the variations of prices,
returns on assets (ROA), and asset classes—such as bonds and stocks—
over time.
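A small sketch contrasting the two kinds of model: the temperature conversion is deterministic, while the toy price simulation below is stochastic (the numbers are made up for illustration):

```python
import random

def celsius_to_kelvin(celsius):
    # Deterministic: the same input always produces the same output
    return celsius + 273.15

def simulate_price(start=100.0, days=30):
    # Stochastic: each run differs because of the random daily change
    price = start
    for _ in range(days):
        price += random.gauss(0, 1)
    return price

print(celsius_to_kelvin(25))                 # always 298.15
print(simulate_price(), simulate_price())    # two different outcomes
```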
Black Box Model
In science, computing, and engineering, a black box is a device, system, or
object which produces useful information without revealing any information
about its internal workings. The explanations for its conclusions remain
opaque or “black.”
Financial analysts, hedge fund managers, and investors may use software
that is based on a black-box model in order to transform data into a useful
investment strategy.
Black box is shorthand for models that are sufficiently complex that they are
not straightforwardly interpretable to humans.
 A black box model receives inputs and produces outputs but its
workings are unknowable.
 Black box models are increasingly used to drive decision-making in the
financial markets.
 Technology advances, particularly in machine learning capabilities,
make it impossible for a human mind to analyze or understand
precisely how black box models produce their conclusions.
Machine learning techniques have greatly contributed to the growth and
sophistication of black box models.

Descriptive modeling

Descriptive modeling is a mathematical process that describes real-world events and the
relationships between factors responsible for them. The process is used by consumer-
driven organizations to help them target their marketing and advertising efforts.

In descriptive modeling, customer groups are clustered according to demographics,


purchasing behavior, expressed interests and other descriptive factors. Statistics can
identify where the customer groups share similarities and where they differ. The most active
customers get special attention because they offer the greatest ROI (return on investment).

The main aspects of descriptive modeling include:

 Customer segmentation: Partitions a customer base into groups with various


impacts on marketing and service.
 Value-based segmentation: Identifies and quantifies the value of a customer to
the organization.
 Behavior-based segmentation: Analyzes customer product usage and
purchasing patterns.
 Needs-based segmentation: Identifies ways to capitalize on motives that drive
customer behavior.

Descriptive modeling can help an organization to understand its customers, but predictive
modeling is necessary to facilitate the desired outcomes.
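A minimal customer-segmentation sketch using k-means clustering; the customer attributes and values below are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer attributes: [age, yearly spend, purchases per month]
customers = np.array([
    [25,  300,  2],
    [34, 1200,  8],
    [52,  200,  1],
    [41, 1500, 10],
    [29,  450,  3],
    [60,  150,  1],
])

# Cluster the customer base into segments based on these descriptive factors
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # segment assigned to each customer
```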

Evaluation Metrics
Introduction
Evaluation metrics are tied to machine learning tasks. There are
different metrics for the tasks of classification and regression. Some
metrics, like precision-recall, are useful for multiple tasks.
Classification and regression are examples of supervised learning,
which constitutes a majority of machine learning applications. Using
different metrics for performance evaluation, we should be able to
improve our model’s overall predictive power before we roll it out for
production on unseen data. Evaluating a machine learning model only on
accuracy, without using other evaluation metrics, can lead to problems when the
model is deployed on unseen data and may result in poor
predictions.
In the next section, I’ll discuss the Classification evaluation metrics
that could help in the generalization of the ML classification model.
Classification Metrics
Classification is about predicting the class labels given input data. In
binary classification, there are only two possible output classes (i.e., a
dichotomy). In multiclass classification, more than two possible
classes can be present. I’ll focus only on binary classification.

A very common example of binary classification is spam detection,


where the input data could include the email text and metadata
(sender, sending time), and the output label is either “spam” or “not
spam.” (See Figure) Sometimes, people use some other names also
for the two classes: “positive” and “negative,” or “class 1” and
“class 0.”

Figure — Email spam detection is a binary classification problem (source: Evaluating Machine Learning Models, O’Reilly)

There are many ways of measuring classification performance.
Accuracy, confusion matrix, log-loss, and AUC-ROC are some of the
most popular metrics. Precision-recall is a widely used metric for
classification problems.
Accuracy
Accuracy simply measures how often the classifier predicts
correctly. We can define accuracy as the ratio of the number of
correct predictions to the total number of predictions.
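A small sketch of this definition, computed by hand and with scikit-learn on made-up labels:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (hypothetical)

# Accuracy = number of correct predictions / total number of predictions
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(manual)                           # 0.75
print(accuracy_score(y_true, y_pred))   # same value via scikit-learn
```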

When a model gives an accuracy rate of 99%, you might think it is
performing very well, but this is not always true and
can be misleading in some situations. I am going to explain this with
the help of an example.

Consider a binary classification problem, where a model can achieve


only two results, either model gives a correct or incorrect prediction.
Now imagine we have a classification task to predict whether an image is a
dog or a cat. In a supervised learning
algorithm, we first fit/train a model on training data, then test the
model on testing data. Once we have the model’s predictions from
the X_test data, we compare them to the true y_values (the correct
labels).
We feed the image of a dog into the trained model. If the
model predicts that this is a dog, we compare the
prediction to the correct label and count it as correct. If the model predicts that this image
is a cat, we again compare it to the correct label, and it
would be incorrect.

We repeat this process for all images in X_test data. Eventually,


we’ll have a count of correct and incorrect matches. But in reality, it
is very rare that all incorrect or correct matches hold equal value.
Therefore one metric won’t tell the entire story.
Accuracy is useful when the target class is well balanced but is not
a good choice for unbalanced classes. Imagine a scenario
where we had 99 images of dogs and only 1 image of a cat
in our training data. Then a model that always predicts
"dog" would achieve 99% accuracy. In reality, data is often imbalanced,
for example in spam email detection, credit card fraud, and
medical diagnosis. Hence, if we want a better model
evaluation and a full picture of model performance, other
metrics such as recall and precision should also be considered.

Confusion Matrix is a performance measurement for the machine


learning classification problems where the output can be two or
more classes. It is a table with combinations of predicted and actual
values.

A confusion matrix is defined as the table that is often used
to describe the performance of a classification model on a
set of test data for which the true values are known.
It is extremely useful for measuring the Recall, Precision, Accuracy,
and AUC-ROC curves.
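A short sketch of building a confusion matrix with scikit-learn, reusing the hypothetical labels from the accuracy example:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (hypothetical)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```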

Let’s try to understand TP, FP, FN, and TN using a pregnancy analogy.
True Positive: We predicted positive and it’s true. In the image, we
predicted that a woman is pregnant and she actually is.

True Negative: We predicted negative and it’s true. In the image, we


predicted that a man is not pregnant and he actually is not.

False Positive (Type 1 Error)- We predicted positive and it’s false. In


the image, we predicted that a man is pregnant but he actually is
not.

False Negative (Type 2 Error)- We predicted negative and it’s false.


In the image, we predicted that a woman is not pregnant but she
actually is.
We discussed Accuracy, now let’s discuss some other metrics of the
confusion matrix

1. Precision — Precision explains how many of the correctly predicted


cases actually turned out to be positive. Precision is useful in the
cases where False Positive is a higher concern than False Negatives.
The importance of Precision is in music or video recommendation
systems, e-commerce websites, etc. where wrong results could lead
to customer churn and this could be harmful to the business.

Precision for a label is defined as the number of true positives


divided by the number of predicted positives.

2. Recall (Sensitivity) — Recall explains how many of the actual


positive cases we were able to predict correctly with our model. It is
a useful metric in cases where False Negative is of higher concern
than False Positive. It is important in medical cases where it doesn’t
matter whether we raise a false alarm but the actual positive cases
should not go undetected!

Recall for a label is defined as the number of true positives divided


by the total number of actual positives.
3. F1 Score — It gives a combined idea about Precision and Recall
metrics. It is maximum when Precision is equal to Recall.

F1 Score is the harmonic mean of precision and recall.

The F1 score punishes extreme values more. F1 Score could be an


effective evaluation metric in the following cases:
 When FP and FN are equally costly.
 Adding more data doesn’t effectively change the outcome
 True Negative is high
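A sketch computing precision, recall, and the F1 score with scikit-learn on the same hypothetical labels used above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (hypothetical)

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```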

4. AUC-ROC — The Receiver Operator Characteristic (ROC) is a


probability curve that plots the TPR(True Positive Rate) against the
FPR(False Positive Rate) at various threshold values and separates
the ‘signal’ from the ‘noise’.

The Area Under the Curve (AUC) is a measure of the ability of a
classifier to distinguish between classes. Geometrically, it is simply
the area enclosed between the ROC curve and the X-axis.

From the graph shown below, the greater the AUC, the better the
performance of the model at different threshold points in separating the
positive and negative classes. This simply means that when AUC is
equal to 1, the classifier is able to perfectly distinguish between all
Positive and Negative class points. When AUC is equal to 0, the
classifier would be predicting all Negatives as Positives and vice
versa. When AUC is 0.5, the classifier is not able to distinguish
between the Positive and Negative classes.

Image Source — https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/

Working of AUC — In a ROC curve, the X-axis shows the False
Positive Rate (FPR) and the Y-axis shows the True Positive Rate (TPR).
A higher X-axis value means a higher number of false positives (FP)
relative to true negatives (TN), while a higher Y-axis value
indicates a higher number of true positives (TP) relative to false negatives (FN).
So, the choice of threshold depends on the ability to balance between FP and FN.
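A sketch of computing the ROC curve and AUC with scikit-learn; the labels and predicted probabilities below are made up for illustration:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                       # actual labels (hypothetical)
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]    # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)      # points of the ROC curve
print(roc_auc_score(y_true, y_scores))                  # area under the curve
```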

5. Log Loss — Log loss (Logistic loss) or Cross-Entropy Loss is one of


the major metrics to assess the performance of a classification
problem.

For a single sample with true label y ∈ {0, 1} and a probability
estimate p = Pr(y = 1), the log loss is:

Log loss = −(y · log(p) + (1 − y) · log(1 − p))

Conclusion
Understanding how well a machine learning model will perform on
unseen data is the main purpose behind working with these
evaluation metrics. Metrics like accuracy, precision, and recall are good
ways to evaluate classification models for balanced datasets, but if
the data is imbalanced then other methods like ROC/AUC perform
better in evaluating model performance.

The ROC curve isn’t just a single number; it’s a whole curve that
provides nuanced details about the behavior of the classifier. However, it is
also hard to quickly compare many ROC curves to each other.
Class Imbalance

Data are said to suffer the Class Imbalance Problem when the class
distributions are highly imbalanced. In this context, many classification
learning algorithms have low predictive accuracy for the infrequent
class. Cost-sensitive learning is a common approach to solve this problem.

Class imbalanced datasets occur in many real-world applications where the


class distributions of data are highly imbalanced. For the two-class case,
without loss of generality, one assumes that the minority or rare class is
the positive class, and the majority class is the negative class. Often the
minority class is very infrequent, such as 1% of the dataset. If one applies
most traditional (cost-insensitive) classifiers on the dataset, they are likely
to predict everything as negative (the majority class). This was often
regarded as a problem in learning from highly imbalanced datasets.
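A minimal sketch of cost-sensitive learning with scikit-learn on a synthetic imbalanced dataset; class_weight="balanced" weights mistakes on the rare class more heavily:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset where the positive class is only about 5% of the samples
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Cost-sensitive learning: errors on the rare (positive) class are penalized more
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.score(X, y))
```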
