Data Analytics
What Is Data Analytics?
Data analytics is the science of analyzing raw data to make conclusions about that information.
Many of the techniques and processes of data analytics have been automated into mechanical
processes and algorithms that work over raw data for human consumption.
KEY POINTS
Data analytics is the science of analyzing raw data to make conclusions about that information.
The techniques and processes of data analytics have been automated into mechanical processes
and algorithms that work over raw data for human consumption.
Data analytics helps a business optimize its performance.
Understanding Data Analytics
Data analytics is a broad term that encompasses many diverse types of data analysis. Any type of
information can be subjected to data analytics techniques to get insight that can be used to
improve things. Data analytics techniques can reveal trends and metrics that would otherwise be
lost in the mass of information. This information can then be used to optimize processes to
increase the overall efficiency of a business or system.
For example, manufacturing companies often record the runtime, downtime, and work queue for
various machines and then analyze the data to better plan the workloads so the machines operate
closer to peak capacity.
Data analytics can do much more than point out bottlenecks in production. Gaming companies
use data analytics to set reward schedules for players that keep the majority of players active in
the game. Content companies use many of the same data analytics to keep you clicking and
watching, re-organizing content to prompt another view or another click.
Data analytics is important because it helps businesses optimize their performance.
Implementing it into the business model means companies can reduce costs by identifying
more efficient ways of doing business and by storing large amounts of data. A company can also
use data analytics to make better business decisions and help analyze customer trends and
satisfaction, which can lead to new—and better—products and services.
Data Analysis Steps
The process of data analysis involves several steps:
1. The first step is to determine the data requirements or how the data is grouped. Data may be
separated by age, demographic, income, or gender. Data values may be numerical or be divided
by category.
2. The second step is collecting the data. This can be done through a variety of sources such as
computers, online sources, cameras, environmental sources, or personnel.
3. Once the data is collected, it must be organized so it can be analyzed. This may take place on a
spreadsheet or other form of software that can take statistical data.
4. The data is then cleaned up before analysis. This means it is scrubbed and checked to ensure
there is no duplication or error and that it is complete. This step helps correct any errors before
the data goes on to a data analyst to be analyzed; a minimal sketch of the organizing and
cleaning steps appears below.
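As a minimal sketch of steps 3 and 4, here is how organizing and cleaning might look in Python with pandas; the file name, column names, and value ranges are hypothetical.

```python
# Minimal sketch of organizing (step 3) and cleaning (step 4) with pandas.
# The file name and column names below are hypothetical.
import pandas as pd

# Step 3: load the collected data into a tabular structure.
df = pd.read_csv("survey_responses.csv")

# Step 4a: remove exact duplicate rows.
df = df.drop_duplicates()

# Step 4b: drop rows with missing values in required fields.
df = df.dropna(subset=["age", "income"])

# Step 4c: filter out rows with obviously erroneous values.
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

print(f"{len(df)} clean rows ready for analysis")
```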
It’s a common misconception that data analysis and data analytics are the same thing.
The generally accepted distinction is:
Data analytics is the broad field of using data and tools to make business
decisions.
Data analysis, a subset of data analytics, refers to specific actions.
To explain this confusion—and attempt to clear it up—we’ll look at both terms,
examples, and tools.
The act of data analysis is usually limited to a single, already prepared dataset. You'll
inspect, arrange, and question the data. Today, in the 2020s, software usually does a first
round of analysis, often directly in one of your databases or tools. But this is augmented
by a human who investigates and interrogates the data with more context.
When you're done analyzing a dataset, you'll turn to other data analytics activities.
Which is better?
Brack Nelson, Marketing Manager at Incrementors SEO Services, suggests that the
outcome of data analytics is more encompassing and beneficial than the output of data
analysis alone.
Mathematical model
A mathematical model is an abstract model that uses mathematical language
to describe the behavior of a system.
Mathematical modelling is the process of describing a real world problem in
mathematical terms, usually in the form of equations, and then using these equations
both to help understand the original problem, and also to discover new features about
the problem.
Suppose you are building a rectangular sandbox for your neighbor's toddler to play in,
and you have two options available based on the building materials you have. The
sandbox can have a length of 8 feet and a width of 5 feet, or a length of 7 feet and a
width of 6 feet, and you want the sandbox to have as large an area as possible. In other
words, you want to determine which dimensions will result in the larger area of a
rectangle. Thankfully, in mathematics, we have a formula for the area (A) of a rectangle
based on its length (l) and width (w).
A=l×w
Awesome! We can use this formula to figure out which dimensions will make a bigger
sandbox!
We can calculate the two areas by plugging in our lengths and widths for each choice:
A1 = 8 × 5 = 40 square feet
A2 = 7 × 6 = 42 square feet
We see that a length of 7 feet and a width of 6 feet will result in the larger area of 42
square feet. Problem solved!
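The same comparison takes only a few lines of Python; the snippet below simply applies A = l × w to both options.

```python
# Compare the two sandbox options using A = l * w.
options = [(8, 5), (7, 6)]  # (length, width) in feet

for length, width in options:
    print(f"{length} ft x {width} ft -> {length * width} sq ft")

best = max(options, key=lambda lw: lw[0] * lw[1])
print(f"Larger area: {best[0]} ft x {best[1]} ft")
```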
Linear Models
A linear model is an equation that describes a relationship between two quantities that
show a constant rate of change.
Example
The table below shows the cost of an ice cream cone y with x toppings. Write an
equation that models the relationship between x and y.

Toppings (x)    Cost (y)
2               $3.50
3               $3.75
5               $4.25
Answer
Each additional topping adds $0.25 to the cost (for example, going from 2 to 3 toppings
raises the price from $3.50 to $3.75). If an ice cream cone with two toppings costs $3.50
and each topping costs $0.25, then a cone without any toppings must cost $3.00.
Therefore, the rate of change is 0.25, the initial value is 3, and y = 0.25x + 3,
where x is the number of toppings.
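As a quick check, a short Python sketch can recover the slope and intercept from the table and verify the model against every row; the numbers come straight from the table above.

```python
# Recover y = 0.25x + 3 from the toppings/cost table and verify it.
toppings = [2, 3, 5]
costs = [3.50, 3.75, 4.25]

# Rate of change (slope) from the first two rows.
slope = (costs[1] - costs[0]) / (toppings[1] - toppings[0])  # 0.25
# Initial value (intercept): the cost with zero toppings.
intercept = costs[0] - slope * toppings[0]                   # 3.00

for x, y in zip(toppings, costs):
    assert abs((slope * x + intercept) - y) < 1e-9  # model matches every row
print(f"y = {slope}x + {intercept}")
```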
Empirical models
Empirical models are supported only by experimental data.
Empirical modelling is a generic term for activities that create models
by observation and experiment; it relies on observation rather than theory.
Example
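A minimal sketch of empirical modelling, assuming made-up observations: a straight line is fitted to measured points with NumPy, with no theory about why the relationship holds.

```python
# Empirical model: fit a line to observed data, with no underlying theory.
# The measurements below are made up for illustration.
import numpy as np

hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_scores = np.array([52.0, 60.0, 71.0, 78.0, 85.0])

# Least-squares fit of a degree-1 polynomial (a straight line).
slope, intercept = np.polyfit(hours_studied, test_scores, deg=1)
print(f"score = {slope:.1f} * hours + {intercept:.1f}  (empirical fit)")
```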
Mechanistic Model
A mechanistic model uses a theory to predict what will happen in the real
world. The alternative approach, empirical modeling, studies real-world
events to develop a theory.
Mechanistic models are useful if you have well-established theory for making predictions.
For example, if you're designing a new plane, there's lots of information
on how plane design affects the plane's interaction with air pressure, wind
speed and gravity. You'd want to make some empirical tests before taking
passengers aboard, but mechanistic models can give you a good start.
Deterministic models
A deterministic model assumes there is no variation in results: it has no
probabilistic (random) elements. Its output is determined once the set of
inputs and relationships in the model has been specified.
In other words, a deterministic model assumes certainty in all aspects.
For example, the conversion between Celsius and Kelvin is deterministic,
because the formula is not random; it is an exact formula that will always
give you the correct answer (assuming you perform the calculation
correctly):
Kelvin = Celsius + 273.15.
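A minimal sketch in Python: the function below always returns the same output for the same input, which is exactly what makes the model deterministic.

```python
# Deterministic model: the same input always yields the same output.
def celsius_to_kelvin(celsius: float) -> float:
    return celsius + 273.15

print(celsius_to_kelvin(25.0))  # always 298.15, no randomness involved
```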
Stochastic Models
For a model to be stochastic, it must have a random variable where a level of
uncertainty exists. Due to the uncertainty present in a stochastic model, the results
provide an estimate of the probability of various outcomes.
Example
Stochastic investment models attempt to forecast the variations of prices,
returns on assets (ROA), and asset classes—such as bonds and stocks—
over time.
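A minimal sketch of a stochastic model, assuming a hypothetical starting price and made-up return parameters: every run produces a different path because of the random term.

```python
# Stochastic model: a random daily return makes each simulated path differ.
import random

price = 100.0  # hypothetical starting price
for day in range(30):
    daily_return = random.gauss(0.0005, 0.02)  # random drift + volatility
    price *= 1 + daily_return

print(f"Simulated price after 30 days: {price:.2f}")
```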
Black Box Model
In science, computing, and engineering, a black box is a device, system, or
object which produces useful information without revealing any information
about its internal workings. The explanations for its conclusions remain
opaque or “black.”
Financial analysts, hedge fund managers, and investors may use software
that is based on a black-box model in order to transform data into a useful
investment strategy.
Black box is shorthand for models that are sufficiently complex that they are
not straightforwardly interpretable to humans.
A black box model receives inputs and produces outputs but its
workings are unknowable.
Black box models are increasingly used to drive decision-making in the
financial markets.
Technology advances, particularly in machine learning capabilities,
can make it impossible for a human mind to analyze or understand
precisely how black box models produce their conclusions.
Machine learning techniques have greatly contributed to the growth and
sophistication of black box models.
Descriptive modeling
Descriptive modeling is a mathematical process that describes real-world events and the
relationships between factors responsible for them. The process is used by consumer-
driven organizations to help them target their marketing and advertising efforts.
Descriptive modeling can help an organization to understand its customers, but predictive
modeling is necessary to facilitate the desired outcomes.
Evaluation Metrics
Introduction
Evaluation metrics are tied to machine learning tasks. There are
different metrics for the tasks of classification and regression. Some
metrics, like precision-recall, are useful for multiple tasks.
Classification and regression are examples of supervised learning,
which constitutes a majority of machine learning applications. By using
different metrics for performance evaluation, we should be able to
improve our model's overall predictive power before we roll it out for
production on unseen data. Relying on accuracy alone, without a proper
evaluation using other metrics, can lead to problems when the model is
deployed on unseen data and may end in poor predictions.
In the next section, I'll discuss the classification evaluation metrics
that can help assess how well an ML classification model generalizes.
Classification Metrics
Classification is about predicting class labels given input data. In
binary classification, there are only two possible output classes (i.e., a
dichotomy). In multiclass classification, more than two classes can be
present. I'll focus only on binary classification.
Figure: Email spam detection is a binary classification problem (source: Evaluating Machine Learning Models, O'Reilly).
When a model gives an accuracy rate of 99%, you might think the model is
performing very well, but this is not always true and can be misleading in
some situations. For instance, if only 1% of emails are spam, a model that
labels every email as "not spam" is 99% accurate yet never catches a single
spam message.
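A minimal sketch of that pitfall, using a made-up set of 100 emails in which only one is spam:

```python
# Accuracy pitfall: always predicting "not spam" looks 99% accurate
# on this imbalanced data, yet the one spam email is never caught.
y_true = [0] * 99 + [1]   # 99 legitimate emails, 1 spam
y_pred = [0] * 100        # the model predicts "not spam" every time

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.0%}")  # 99%
```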
The Area Under the Curve (AUC) measures the ability of a classifier to
distinguish between classes. Geometrically, it is the area enclosed
between the ROC curve and the axes.
The greater the AUC, the better the model performs at separating the
positive and negative classes across different threshold points. When AUC
equals 1, the classifier can perfectly distinguish between all positive and
negative class points. When AUC equals 0, the classifier predicts all
negatives as positives and vice versa. When AUC is 0.5, the classifier is
not able to distinguish between the positive and negative classes.
How AUC works: in a ROC curve, the X-axis shows the False Positive Rate
(FPR) and the Y-axis shows the True Positive Rate (TPR). A higher X value
means more false positives (FP) relative to true negatives (TN), while a
higher Y value means more true positives (TP) relative to false negatives
(FN). The choice of threshold therefore depends on how you need to balance
FP against FN.
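A minimal sketch of computing the ROC curve and AUC with scikit-learn, using made-up labels and classifier scores:

```python
# Compute FPR/TPR at each threshold and the overall AUC.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]  # model scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR per threshold:", fpr)
print("TPR per threshold:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))
```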
Conclusion
Understanding how well a machine learning model will perform on unseen
data is the main purpose behind working with these evaluation metrics.
Metrics like accuracy, precision, and recall are good ways to evaluate
classification models on balanced datasets, but if the data is imbalanced
then other methods like ROC/AUC do a better job of evaluating model
performance.
The ROC curve isn't just a single number; it's a whole curve that provides
nuanced details about the behavior of the classifier. This also makes it
hard to quickly compare many ROC curves to each other.
Class Imbalance
Data are said to suffer from the class imbalance problem when the class
distribution is highly imbalanced. In this context, many classification
learning algorithms have low predictive accuracy for the infrequent class.
Cost-sensitive learning, which makes errors on the rare class more costly
during training, is a common approach to this problem.
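A minimal sketch of cost-sensitive learning with scikit-learn, on a synthetic imbalanced dataset: class_weight="balanced" re-weights training errors so that mistakes on the rare class cost more.

```python
# Cost-sensitive learning: weight errors on the rare class more heavily.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic dataset where only ~5% of samples belong to the positive class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

# Recall on the rare (positive) class is the metric that matters here.
print("Rare-class recall:", recall_score(y, clf.predict(X)))
```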