1.4 Probability Statistics II
Data Handling Process
Exploratory Data Analysis (EDA)
● An approach/philosophy for data analysis that employs a variety
of techniques (mostly graphical) to
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.
● The EDA approach is precisely that: an approach. It is not a fixed set of
techniques, but an attitude/philosophy about how a data analysis should be
carried out.
● EDA is not identical to statistical graphics although the
two terms are used almost interchangeably.
● Most EDA techniques are graphical in nature with a few
quantitative techniques.
What Do EDA Tools Consist Of?
● EDA consists of various techniques for:
1. Plotting the raw data (such as data traces, histograms,
bihistograms, probability plots, lag plots, block plots, and
Youden plots).
2. Plotting simple statistics such as mean plots, standard
deviation plots, box plots, and main effects plots of the raw
data (a brief matplotlib sketch follows this list).
3. Positioning such plots so as to maximize our natural
pattern-recognition abilities, such as using multiple plots per
page.
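As a small illustration of items 1 and 2 above, here is a minimal matplotlib sketch; the batch data are simulated purely for illustration and are not taken from any data set in these slides.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
batches = [rng.normal(loc=mu, scale=1.0, size=50) for mu in (10, 12, 11)]  # three simulated batches

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(np.concatenate(batches), bins=20)   # plot of the raw data: histogram
ax1.set_title("Histogram of the raw data")
ax2.boxplot(batches)                         # plot of simple statistics: one box plot per batch
ax2.set_title("Box plots by batch")
plt.tight_layout()
plt.show()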
Data Analysis Approaches
● Three popular data analysis approaches are:
1. Classical
2. Exploratory (EDA)
3. Bayesian
● These three approaches are similar in that they all start with a general
science/engineering problem and all yield science/engineering conclusions.
● The difference is the sequence and focus of the intermediate steps.
● For Classical analysis, the sequence is
● Problem => Data => Model => Analysis => Conclusions
● For EDA, the sequence is
● Problem => Data => Analysis => Model => Conclusions
● For Bayesian, the sequence is
● Problem => Data => Model => Prior Distribution => Analysis => Conclusions
Differences
● For classical analysis, data collection is followed by the imposition
of a model (normality, linearity, etc.) and the analysis, estimation,
and testing that follows are focused on parameters of that model.
● For EDA, the data collection is not followed by a model imposition;
rather it is followed immediately by analysis with a goal of inferring
what model would be appropriate.
● For a Bayesian analysis, the analyst attempts to incorporate
scientific/engineering knowledge/expertise into the analysis by
imposing a data independent distribution on the parameters of the
selected model; the analysis thus consists of formally combining
both the prior distribution on the parameters and the collected data
to jointly make inferences and/or test assumptions about the model
parameters.
Techniques
● Classical techniques are generally quantitative in nature.
● They include ANOVA, t tests, chi-squared tests, and F tests.
● EDA techniques are generally graphical.
● They include scatter plots, character plots, box plots,
histograms, bihistograms, probability plots, residual plots,
and mean plots.
Conditional Probability
Multiplication of Probabilities
Independent Events
Bayes’ Theorem
• The conditional probabilities commonly provide the probability of an
event (such as failure) given a condition (such as high or low
contamination).
• But after a random experiment generates an outcome, we are naturally
interested in the probability that a condition was present (high
contamination) given an outcome (a semiconductor failure).
• Thomas Bayes addressed this essential question in the 1700s and
developed the fundamental result known as Bayes’ theorem.
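A small numerical sketch of Bayes' theorem for the contamination/failure setting above; all probabilities are made up purely for illustration.
# Hypothetical inputs: P(failure | contamination level) and P(high contamination)
p_fail_given_high = 0.10
p_fail_given_low = 0.005
p_high = 0.20

# Total probability of a failure
p_fail = p_fail_given_high * p_high + p_fail_given_low * (1 - p_high)

# Bayes' theorem: probability that contamination was high, given that a failure occurred
p_high_given_fail = p_fail_given_high * p_high / p_fail
print(p_high_given_fail)   # about 0.833 with these made-up numbers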
Bayes' Theorem for Multiple Events
PMF of Discrete RV
• The probability distribution of a random variable X is a description of
the probabilities associated with the possible values of X.
• For a discrete random variable, the distribution is often specified by
just a list of the possible values along with the probability of each.
• In some cases, it is convenient to express the probability in terms of
a formula.
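A short sketch of both ways of specifying a PMF (an explicit list of values with probabilities, or a formula); the numbers and the binomial example are chosen only for illustration.
# PMF given as an explicit list of values and probabilities
values = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]            # non-negative and summing to 1
pmf = dict(zip(values, probs))          # f(x) = P(X = x)
print(pmf[2])                           # P(X = 2) = 0.4

# PMF given by a formula, e.g. a binomial random variable with n = 3, p = 0.5
from math import comb

def binom_pmf(x, n=3, p=0.5):
    return comb(n, x) * p**x * (1 - p)**(n - x)

print([binom_pmf(x) for x in range(4)])  # [0.125, 0.375, 0.375, 0.125]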
CDF of Discrete RV
Counter in Dictionary
empty_dict = {}                      # a dict maps keys to values
grades = {"Joel": 80, "Tim": 95}
joels_grade = grades["Joel"]

def mean(x):
    return sum(x) / len(x)

# The same function with type hints:
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

mean(num_friends)                    # num_friends is defined in a later snippet
Classification: classify a set of points (figure showing points of class A omitted)
Clustering
Regression
Dimensionality Reduction
● In machine learning, whether the algorithm is classification or
regression, data are used as inputs and fed to the learner for
decision-making.
● Ideally, there should be no need for feature extraction or selection as a
separate process; the classifier (or regressor) should be able to use
whichever features are necessary and discard the irrelevant ones.
● In most learning algorithms, the complexity is based on the number
of input dimensions, as well as on the size of the data sample, and
for reduced memory and computation, we are interested in
reducing the dimensionality of the problem.
● Dimension reduction also reduces the complexity of the
learning algorithm during testing.
Dimensionality Reduction
● Also, if an input is not informative, we can save the cost of
extracting it.
● Simple models are more robust on small datasets.
● Simple models have less variance; that is, they depend less on the
particulars of specific samples, including outliers, noise, etc.
● If data can be represented with fewer features, we gain a
better idea of the process underlying the data, and this
allows knowledge extraction.
● If data can be described by fewer dimensions without loss of
information, it can be plotted and analyzed visually for structure
and outliers.
Dimensionality Reduction
● In situations where the data have a huge number of features,
it is often necessary to reduce their dimension or to find a
lower-dimensional representation that preserves some of their properties.
● Therefore, dimensionality reduction (or manifold learning):
1. Speeds up subsequent operations on the data.
2. Enables better visualization of data for tentative analysis by
mapping the input data into two- or three-dimensional spaces.
3. Extracts features to produce a smaller and more efficient,
informative, or valuable set of features (see the PCA sketch after this list).
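A minimal sketch of the PCA route to dimensionality reduction with scikit-learn, referenced in item 3 above; the 10-feature data set is randomly generated for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))           # 300 samples, 10 features (simulated)

pca = PCA(n_components=2)                # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (300, 2) -- ready for plotting and visual analysis
print(pca.explained_variance_ratio_)     # fraction of variance retained by each component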
Classical Example
● If the goal of the analysis is to compute summary
statistics plus determine the best linear fit for Y as a
function of X, the results might be given as:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
● The above quantitative analysis, although valuable,
gives us only limited insight into the data.
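As a sketch of how such results could be computed, the snippet below uses scipy.stats.linregress; the x and y arrays are the well-known Anscombe first data set, used here only as a stand-in because it reproduces the quoted statistics (the slide's actual data may differ).
import numpy as np
from scipy import stats

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

print("N =", len(x))
print("Mean of X =", x.mean(), " Mean of Y =", round(y.mean(), 1))

fit = stats.linregress(x, y)                  # least-squares line Y = intercept + slope * X
print("Intercept =", round(fit.intercept, 3), " Slope =", round(fit.slope, 3))
print("Correlation =", round(fit.rvalue, 3))

residuals = y - (fit.intercept + fit.slope * x)
print("Residual SD =", round(float(np.sqrt(residuals.var(ddof=2))), 3))  # ddof=2: two fitted parameters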
Simple Scatter Plot Gives Us Insights
1) The data set "behaves like" a linear curve
with some scatter;
2) There is no justification for a more
complicated model (e.g., quadratic);
3) There are no outliers;
4) The vertical spread of the data appears to
be of equal height irrespective of the
X-value; this indicates that the data are
equally-precise throughout and so a
"regular" (that is, equi-weighted) fit is
appropriate.
Obtain the Summary Statistics Again, and Also Plot the Data
● In the lab, draw scatter plots for the previous data set and for the data below.
from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o',
linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
from collections import Counter
#A Counter is a dict subclass for counting hashable objects.
import matplotlib.pyplot as plt
num_friends = [100, 49, 41, 40, 25, 24, 55, 5, 10, 14, 18, 17, 20]
friend_counts = Counter(num_friends)
xs = range(101)                      # largest value is 100
ys = [friend_counts[x] for x in xs]  # number of people with x friends
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
num_points = len(num_friends)
largest_value = max(num_friends)
smallest_value = min(num_friends)
sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]    # second smallest is sorted_values[1], and so on
friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
plt.scatter(friends, minutes)
# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),  # put the label with its point
                 xytext=(5, -5),                   # but slightly offset
                 textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
Continuous Random Variable
• A continuous random variable is a random variable with an interval
(either finite or infinite) of real numbers for its range.
• The model provides for any precision in length measurements.
• Because the number of possible values of X is uncountably infinite, X
has a distinctly different distribution from the discrete random
variables.
• A probability density function or PDF f(x) can be used to describe the
probability distribution of a continuous random variable X.
• If an interval is likely to contain a value for X, its probability is large
and it corresponds to large values for f(x).
• The probability that X is between a and b is determined as the integral
of f(x) from a to b.
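A minimal sketch of P(a < X < b) as the integral of a pdf, using a hypothetical exponential density and SciPy's numerical integration.
import numpy as np
from scipy import integrate

lam = 2.0                                  # hypothetical rate parameter
f = lambda x: lam * np.exp(-lam * x)       # pdf of an exponential random variable (x >= 0)

a, b = 0.5, 1.5
prob, _ = integrate.quad(f, a, b)          # P(a < X < b) = integral of f from a to b
print(prob)                                # analytically exp(-1) - exp(-3), about 0.318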
PDF of Continuous Random Variable
Mean of Continuous Random Variable
Normal Distribution
•The most widely used model for a continuous measurement is a
normal random variable, i.e., the normal distribution.
•Whenever a random experiment is replicated, the random
variable that equals the average (or total) result over the
replicates tends to have a normal distribution as the number
of replicates becomes large.
•De Moivre presented this fundamental result, known as the
central limit theorem, in 1733.
•Unfortunately, his work was lost for some time, and Gauss
independently developed a normal distribution nearly 100
years later.
Example
•Assume that the deviation (or error) in the length of a
machined part is the sum of a large number of infinitesimal
effects, such as temperature and humidity drifts, vibrations,
cutting angle variations, cutting tool wear, bearing wear,
rotational speed variations, mounting and fixture variations,
variations in numerous raw material characteristics, and
variation in levels of contamination.
•If the component errors are independent and equally likely to
be positive or negative, the total error can be shown to have
an approximate normal distribution.
● Random variables with different means and variances can be modeled
by normal probability density functions with appropriate choices of the
center and width of the curve.
● The value of E(X) = μ determines the center of the probability density
function, and the value of V(X) = σ² determines the width.
Normal probability density functions for selected values of the parameters μ and σ²
Probability that X > 13 for a normal random variable with μ = 10 and σ² = 4
Standardizing a Normal Random Variable
• Creating a new random variable Z = (X − μ)/σ by this transformation is
referred to as standardizing.
• The random variable Z represents the distance of X from its
mean in terms of standard deviations.
• It is the key step to calculating a probability for an arbitrary
normal random variable.
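A short sketch of standardizing, tied to the earlier slide about P(X > 13) for μ = 10 and σ² = 4, using scipy.stats.norm.
from scipy import stats

mu, sigma = 10.0, 2.0                         # sigma = sqrt(4)
x = 13.0
z = (x - mu) / sigma                          # Z = (X - mu) / sigma: distance from the mean in SD units
print(z)                                      # 1.5

print(1 - stats.norm.cdf(z))                  # P(Z > 1.5) via the standard normal
print(stats.norm.sf(x, loc=mu, scale=sigma))  # P(X > 13) directly; both are about 0.0668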
Central Limit Theorem
•The simplest form of the central limit theorem states that the
sum of n independently distributed random variables will
tend to be normally distributed as n becomes large.
•It is a necessary and sufficient condition that none of the
variances of the individual random variables are large in
comparison to their sum.
•There are more general forms of the central theorem that
allow infinite variances and correlated random variables,
and there is a multivariate version of the theorem.
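A simulation sketch of the central limit theorem: sums of n independent uniform random variables look approximately normal; the choices n = 30 and U(0, 1) are arbitrary.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 30
sums = rng.uniform(0, 1, size=(10_000, n)).sum(axis=1)   # 10,000 replicates of a sum of n uniforms

plt.hist(sums, bins=50, density=True)
plt.title("Sums of 30 U(0, 1) variables: approximately normal")
plt.show()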
Expectation
Variance
Covariance
● The covariance between two RVs X and Y measures the
degree to which X and Y are (linearly) related.
● Covariance is defined as Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].
Joint Probability Distributions and the
Sign of Covariance Between X and Y
Gaussian (Normal) Distribution Revisited
Inverse Variation of Normal Distribution
Why Normal Distribution
● First, many distributions we wish to model are truly close to
being normal distributions.
● The central limit theorem shows that the sum of many
independent random variables is approximately normally
distributed.
● This means that in practice many complicated systems can be
modeled successfully as normally distributed noise.
Why Normal Distribution
● The normal distribution is unique in that it maximizes entropy
(or uncertainty) among all distributions with the same mean and
variance. This means that, for any given variance, the normal
distribution is the most "spread out", least informative choice,
i.e., it encodes no prior knowledge beyond the given mean and variance.
● When comparing it to other distributions with the same
variance (such as a uniform distribution, exponential
distribution, etc.), the normal distribution is often considered
the "most typical" or "standard" distribution because of this
maximization of entropy.
PDF of Bivariate Gaussian Distribution
Correlation
● Covariances can be between negative and positive infinity.
● Sometimes it is more convenient to work with a normalized
measure, with a finite lower and upper bound.
● The (Pearson) correlation coefficient between X and Y is
defined as ρ = corr[X, Y] = Cov[X, Y] / (σX σY), i.e., the covariance
divided by the product of the standard deviations, where −1 ≤ ρ ≤ 1.
Why Correlation?
● One can also show that corr[X, Y] = 1 if and only if Y = aX + b
for some parameters a and b, i.e., if there is a linear
relationship between X and Y.
● The regression coefficient is given by a = Cov[X, Y] / V[X].
● The correlation reflects the noisiness and direction of a linear
relationship, but not the slope of that relationship, nor many
aspects of nonlinear relationships.
Different sets of (x, y) points, with the correlation coefficient of x and y for each set
Correlation Matrix
● In the case of a vector x of d related random variables, the correlation
matrix is the matrix whose (i, j) entry is corr[Xi, Xj]; its diagonal entries are all 1.
Uncorrelated does not imply independent
● If X and Y are independent, meaning p(X, Y) = p(X)p(Y),
then Cov[X, Y] = 0, and hence corr[X, Y] = 0.
● So independent implies uncorrelated.
● However, the converse is not true: uncorrelated does not
imply independent.
● For example, let X ∼ U(−1, 1) and Y = X².
● Clearly Y is dependent on X (in fact, Y is uniquely
determined by X), yet corr[X, Y] = 0.
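A quick numerical check of this example (X uniform on (−1, 1), Y = X²) using NumPy's sample covariance and correlation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)
y = x**2                                 # Y is completely determined by X

print(np.cov(x, y)[0, 1])                # sample covariance: close to 0
print(np.corrcoef(x, y)[0, 1])           # sample correlation: close to 0 (exactly 0 in theory)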
Correlation does not imply causation
● It is well known that “correlation does not imply causation”.
● The classic example is the strong correlation, over time, between ice cream
sales and murder rates. Indeed, it is sometimes claimed that “eating ice cream
causes murder”. This is just a spurious correlation, due to a hidden common
cause, namely the weather: hot weather increases ice cream sales, for obvious
reasons, and it also tends to increase violent crime.
Simpson’s Paradox
● Simpson's paradox says that a statistical trend or relationship that appears in
several different groups of data can disappear or reverse sign when these
groups are combined.
● This results in counterintuitive behavior if we misinterpret claims of
statistical dependence in a causal way.
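A tiny worked example with made-up counts: treatment A has the higher success rate inside each group, yet the lower success rate once the groups are combined.
groups = {
    "group 1": {"A": (8, 10),  "B": (70, 90)},   # (successes, trials), hypothetical counts
    "group 2": {"A": (30, 90), "B": (3, 10)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for name, g in groups.items():
    for t, (s, n) in g.items():
        totals[t][0] += s
        totals[t][1] += n
        print(f"{name}, treatment {t}: {s}/{n} = {s/n:.1%}")

for t, (s, n) in totals.items():
    print(f"combined, treatment {t}: {s}/{n} = {s/n:.1%}")   # the ranking of A and B reverses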
Plot of the cdf and pdf for the standard normal, N(0, 1)
import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt
# sklearn.datasets.samples_generator was removed; make_blobs now lives in sklearn.datasets
from sklearn.datasets import make_blobs

# Create a dataset of 300 points grouped around 4 centers
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
plt.show()

# Descriptive statistics of the second feature
# (SciPy's old sp.mean / sp.std / sp.median aliases are deprecated; use NumPy instead)
X_mean = np.mean(X[:, 1])
print('Mean =', X_mean)
X_SD = np.std(X[:, 1])
print('SD =', X_SD)
X_median = np.median(X[:, 1])
print('Median =', X_median)
X_skewness = stats.skew(X[:, 1])
print('Skewness =', X_skewness)
X_kurtosis = stats.kurtosis(X[:, 1])
print('Kurtosis =', X_kurtosis)
Steps during EDA
∙ Descriptive statistics, charts, plots, and visualizations can be
utilized to look at the various data attributes and find relations
and correlations.
∙ Once data is collected, you need to make sure it is in a
usable format.
∙ Some algorithms require features in a specific format, while others
can handle target variables and features of various types
(strings, integers, etc.).
∙ Data preprocessing, cleaning, wrangling, and initial exploratory
data analysis are then carried out.
To-Do Tasks in EDA
1. Explore, describe, and visualize data attributes.
2. Choose data and attribute subsets, which seem the most crucial
for the problem.
3. Carry out broad assessments to find relationships and
associations and to test hypotheses.
4. Note missing data points, if any.
(Data quality analysis is the final step in the data understanding stage
in which the quality of data is analyzed in the datasets and potential
shortcomings, errors, and issues are determined.)
Data Quality Analysis
∙ The data can be checked to determine if any pattern is obvious or if
a few data points are massively different from the rest of the data.
∙ Plotting data in different dimensions might help.
∙ The focus of data quality analysis includes the following:
Missing values
Inconsistent values
Wrong information due to data errors (manual/automated)
Wrong metadata information
Next Steps
1. Data preparation for the model
2. Data integration – merging different datasets together
(attributes)
3. Data wrangling
Data Wrangling
● The process of data wrangling includes data processing,
normalization, cleaning, and formatting.
● Data in its raw form can hardly be used directly by machine learning
techniques to build models.
Major Tasks in Data Wrangling
∙ Managing missing values (remove rows, impute missing values)
∙ Managing data inconsistencies (delete rows, attributes, fix
inconsistencies)
∙ Correcting inappropriate metadata and annotations
∙ Managing unclear attribute values
∙ Arranging and formatting data into necessary formats (CSV,
JSON, relational)
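A minimal data-wrangling sketch with pandas covering a few of the tasks above (dropping rows, imputing missing values, writing CSV); the DataFrame columns and values are hypothetical.
import numpy as np
import pandas as pd

# Hypothetical raw data with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 41, 37, np.nan],
    "income": [40000, 52000, None, 61000, 48000],
    "city":   ["east", "west", "west", None, "east"],
})

df_drop = df.dropna()                                            # option 1: remove rows with missing values

df_fill = df.copy()                                              # option 2: impute missing values
df_fill["age"] = df_fill["age"].fillna(df_fill["age"].median())
df_fill["income"] = df_fill["income"].fillna(df_fill["income"].mean())
df_fill["city"] = df_fill["city"].fillna("unknown")

df_fill.to_csv("clean_data.csv", index=False)                    # arrange into a required format (CSV)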
Next: Feature scaling and feature extraction
● In this stage important features or attributes are extracted from the
raw data or new features are created from existing features.
● Data features frequently should be scaled or normalized to avoid
producing biases with machine learning algorithms.
● Moreover, it is often necessary to choose a subset of all existing
features based on feature quality and importance.
● In situations where the data have a huge number of features, it is
often necessary to reduce their dimension or to find a
lower-dimensional representation that preserves some of their properties.
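A short feature-scaling sketch with scikit-learn's StandardScaler and MinMaxScaler; the feature matrix is hypothetical.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical feature matrix: the two columns live on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
X_01  = MinMaxScaler().fit_transform(X)     # rescaled to the [0, 1] range per feature
print(X_std)
print(X_01)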
Types of Features
● Consider two features, one describing a person’s age and the other their house
number.
● Both features map into the integers, but the way we utilize those features can be
rather different.
● House numbers, unlike ages, do not lie on a meaningful linear scale: arithmetic
on them makes little sense. So both features are numbers, but they are used quite differently.
● Statistical Features:
● Numerous statistical features can be extracted from each subsample data point,
as they are the main distinguishing values to describe the distribution of the data.
● These features are the minimum, maximum, mean, median, mode, standard
deviation, variance, first quartile, third quartile, and interquartile range (IQR) of
the data vector (a short extraction sketch follows below).
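A sketch of extracting the statistical features listed above from a hypothetical data vector with NumPy and SciPy.
import numpy as np
from scipy import stats

x = np.array([4.0, 7.0, 1.0, 9.0, 3.0, 8.0, 5.0, 7.0, 2.0, 10.0])   # hypothetical data vector

q1, q3 = np.percentile(x, [25, 75])
features = {
    "min": x.min(), "max": x.max(),
    "mean": x.mean(), "median": np.median(x),
    "mode": stats.mode(x, keepdims=False).mode,    # SciPy >= 1.9 syntax
    "std": x.std(ddof=1), "var": x.var(ddof=1),
    "Q1": q1, "Q3": q3, "IQR": q3 - q1,
}
print(features)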
Statistics or Aggregates
● The varieties of calculations on features are generally stated as
statistics or aggregates.
● Three main types are shape statistics, statistics of dispersion, and
statistics of central tendency.
● Each of these can be represented either as a tangible property of a
given sample (sample statistics) or a hypothetical property of an
unknown population.
● The statistical values—namely, mean, standard deviation,
skewness, and kurtosis—are generally utilized to reduce the
dimension of data.
● The first and second-order statistics are critical in data analysis.
● On the other hand, second-order statistics are not enough for many time
series data.
● Hence, higher-order statistics should also be used for a better description
of the data.
● While the first- and second-order statistics correspond to the mean and
variance, the higher-order statistics correspond to higher-order moments.
● Higher-order statistics (HOS) denote the cumulants with orders of three
and higher-order computed numbers, which are linear combinations of
lower-order moments and lower-order cumulants.
Structured Features
● We create an instance vector from the features.
● Defining an instance with its vector of feature values is called an abstraction,
which is the result of filtering out redundant information.
● Features that work on structured instance spaces are called structured
features.
● These can be built either prior to learning a model or simultaneously with it.
● A significant characteristic of structured features is that they involve local
variables, which denote objects other than the instance itself.
● Nevertheless, it is possible to employ other forms of aggregation over local
variables.
● For example, in propositionalisation the features are translated from first-order
logic to propositional logic without local variables.
● The main challenge here is how to deal with the combinatorial explosion of the
number of potential features.
Feature Transformations
● The objective is to improve the effectiveness of a feature by
eliminating, changing, or adding information.
● The best-known feature transformations are those that turn a
feature of one kind into the next kind down the ordering
quantitative → ordinal → categorical → Boolean.
● Transformations also change the scale of quantitative features or
add a scale (or order) to ordinal, categorical, and Boolean
features.
● The simplest feature transformations are entirely deductive in
the sense that they achieve a well-defined result.
Binarization
● Binarization transforms a categorical feature into a set of Boolean
features, one for each value of the categorical feature.
● This loses information since the values of a single categorical
feature are mutually exclusive but are sometimes required if a
model cannot handle more than two feature values.
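A minimal binarization sketch using pandas get_dummies on a hypothetical categorical feature; scikit-learn's OneHotEncoder would serve the same purpose.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})   # hypothetical categorical feature
binary = pd.get_dummies(df["color"], prefix="color")
print(binary)
# One Boolean column per category value (color_blue, color_green, color_red);
# exactly one of them is true in each row, which is why the mutual exclusivity
# of the original categorical values is no longer expressed in a single column.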
NumPy Boolean Indexing
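Since the slide's own example is not reproduced here, a minimal sketch of NumPy Boolean indexing:
import numpy as np

x = np.array([3, -1, 7, 0, -5, 12])
mask = x > 0            # element-wise comparison gives a Boolean array
print(mask)             # [ True False  True False False  True]
print(x[mask])          # [ 3  7 12] -- keeps only elements where the mask is True

x[x < 0] = 0            # Boolean indexing also works for assignment
print(x)                # [ 3  0  7  0  0 12]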