Unit 3 Data Exploration (P)
Data Exploration
Understanding data
Data preparation
Key part of two CRISP-DM phases: Data Understanding and Data Preparation
Data quality report
Data mining tasks
Interpreting data mining results
Missing values: 60% (excessive)
Outliers: values that lie far away from the central tendency of a feature
http://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg#mediaviewer/File:Iris_versicolor_3.jpg
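Mean (weighted):
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
(with all $w_i = 1$ this is the ordinary arithmetic mean)
Median:
Middle value if odd number of values, or average of the middle two values otherwise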
Mode:
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
If each data value occurs only once, then there is no mode.
A feature characterized by a multimodal distribution has two or more very commonly occurring ranges of values that are clearly separated.
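The three measures of central tendency above can be sketched with Python's standard `statistics` module. The `values` and `weights` lists are made-up illustration data, not taken from the slides.

```python
import statistics

# Hypothetical data, for illustration only.
values = [2, 3, 3, 5, 7, 7, 7, 9]
weights = [1, 1, 1, 2, 2, 2, 2, 1]

# Arithmetic mean: sum of the values divided by their count.
mean = statistics.mean(values)

# Weighted mean: sum(w_i * x_i) / sum(w_i).
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Median: middle value, or average of the two middle values for an even count.
median = statistics.median(values)

# multimode returns every most-frequent value, so it also covers
# bimodal and trimodal data.
modes = statistics.multimode(values)
bimodal_modes = statistics.multimode([1, 1, 2, 2, 3])

# Note: when every value occurs exactly once, multimode returns all of them;
# under the definition above that case should be treated as "no mode".
print(mean, weighted_mean, median, modes, bimodal_modes)
```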
2/20/2024 internal use
Properties of Normal Distribution Curve
X = {1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1}. What are Q1, Q2 and Q3?
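One way to check an answer to this exercise is sketched below. Quartile conventions differ between textbooks and tools, so the sketch computes two common ones: linear interpolation (`statistics.quantiles` with `method="inclusive"`) and the "median of each half" rule. Which convention the course expects is not stated here, so treat both results as candidates.

```python
import statistics

x = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]

# Convention 1: linear interpolation between sorted data points.
q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")

# Convention 2: Q1 and Q3 are the medians of the lower and upper halves
# of the sorted data (the middle value is excluded when the count is odd).
ordered = sorted(x)
half = len(ordered) // 2
lower_half = ordered[:half]
upper_half = ordered[(len(ordered) + 1) // 2:]
q1_halves = statistics.median(lower_half)
q3_halves = statistics.median(upper_half)

print("interpolated:", q1, q2, q3)
print("median of halves:", q1_halves, statistics.median(ordered), q3_halves)
```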
Datapoint:
Correlation analysis
Covariance analysis
Χ2 (chi-square) test
$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count
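A minimal sketch of the chi-square computation for a 2x2 contingency table follows. The observed counts are hypothetical illustration values. The expected count of each cell is assumed to be (row total x column total) / grand total, which is the usual expectation under independence.

```python
# Hypothetical observed counts: rows and columns are the categories
# of two nominal features.
observed = [
    [250, 200],
    [50, 1000],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
contributions = []
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the two features were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        cell = (obs - expected) ** 2 / expected
        contributions.append(((i, j), cell))
        chi2 += cell

# Cells whose observed count differs most from the expected count
# contribute most to the total.
largest_cell = max(contributions, key=lambda item: item[1])[0]
print(round(chi2, 2), largest_cell)
```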
Correlation does not imply causality
The number of hospitals and the number of car thefts in a city are correlated
Both are causally linked to the third variable: population (?)
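$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{(n-1)\,\sigma_A \sigma_B}$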
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the
respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the
stronger the correlation. rA,B = 0: no linear correlation (this alone does not establish independence); rA,B < 0: negatively correlated
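The correlation coefficient can be sketched in a few lines of Python. The lists `a` and `b` are made-up illustration data. The sketch assumes σA and σB are sample standard deviations (divisor n - 1), which pairs with the (n - 1) factor in the denominator, and it checks that the deviation form and the cross-product form of the numerator agree.

```python
import statistics

# Hypothetical paired observations, for illustration only.
a = [1, 2, 3, 4, 5]
b = [2, 4, 5, 4, 5]

n = len(a)
mean_a = statistics.mean(a)
mean_b = statistics.mean(b)
# Sample standard deviations (divisor n - 1).
sd_a = statistics.stdev(a)
sd_b = statistics.stdev(b)

# Form 1: sum of products of deviations from the means.
numerator_dev = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
r_dev = numerator_dev / ((n - 1) * sd_a * sd_b)

# Form 2: sum of cross-products minus n * mean_a * mean_b.
numerator_cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
r_cross = numerator_cross / ((n - 1) * sd_a * sd_b)

print(round(r_dev, 4), round(r_cross, 4))
```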
Correlation Analysis (viewed as linear relationship)
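Covariance:
$Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})$
Correlation coefficient:
$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of
A and B, and σA and σB are the respective standard deviations of A and B (computed with the same divisor as the covariance).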
Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values. Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is
likely to be smaller than its expected value.
Independence: if A and B are independent, then CovA,B = 0, but the converse is not true:
Some pairs of random variables may have a covariance of 0 but are not independent. Only under
some additional assumptions (e.g., the data follow multivariate normal distributions) does a
covariance of 0 imply independence.
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4,
11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their prices rise or fall
together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
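The rest of this example can be worked through with a short sketch. It assumes the covariance is computed with divisor n, using the standard identity Cov(A,B) = E(A·B) - E(A)·E(B), and cross-checks that against the deviation form.

```python
# Stock values from the example: pairs (A, B) for the five days.
pairs = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]
a = [p[0] for p in pairs]
b = [p[1] for p in pairs]
n = len(pairs)

mean_a = sum(a) / n   # E(A)
mean_b = sum(b) / n   # E(B)

# Shortcut form: E(A*B) - E(A) * E(B).
cov_shortcut = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

# Definition form: mean of the products of deviations.
cov_definition = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

# A positive covariance suggests the two prices tend to rise together.
rise_together = cov_shortcut > 0
print(mean_a, mean_b, round(cov_shortcut, 6), rise_together)
```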
Histograms Often Tell More than Boxplots
Data Visualization - Distribution plot
Scatter plot
Scatter multiple
Scatter matrix
Bubble plot
Density chart
Parallel chart
Deviation chart
Andrews curves
Some data preparation techniques change the way data is represented just to
make it more compatible with certain machine learning algorithms.
Normalization
Binning
Sampling
Normalization techniques can be used to change a continuous feature to fall
within a specified range while maintaining the relative differences between the
values for the feature.
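A minimal sketch of one such technique, range (min-max) normalization, is shown below. The function name `range_normalize` and its `low` / `high` parameters are illustrative choices, not names from the slides.

```python
def range_normalize(values, low=0.0, high=1.0):
    """Linearly rescale values into [low, high], preserving relative differences."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values identical: map everything to the lower bound
        # (an arbitrary choice made for this sketch).
        return [low for _ in values]
    scale = (high - low) / (v_max - v_min)
    return [(v - v_min) * scale + low for v in values]


# Hypothetical feature values.
unit_range = range_normalize([10, 20, 30])
symmetric_range = range_normalize([10, 20, 30], low=-1.0, high=1.0)
print(unit_range, symmetric_range)
```

The smallest value maps to `low`, the largest to `high`, and values in between keep their relative spacing.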
The equal-width binning algorithm splits the range of the feature values into b bins
each of size range/b.
Equal-frequency binning first sorts the continuous feature values into ascending order
and then places an equal number of instances into each bin, starting with bin #1.
The number of instances placed in each bin is simply the total number of instances divided by the
number of bins, b.
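The two binning schemes can be sketched as follows. Function names and the sample data are illustrative assumptions; bins are numbered from 1 to match the description above. When the number of instances is not divisible by b, this sketch spreads the remainder by rank, which is one of several reasonable choices.

```python
def equal_width_bins(values, b):
    """Assign each value a bin number 1..b using bins of width range/b."""
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / b
    bins = []
    for v in values:
        index = int((v - v_min) / width) if width > 0 else 0
        # The maximum value would land in bin b + 1, so clamp it to the last bin.
        bins.append(min(index, b - 1) + 1)
    return bins


def equal_frequency_bins(values, b):
    """Sort ascending, then place roughly len(values)/b instances in each bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) / b
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), b - 1) + 1
    return bins


# Hypothetical skewed feature: one very large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = equal_width_bins(data, 3)
freq_bins = equal_frequency_bins(data, 3)
print(width_bins)
print(freq_bins)
```

On skewed data like this, equal-width binning leaves the middle bin empty while equal-frequency binning keeps the bins balanced.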
A data quality issue is loosely defined as anything unusual about the dataset.
The most common data quality issues are:
- missing values
- irregular cardinality
- outliers
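As a rough illustration, the first two issues can be surfaced by a tiny data quality report that tabulates the percentage of missing values and the cardinality of each feature. The rows, feature names, and the use of `None` for a missing value are all assumptions made for this sketch.

```python
# Hypothetical dataset: each row is a dict, None marks a missing value.
rows = [
    {"age": 34, "city": "Oslo"},
    {"age": None, "city": "Oslo"},
    {"age": 29, "city": None},
    {"age": 51, "city": "Bergen"},
]


def quality_report(rows):
    """Return count, % missing, and cardinality for every feature."""
    report = {}
    for feature in rows[0].keys():
        column = [row[feature] for row in rows]
        present = [v for v in column if v is not None]
        report[feature] = {
            "count": len(column),
            "pct_missing": 100 * (len(column) - len(present)) / len(column),
            # Cardinality: number of distinct non-missing values.
            "cardinality": len(set(present)),
        }
    return report


report = quality_report(rows)
for feature, stats in report.items():
    print(feature, stats)
```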
More on Data Preparation
where ai is a specific value of feature a, and lower and upper are the lower and upper
thresholds.
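The formula these symbols belong to is not shown here. One common transformation that uses a lower and an upper threshold is clamping, where values outside the thresholds are replaced by the nearest threshold; the sketch below assumes that is what is meant. The function name `clamp` and the sample values are illustrative.

```python
def clamp(values, lower, upper):
    """Replace values below `lower` with `lower` and values above `upper` with `upper`."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical feature values with one low and one high outlier.
clamped = clamp([-5, 3, 12], lower=0, upper=10)
print(clamped)
```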