Ebook - Data Analytics Course
Module 1
1
Table of Contents
1. Statistical Modeling
2. Measures of Central Tendency
3. Measures of Variability
4. Correlation & Regression
5. The Normal Distribution
2
STATISTICAL
MODELING
What is Statistical Modeling?
What are the advantages?
What are the disadvantages?
3
STATISTICAL MODELING: DEFINITION
✔ A statistical model is a simplification of a real-world situation,
usually expressed as a set of equations. It can be used to make
predictions about a real-world problem. By analyzing and refining
the model, an improved understanding may be obtained.
4
STATISTICAL
MODELING :
ADVANTAGES
• The model is quick and easy
to produce
• Helps our understanding of
the real-world problem
• Helps us to make predictions
• Helps us control a situation,
e.g., railway timetables, air
traffic control, etc.
5
STATISTICAL MODELING :
DISADVANTAGES
✔ The model simplifies the
situation and only describes a
part of the real-world problem.
✔ The model may only work in
certain situations, or for a
particular range of values.
6
REPRESENTATION OF SAMPLE DATA
Variables:
1. Qualitative variables: Non-numerical, e.g., red, blue, or long, short, etc.
2. Quantitative variables: Numerical, e.g., length, age, time, number of coins in pocket, etc.
3. Continuous variables: Can take any value within a given range, e.g., height, time, age, etc.
4. Discrete variables: Can only take certain values, e.g., shoe size, cost in $ and cents, number of coins
7
FREQUENCY DISTRIBUTIONS
Frequency Distribution:
✔ A distribution is best thought of as a table. Thus, a frequency distribution can be thought of as a frequency table, i.e., a list of discrete values and their frequencies.
Example:
The number of M&Ms is counted in several bags, and recorded in the frequency distribution/table below:
Number of M&Ms:  37  38  39  40  41  42  43
Frequency:        3   8  11  19  13   7   2
8
FREQUENCY DISTRIBUTIONS
Cumulative Frequency:
✔ Add up the frequencies as you go down/along the list
Example:
The number of M&Ms is counted in several bags, and recorded in the frequency distribution/table below:
Number of M&Ms:        37  38  39  40  41  42  43
Frequency:              3   8  11  19  13   7   2
Cumulative Frequency:   3  11  22  41  54  61  63
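The running total in the M&Ms table can be reproduced with a short Python sketch. The variable names are illustrative; the values and frequencies are the ones from the table.

```python
from itertools import accumulate

# Values and frequencies from the M&Ms frequency table
values = [37, 38, 39, 40, 41, 42, 43]
frequencies = [3, 8, 11, 19, 13, 7, 2]

# Each cumulative frequency is the sum of all frequencies up to that value
cumulative = list(accumulate(frequencies))

for value, freq, cum in zip(values, frequencies, cumulative):
    print(f"{value}: frequency={freq}, cumulative={cum}")
```

The last cumulative frequency always equals the total number of observations (here, the number of bags counted).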
9
MEASURES OF
CENTRAL TENDENCY
10
WHAT IS A CENTRAL TENDENCY?
✔ Central tendency is defined as “the statistical measure that identifies a single value as representative of an entire distribution”
✔ It aims to provide an accurate description of the entire data. It is the single value that is most typical/representative of the collected data
✔ The term “number crunching” is used to illustrate this aspect of data description
11
✔ The mean is the most commonly used measure of central tendency
12
CENTRAL TENDENCY: THE MEAN
The Arithmetic Mean:
✔ The arithmetic mean (or simply, “mean”) is the average. It is computed by adding all the values in the data set and dividing the sum by the number of observations in it.
✔ If we have the raw data, the mean is given by the formula:
x̄ = Σx / n (the sum of all values divided by the number of observations n)
13
CENTRAL TENDENCY: THE MEAN
Example:
We have the numbers of friends of 11 Facebook users. Using the formula, calculate the mean:
Sum = 22 + 40 + 53 + 57 + 93 + 98 + 103 + 108 + 116 + 121 + 252 = 1063
Mean = 1063 / 11 = 96.64
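The same calculation can be checked with a few lines of Python. This is only a sketch; the list is the Facebook data from the example, and the variable names are illustrative.

```python
from statistics import mean

# Number of friends of the 11 Facebook users in the example
friends = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

total = sum(friends)      # add up all the values
average = mean(friends)   # divide the sum by the number of observations

print(f"sum = {total}, n = {len(friends)}, mean = {average:.2f}")
```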
14
CENTRAL TENDENCY: THE MEDIAN
✔ The median is the middle number in a sorted, ascending or descending, list of numbers and can be more descriptive of that data set than the average.
✔ The median is sometimes used as opposed to the mean when there are outliers in the sequence that might skew the average of the values
15
CENTRAL TENDENCY: THE MEDIAN
Example:
✔ The median of 4, 1, and 7 is 4 because when the numbers are put in order (1, 4, 7), the number 4 is in the middle.
16
CENTRAL TENDENCY: THE MODE
✔ The mode is the value that appears most often in a set of data values
Example:
If X is a discrete random variable, the mode is the value x (i.e., X = x) at which the probability mass function takes its maximum value.
- In other words, it is the value that is most likely to be sampled.
17
CENTRAL TENDENCY: THE MODE
✔ The mode is not necessarily unique to a given discrete distribution, since the probability mass function may take the same maximum value at several points x1, x2, etc.
o Bimodal:
- Having two modes
o Multimodal:
- Having several modes
18
All the Distributions:
CENTRAL
TENDENCY :
THE MODE
19
MEASURES OF CENTRAL TENDENCY - When to use mode, median and mean:
Mode:
You should use the mode if the data is qualitative (colour etc.) or if it is quantitative (numbers) with a clearly defined mode (or two modes). It is not much use if the distribution is fairly even.
Median:
You should use this for quantitative data (numbers) when the data is skewed, i.e., when the median, mean and mode are probably not equal, and when there might be extreme values (outliers).
Mean:
This is for quantitative data (numbers) and uses all pieces of data, so it reflects every observation. It should only be used if the data is fairly symmetrical (not skewed), because the mean can be distorted by extreme values (outliers).
20
MEASURES OF
VARIABILITY
21
MEASURES OF VARIABILITY
• Range & Interquartile Range
• Variance & Standard Deviation
• Measurement of relationship between variables
22
MEASURES OF VARIABILITY: RANGE
✔ Range:
The range is the largest number minus the smallest (including outliers)
Example:
Number of friends of 11 Facebook users:
22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
23
MEASURES OF VARIABILITY: INTERQUARTILE RANGE
Example:
The difference between the score representing the 75th percentile and the score representing the 25th percentile is the interquartile range. This value gives us the range of the middle 50% of the values in the data set.
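A small Python sketch of both measures, using the Facebook data from the range example. Quartile conventions differ between textbooks and software, so the interquartile range computed here (with the default method of `statistics.quantiles`) may not match a hand calculation exactly.

```python
from statistics import quantiles

# Number of friends of the 11 Facebook users
friends = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

# Range: largest value minus smallest value (outliers included)
data_range = max(friends) - min(friends)

# Quartiles: the three cut points that split the sorted data into four parts.
# Assumption: Python's default "exclusive" method; other methods give
# slightly different cut points.
q1, q2, q3 = quantiles(friends, n=4)
iqr = q3 - q1  # spread of the middle 50% of the values

print(f"range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Note how the single value 252 stretches the range a great deal, while the interquartile range ignores it.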
24
MEASURES OF VARIABILITY: VARIANCE AND
STANDARD DEVIATION
✔ The standard deviation is the square root of the average squared deviation from the mean.
The average squared deviation from the mean is also known as the variance
25
MEASURES OF VARIABILITY: UNDERSTANDING AND CALCULATING THE STANDARD DEVIATION
✔ Computers are used extensively for calculating the standard deviation and other statistics.
✔ However, calculating the standard deviation by hand once or twice can be helpful in developing an understanding of its meaning!
26
MEASURES OF VARIABILITY: UNDERSTANDING AND CALCULATING THE STANDARD DEVIATION
Example: Calculating the variance and standard deviation:
Consider the observations 8, 25, 7, 5, 8, 3, 10, 12, 9
1. Determine n, which is the number of data values (here n = 9)
2. Calculate the arithmetic mean, which is the sum of scores divided by n: mean = (8+25+7+5+8+3+10+12+9)/9 = 87/9 ≈ 9.67
3. Subtract the mean from each individual score to find the individual deviations
4. Square the individual deviations
5. Find the sum of the squares of the deviations. Can you see why we squared them before adding the values?
6. Divide the sum of the squares of the deviations by n-1. Well done, this is the variance!
7. Take the square root of the variance to obtain the standard deviation, which has the same units as the original data
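The steps above can be followed one by one in Python. This is a sketch of the hand calculation, with the library functions `statistics.variance` and `statistics.stdev` used only as a cross-check; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

observations = [8, 25, 7, 5, 8, 3, 10, 12, 9]

n = len(observations)                         # number of data values
mean = sum(observations) / n                  # arithmetic mean
deviations = [x - mean for x in observations] # individual deviations
squared = [d ** 2 for d in deviations]        # squared deviations
sum_of_squares = sum(squared)                 # sum of squared deviations
sample_variance = sum_of_squares / (n - 1)    # divide by n - 1
sample_sd = sqrt(sample_variance)             # back to the original units

print(f"n = {n}, mean = {mean:.2f}")
print(f"sum of squares = {sum_of_squares:.2f}")
print(f"variance = {sample_variance:.2f}, standard deviation = {sample_sd:.2f}")

# Cross-check against the standard library (both use the n - 1 divisor)
print(variance(observations), stdev(observations))
```

Without squaring, the positive and negative deviations would cancel and always sum to zero, which is why the deviations are squared before they are added.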
30
MEASUREMENTS OF RELATIONSHIPS BETWEEN VARIABLES
31
Graph of Two Measurement Variables
/EXAMPLE 1
1. What is your height (inches)?
2. What is your weight (lbs)?
Scatterplot:
Notice we have two different measurement variables. It would be inappropriate to put these two variables on side-by-side boxplots because they do not have the same units of measurement. Comparing height to weight is like comparing apples to oranges. However, we do want to put both of these variables on one graph so that we can determine if there is an association (relationship) between them.
33
Graph of Two Measurement Variables
/EXAMPLE 3
1. About how many hours do you typically study each week?
2. About how many hours do you typically exercise each week?
34
CORRELATION &
REGRESSION
35
Remember!
Significance
✔ Descriptive methods (that describe attributes of a
data set) and
36
Correlation
✔ Many relationships between two measurement variables tend to fall close to a straight line.
✔ In other words, the two variables exhibit a linear relationship.
✔ It is also helpful to have a single number that will measure the strength of the linear relationship between the two variables. This number is the correlation.
✔ The correlation is a single number that indicates how close the values fall to a straight line.
✔ Correlation quantifies both the strength and direction of the linear relationship between the two measurement variables.
37
CORRELATION FOR EXAMPLES 1-3
EXAMPLE     VARIABLES           CORRELATION (r)
Example 1   Height and Weight   r = 0.541
38
FEATURES
✔ The correlation of a sample is represented by the letter r
39
FEATURES
✔ A negative correlation indicates a negative linear association. The strength of the negative linear association increases as the correlation becomes closer to -1
40
FEATURES OF CORRELATION
✔ The correlation is independent of the original units of the two variables. This is because the correlation depends only on the relationship between the standard scores of each variable
✔ The correlation is calculated using every observation in the data set
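To illustrate why the units do not matter, here is a minimal Python sketch that computes r from standard scores. The height and weight values are hypothetical (they are not the survey data from Example 1), and the helper `correlation` is written out by hand for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r as the average product of standard scores (n - 1 divisor)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    zx = [(x - mx) / sx for x in xs]  # standard scores of x
    zy = [(y - my) / sy for y in ys]  # standard scores of y
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical measurements for eight people
heights_in = [62, 64, 66, 67, 69, 70, 72, 74]          # inches
weights_lb = [120, 136, 128, 155, 150, 172, 165, 190]  # pounds

# The same people measured in metric units
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_imperial = correlation(heights_in, weights_lb)
r_metric = correlation(heights_cm, weights_kg)

print(f"r (inches, lbs) = {r_imperial:.3f}")
print(f"r (cm, kg)      = {r_metric:.3f}")
```

Changing units rescales each variable, but the standard scores stay the same, so both calls give the same r.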
41
FEATURES OF CORRELATION
✔ As you compare the scatterplots of the data from the three examples with their actual correlations, you should notice that findings are consistent for each example.
✔ In Example 1, the scatterplot shows a positive association between weight and height. However, there is still quite a bit of scatter around the pattern. Consequently, a correlation of .541 is reasonable. It is common for a correlation to decrease as sample size increases.
✔ In Example 2, the scatterplot shows a negative association between monthly rent and distance from campus. Since the data points are very close to a straight line it is not surprising the correlation is -.903.
✔ In Example 3, the scatterplot does not show any strong association between exercise hours/week and study hours/week. This lack of association is supported by a correlation of .109.
42
✔ A statistically significant relationship is one that is
large enough to be unlikely to have occurred in the
sample if there's no relationship in the population.
43
KEY CAVEATS WITH CORRELATIONS
✔ There are three key caveats that must be recognized with regard to correlation.
✔ It is impossible to prove causal relationships with correlation. However, the strength of the evidence for such a relationship can be evaluated by examining and eliminating important alternate explanations for the correlation seen.
✔ Outliers can substantially inflate or deflate the correlation.
✔ Correlation describes the strength and direction of the linear association between variables. It does not describe non-linear relationships
44
CORRELATION AND CAUSATION
✔ It is often tempting to suggest that, when the correlation is statistically significant, the change in one variable causes the change in the other variable
✔ However, outside of randomized experiments, there are numerous other possible reasons that might underlie the correlation
45
Check for the possibility that the response (y) might
be directly affecting the explanatory(x) variable
(rather than the other way around).
46
CORRELATION AND CAUSATION
Check whether changes in the explanatory (x) variable contribute, along with other variables, to changes in the response (y).
For example, the amount of dry brush in a forest does not cause a forest fire; but it will contribute to it if a fire is ignited.
47
CORRELATION AND CAUSATION
Check for confounders or common causes that may affect both the explanatory and response variables.
For example, there is a moderate association between whether a baby is breastfed or bottle-fed and the number of incidences of gastroenteritis recorded on medical charts (with the breastfed babies showing more cases).
48
CORRELATION AND CAUSATION
Check whether the association between the variables might be just a matter of coincidence.
This is where a check for the degree of statistical significance would be important.
However, it is also important to consider whether the search for significance was a priori or a posteriori.
For example, a story in the national news one year reported that at a hospital in Potsdam, New York, 15 babies in a row were all boys.
Does that indicate that something at that hospital was causing more male than female births? Clearly, the answer is no, even if the chance of having 15 boys in a row is quite low (about 1 chance in 33,000). But there are over 5000 hospitals in the United States and the story would be just as newsworthy if it happened at any one of them at any time of the year and for either 15 boys in a row or for 15 girls in a row. Thus, it turns out that we actually expect a story like this to happen once or twice a year somewhere in the United States.
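The "1 chance in 33,000" figure can be checked with a short Python sketch, assuming each birth is independently a boy with probability 0.5 (a simplification made here for illustration).

```python
# Assumption: each birth is independently a boy with probability 0.5
p_boys = 0.5 ** 15          # 15 boys in a row
one_in = 1 / p_boys

# The story would be just as newsworthy for 15 girls in a row
p_same_sex = 2 * p_boys

print(f"15 boys in a row: about 1 chance in {one_in:,.0f}")
print(f"15 of the same sex in a row: about 1 chance in {1 / p_same_sex:,.0f}")
```

A rare event at one named hospital becomes far less surprising once every hospital and every run of births during the year counts as an opportunity for it to happen.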
49
Check whether both variables may have changed
together over time or space.
50
EFFECT OF OUTLIERS ON CORRELATION
Scatterplot of the relationship between the Infant Mortality Rate and the Percent of Juveniles Not Enrolled in School for each of the 50 states plus the District of Columbia:
The correlation is 0.73 but looking at the plot it can be observed that for the 50 states alone the relationship is not nearly as strong as a 0.73 correlation would suggest. Here, the District of Columbia (identified by the X) is a clear outlier in the scatter plot, being several standard deviations higher than the other values for both the explanatory (x) variable and the response (y) variable. Without Washington D.C. in the data, the correlation drops to about 0.5.
51
CORRELATION AND OUTLIERS
Correlations measure linear association: the degree to which relative standing on the x list of numbers (as measured by standard scores) is associated with the relative standing on the y list. Since means and standard deviations, and hence standard scores, are very sensitive to outliers, the correlation will be as well.
In general, the correlation will either increase or decrease, based on where the outlier is relative to the other points remaining in the data set. An outlier in the upper right or lower left of a scatterplot will tend to increase the correlation while outliers in the upper left or lower right will tend to decrease a correlation
https://www.youtube.com/watch/XZ9vrVmvbj8
https://www.youtube.com/watch/7YsQ9xwjryo
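The effect of an outlier's position on the correlation can be illustrated with a small Python sketch. All data points below are hypothetical and chosen only to show the direction of the effect; the helper `correlation` is written out by hand.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with only a weak positive association
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 3, 6, 4, 7, 4, 6, 5]

r_base = correlation(xs, ys)
# One extra point far away in the upper right of the scatterplot
r_upper_right = correlation(xs + [20], ys + [20])
# One extra point far away in the lower right of the scatterplot
r_lower_right = correlation(xs + [20], ys + [0])

print(f"without outlier:          r = {r_base:.2f}")
print(f"outlier in upper right:   r = {r_upper_right:.2f}")
print(f"outlier in lower right:   r = {r_lower_right:.2f}")
```

A single point is enough to move r substantially in either direction, which is why a scatterplot should always be inspected alongside the number.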
52
REGRESSION
Regression is a descriptive method used with two different measurement variables to find the best straight line (equation) to fit the data points on the scatterplot.
A key feature of the regression equation is that it can be used to make predictions. In order to carry out a regression analysis, the variables need to be designated as either the:
53
Explanatory or Predictor Variable = x (on horizontal axis)
Response or Outcome Variable = y (on vertical axis)
54
REVIEW: EQUATION OF A LINE
55
REVIEW:
EQUATION
OF A LINE
56
EXAMPLE 1: EXAMPLE OF REGRESSION
Consider the following two variables for a sample of ten Stat 100 students:
x = quiz score
y = exam score
57
EXAMPLE 1: EXAMPLE OF REGRESSION EQUATION
Can we predict the exam score based on the quiz score for students who come from this same population?
To make that prediction we notice that the points generally fall in a linear pattern so we can use the equation of a line that will allow us to put in a specific value for x (quiz) and determine the best estimate of the corresponding y (exam).
The line represents our best guess at the average value of y for a given x value and the best line would be one that has the least variability of the points around it (i.e. we want the points to come as close to the line as possible).
Remembering that the standard deviation measures the deviations of the numbers on a list about their average, we find the line that has the smallest standard deviation for the distance from the points to the line. That line is called the regression line or the least squares line.
58
EXAMPLE 1: EXAMPLE OF REGRESSION
EQUATION
• Least squares essentially finds the line that is closer to all the data points than any other possible line.
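A minimal Python sketch of a least squares fit. The quiz and exam scores are hypothetical (the ten students' actual scores are not reproduced in the text), and the function names are illustrative.

```python
from statistics import mean

# Hypothetical scores for ten students: x = quiz, y = exam
quiz = [0, 2, 4, 6, 6, 8, 8, 10, 10, 10]
exam = [38, 52, 61, 60, 75, 70, 82, 85, 90, 96]

def least_squares(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared errors."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def sum_squared_errors(slope, intercept):
    """Total squared vertical distance from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(quiz, exam))

slope, intercept = least_squares(quiz, exam)
print(f"exam = {intercept:.2f} + {slope:.2f} * quiz")

# Prediction for a (hypothetical) student with a quiz score of 7
print(f"predicted exam score: {intercept + slope * 7:.1f}")

# Any other line gives a larger sum of squared errors
print(sum_squared_errors(slope, intercept))
print(sum_squared_errors(slope + 0.5, intercept))
```

Nudging the slope or intercept away from the fitted values always increases the sum of squared errors, which is what "least squares" means.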
59
EXAMPLE 1: EXAMPLE OF REGRESSION EQUATION
60
The Normal Distribution
61
THE NORMAL DISTRIBUTION
Bell Shaped
Symmetrical
Mean (μ), Median and Mode are EQUAL
62
THE NORMAL
DISTRIBUTION
63
THE NORMAL
DISTRIBUTION
64
THE STANDARDIZED NORMAL
DISTRIBUTION
Any normal distribution (with any mean and standard deviation combination) can be transformed into the standardized normal distribution (Z)
We need to transform X units (normal distribution, with units) into Z units (standardized normal distribution, no units). E.g.:
• NORMAL DISTRIBUTION: “X-units” -> 3 km, 2 m, 1 year, 10 min, 400 g, 45 km/h, etc.
• STANDARDIZED NORMAL DISTRIBUTION: No units at all!
Recipe:
• 1st step: calculate μ and σ of the data sample (values X)
• 2nd step: subtract the mean (μ) from every value (X): X – μ
• 3rd step: divide by σ (provided that σ is NOT equal to 0): Z = (X – μ) / σ
The standardized normal distribution (Z) has a mean of 0 and a standard deviation of 1
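The recipe can be sketched in Python as follows. The data are the Facebook friend counts used earlier in this module, chosen here only as an illustration; the population standard deviation (`pstdev`) stands in for σ.

```python
from statistics import mean, pstdev

# Illustrative data in "X units" (number of friends)
x = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

# 1st step: calculate mu and sigma
mu = mean(x)
sigma = pstdev(x)  # treated as the population standard deviation

# 2nd and 3rd steps: subtract mu from every value, then divide by sigma
if sigma == 0:
    raise ValueError("sigma must not be 0")
z = [(value - mu) / sigma for value in x]

# The standardized values have no units, mean 0 and standard deviation 1
print([round(v, 2) for v in z])
print(f"mean of z = {mean(z):.6f}, standard deviation of z = {pstdev(z):.6f}")
```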
65
TRANSLATION TO THE
STANDARDIZED NORMAL
DISTRIBUTION
66
THE STANDARDIZED
NORMAL DISTRIBUTION
67
THE STANDARDIZED NORMAL DISTRIBUTION
EXAMPLE:
If X is distributed normally with mean of $100 and standard deviation of $50, the Z value for X = $200 is:
Z = (X – μ) / σ = (200 – 100) / 50 = 2.0
This says that X = $200 is two standard deviations (2 increments of $50 units) above the mean of $100.
68
COMPARING X AND Z UNITS
69
FINDING NORMAL
PROBABILITIES
• Probability is
measured by the
area under the curve
70
PROBABILITY AS AREA UNDER
THE CURVE
71
The general normal distribution
Exercise:
The results of an examination were Normally distributed. 10% of the candidates had more than 70 marks and 20% had fewer than 35 marks.
72
The general normal distribution
• Solution:
73
The general normal distribution
• Exercise:
The weights of chocolate bars are normally distributed with mean 205 g and standard deviation 2.6 g. The stated weight of each bar is 200 g.
(a) Find the probability that a single bar is underweight.
(b) Four bars are chosen at random. Find the probability that fewer than two bars are underweight.
74
The general normal distribution
• Solution:
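One way to check an answer to the chocolate bar exercise is the following Python sketch. It assumes "underweight" means lighter than the stated 200 g and that the four bars are independent, so part (b) is treated as a binomial calculation; the variable names are illustrative.

```python
from math import comb
from statistics import NormalDist

# Weights of bars: normal with mean 205 g and standard deviation 2.6 g
weights = NormalDist(mu=205, sigma=2.6)

# (a) Probability that a single bar weighs less than the stated 200 g
z = (200 - 205) / 2.6
p_under = weights.cdf(200)

# (b) Four independent bars: fewer than two underweight means 0 or 1
n = 4
p_fewer_than_two = sum(
    comb(n, k) * p_under ** k * (1 - p_under) ** (n - k) for k in range(2)
)

print(f"z = {z:.2f}")
print(f"(a) P(underweight) = {p_under:.4f}")
print(f"(b) P(fewer than two of four underweight) = {p_fewer_than_two:.4f}")
```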
75
Z-SCORE: What is a Z-Score?
Simply put, a z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is.
A z-score can be placed on a normal distribution curve. Almost all z-scores range from -3 standard deviations (which would fall to the far left of the normal distribution curve) up to +3 standard deviations (which would fall to the far right of the normal distribution curve).
In order to use a z-score, you need to know the mean μ and the population standard deviation σ.
76
The basic z score formula for a sample is:
z = (x – μ) / σ
77
Z-SCORE FORMULA: ONE SAMPLE
• You may also see the z score formula written as z = (x – x̄) / s. This is the same formula as z = (x – μ) / σ, except that x̄ (the sample mean) is used instead of μ (the population mean) and s (the sample standard deviation) is used instead of σ (the population standard deviation). However, the steps for solving it are the same.
78
How to Calculate a Z-Score
Example question: You take the SAT and score 1100. The mean score for the SAT is 1026 and the standard deviation is 209. How well did you score on the test compared to the average test taker?
79
How to Calculate a Z-Score
80
How to Calculate a Z-Score
✔ Step 4: Find the answer using a calculator: (1100 – 1026) / 209 = .354. This means that your score was .354 std devs above the mean.
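The SAT calculation can be reproduced in Python. The last step, converting the z-score into a share of test takers, assumes SAT scores are approximately normally distributed; that assumption is made here for illustration.

```python
from statistics import NormalDist

score = 1100   # your SAT score
mu = 1026      # mean SAT score
sigma = 209    # standard deviation of SAT scores

# z = (x - mu) / sigma
z = (score - mu) / sigma
print(f"z = {z:.3f}")  # number of standard deviations above the mean

# Assumption: scores are roughly normal, so the standard normal
# distribution gives the share of test takers scoring below you
share_below = NormalDist().cdf(z)
print(f"share of test takers below this score: {share_below:.1%}")
```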
81
When you have multiple samples and want to describe how far a sample mean x̄ lies from the population mean, measured in units of the standard deviation of those sample means (the standard error), you would use this z score formula:
z = (x̄ – μ) / (σ / √n)
82
THE STANDARDIZED
NORMAL TABLE
83
THE
STANDARDIZED
NORMAL TABLE
84
GENERAL PROCEDURE FOR FINDING NORMAL PROBABILITIES
To find P(a < X < b) when X is distributed normally:
Draw the normal curve for the problem in terms of X
85
FINDING NORMAL PROBABILITIES
Let X represent the time it takes, in seconds, to download an image file from the internet. Suppose X is normal with a mean of 18.0 seconds and a standard deviation of 5.0 seconds. Find P(X < 18.6)
86
SOLUTION: FINDING P(Z<0.12)
87
FINDING NORMAL UPPER TAIL PROBABILITIES
✔ Suppose X is normal with mean 18.0 and standard deviation 5.0
✔ Now find P(X > 18.6)
88
FINDING NORMAL UPPER TAIL
PROBABILITIES
• DON’T FORGET: P(Z<a) + P(Z>=a) = 1, so P(Z>=a) = 1 - P(Z<a)
89
FINDING NORMAL PROBABILITIES BETWEEN TWO VALUES
• Suppose X is normal with mean 18.0 and standard deviation 5.0. Find P(18 < X < 18.6):
90
SOLUTION: FINDING P(0<Z<0.12)
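The three download-time probabilities (lower tail, upper tail, and between two values) can be computed with Python's `statistics.NormalDist` in place of a printed Z-table. This is a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

# Download time X: normal with mean 18.0 s and standard deviation 5.0 s
download = NormalDist(mu=18.0, sigma=5.0)

# Transform X = 18.6 into Z units
z = (18.6 - 18.0) / 5.0

# Lower tail: P(X < 18.6) = P(Z < 0.12)
p_less = download.cdf(18.6)

# Upper tail: P(X > 18.6) = 1 - P(X < 18.6)
p_more = 1 - p_less

# Between two values: P(18 < X < 18.6) = P(X < 18.6) - P(X < 18)
p_between = download.cdf(18.6) - download.cdf(18.0)

print(f"z = {z:.2f}")
print(f"P(X < 18.6)      = {p_less:.4f}")
print(f"P(X > 18.6)      = {p_more:.4f}")
print(f"P(18 < X < 18.6) = {p_between:.4f}")
```

The "between" probability is the difference of two cumulative probabilities, which corresponds to the area under the curve between the two values.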
91
PROBABILITIES IN THE LOWER
TAIL
92
PROBABILITIES IN
THE LOWER TAIL
93
EMPIRICAL
RULES
What can we say about
the distribution of values
around the mean? For any
normal distribution:
94
THE EMPIRICAL
RULE
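The empirical rule is usually stated as roughly 68%, 95% and 99.7% of values lying within 1, 2 and 3 standard deviations of the mean. A short Python sketch confirms these shares from the standardized normal distribution; the dictionary name is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standardized normal: mean 0, standard deviation 1

# Share of values within k standard deviations of the mean
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} standard deviation(s) of the mean: {share:.2%}")
```

Because any normal distribution can be standardized, the same shares apply whatever the mean and standard deviation are.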
95
Introduction to
Probabilities
Module 1
96
97
Basic probability
concepts
• Probability: the chance that an uncertain
event will occur (always between zero and
one)
• Impossible event: an event that has no
chance of occurring (probability = 0)
• Certain event : an event that is sure to occur
(probability = 1)
98
Assessing probability
•
99
Example of empirical probability
•
100
Events
Each possible outcome of a variable is an event.
• Simple event
• An event described by a single characteristic
• E.g. A day in January from all days in 2013
• Joint event
• An event described by two or more characteristics
• E.g. A day in January that is also a Wednesday from all days in 2013
• Complement of an event A (denoted as A’)
• All events that are not part of event A
• E.g. All days from 2013 that are not in January
101
Sample space
• The sample space is the collection of all possible
events
• E.g. All 6 faces of a die:
102
Organizing & Visualizing Events
103
• Simple probability refers to the probability of a simple
Definition: Simple event.
e.g. P(Jan.)
Probability e.g. P(Wed.)
104
Definition:
Joint
probability
• Joint probability refers
to the probability of an
occurrence of two or
more events (joint
event).
• E.g. P(Jan. and
Wed.)
• E.g. P(Not Jan. and
Not Wed.)
105
Mutually exclusive events
106
Collectively exhaustive events
• Events A , B ,C , D are collectively exhaustive (but not mutually exclusive - a weekday can
be in January or in spring)
• events A and B are collectively exhaustive and also mutually exclusive
107
Computing joint and marginal probabilities
• P(A) = P(A and B1) + P(A and B2) + … + P(A and Bk)
• Where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events
108
Joint
Probability
Example
109
Marginal probability example
110
Marginal and
Joint
probabilities in
a contingency
table
111
Probability Summary so far
0 = impossible
112
General Addition Rule
• General addition rule: P(A or B) = P(A) + P(B) – P(A and B)
113
General
addition
rule
example
114
Computing conditional probabilities
115
Conditional probability example
• Of the cars on a used car lot, 70% have air conditioning (AC) and 40% have a GPS. 20% of the cars have both.
• What is the probability that a car has a GPS, given that it has AC?
• I.e., we want to find P(GPS | AC)
116
Conditional Probability Example (Continued)
• Of the cars on a used car lot, 70% have air conditioning (AC) and 40%
have a GPS and 20% of the cars have both.
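The conditional probability can be worked out in a few lines of Python using the definition P(GPS | AC) = P(GPS and AC) / P(AC). The variable names are illustrative.

```python
# Probabilities given for the used car lot
p_ac = 0.70     # P(AC)
p_gps = 0.40    # P(GPS)
p_both = 0.20   # P(GPS and AC)

# Conditional probability: restrict attention to the cars that have AC
p_gps_given_ac = p_both / p_ac

print(f"P(GPS | AC) = {p_both} / {p_ac} = {p_gps_given_ac:.4f}")
```

Among cars with AC, about 28.6% have a GPS, which is lower than the 40% share across the whole lot.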
117
118
119
Conditional
Probability Example
(Continued)
120
• Two events A and B are independent if and only if: P(A | B) = P(A)
121
Multiplication rules
• Multiplication rule for two events A and B: P(A and B) = P(A | B) P(B)
122
Statistics
Practical Examples
Module 1
123
Statistical Modeling
• A statistical model is an attempt to describe a real-world problem using
mathematical equations. This model can be used to make predictions, and
later we can analyze and refine it in order to produce even better
predictions.
• However: the model simplifies the situation, describes only a part of the
real-world problem, and sometimes may only work for a particular range of
values.
124
Type of variables
• Discrete: Can only take certain values
• E.g. shoe size (a single number such as 40 in the EU system; we do not have to pass through all the smaller numbers to arrive at our current shoe size), cost in $ and euro (say $10 is 11.5 euros; it is just a conversion), number of coins (having 2 euros in my pocket does not require starting from 1 euro + 10 cents + …), number of people within a store for a time period of 5 minutes (it might be 0, 1, 2, 4, 6, … people; if 2 people enter the store we record the number 2 rather than counting 1 + 1)
• Continuous: Can take any value within a given range
• E.g. age, height, time, temperature (for example, we cannot simply say we are 25 years old: life begins, let's say, at 0 years old and we MUST pass through every value in between before reaching our current age of 25, and next year 26, then 27, …)
• Quantitative: Numerical values
• E.g. the length of a pair of trousers, our age, the time, the number of coins in a pocket, the temperature, the score in a game, etc.
• Qualitative: Non-numerical values
• E.g. red, blue, yellow, short, long, cold, hot, light, dark, etc.
125
Terms
• Values = Data Points = Numbers
When you hear these terms, you should always consider something like
this:
1, 2, 5, 8, 7, 3, 10, …
Or red, blue, green, yellow, etc.
126
Frequency Distribution
• Frequency distribution or Frequency table is how many times you see each value within a set of numbers:
• Example:
• The number of M&Ms:
Exercise: We gather the Instagram followers for 3 people. Find the Frequency Distribution (also called the Frequency Table):
100, 100, 90
Followers:  100  90
Frequency:    2   1
127
Cumulative Frequency
• Add up the frequencies as you go down/along the list:
Example:
• The number of Instagram followers:
Followers:             100    90         10                        120
Frequency:               3     2          1                          5
Cumulative Frequency:    3     3+2 = 5    5+1 = 6 (or 3+2+1 = 6)     6+5 = 11 (or 3+2+1+5 = 11)
So, the point is that the first number has no number before it, so its cumulative frequency is the same as its frequency. Each following cumulative frequency adds the current frequency to the frequencies of all the numbers before it.
128
Measures of Central Tendency
•
129
Mean
• Calculate the sum of the numbers divided by how many numbers there are: mean = Σx / n
• Remember!
• Σ (Greek letter) tells you to sum up all the numbers seen
130
Mean - Example
• I have the following numbers:
• 1 2 5 6 2 3 1
• Mean = (1 + 2 + 5 + 6 + 2 + 3 + 1) / 7 = 20 / 7 ≈ 2.86
131
Median
• Is the middle number in a sorted, ascending or descending, list of
numbers.
• The median is sometimes used as opposed to the mean when there
are outliers in the sequence that might skew the average of the values
Example:
• Step 1: Put the numbers in order
• I have the following numbers: 1 3 2 5 10 5 4
• Ascending order: 1 2 3 4 5 5 10
• Descending order: 10 5 5 4 3 2 1
132
Median
• Step 2: Ask yourself: is the middle number obvious? Can you find a number that has an equal number of values on both sides?
• 1 2 3 [4] 5 5 10, so the median is 4
Example 2:
I have the following numbers: 1 4 2 3 5 2
1) Put the numbers in order: 1 2 2 3 4 5
2) Look for the middle: 1 2 [2 3] 4 5
This time there is no single middle number, but there is a pair of middle numbers! So, in this case:
Take the sum of these two values and divide it by 2. This will be our median: (2 + 3) / 2 = 2.5
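Both median cases (odd and even number of values) can be sketched in Python. The helper `median_by_hand` is illustrative and follows the two steps above; `statistics.median` is used as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Sort the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)      # Step 1: put the numbers in order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                # odd count: one obvious middle number
        return ordered[middle]
    # even count: sum the two middle numbers and divide by 2
    return (ordered[middle - 1] + ordered[middle]) / 2

example_1 = [1, 3, 2, 5, 10, 5, 4]   # seven values
example_2 = [1, 4, 2, 3, 5, 2]       # six values

print(median_by_hand(example_1), median(example_1))
print(median_by_hand(example_2), median(example_2))
```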
133
Mode
It’s the value (or values) that you see most often within a set of values. To find it, count how many times you see each unique number.
134
Mode – Example
Example:
1 4 3 2 5 6 7 8 5
• Step 1: Unique values:
• 1 2 3 4 5 6 7 8
• Step 2: How many times do you see each one?
Value: 1 2 3 4 5 6 7 8
Count: 1 1 1 1 2 1 1 1
• Step 3: Which numbers are seen most frequently?
The value 5 has the highest count (2).
So, 5 is the mode.
135
Mode – Example (Bimodal)
Example:
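1 4 3 3 2 5 6 7 8 5
• Step 1: Unique values:
• 1 2 3 4 5 6 7 8
• Step 2: How many times do you see each one?
Value: 1 2 3 4 5 6 7 8
Count: 1 1 2 1 2 1 1 1
• Step 3: Which numbers are seen most frequently?
The values 3 and 5 share the highest count (2).
So, 3 and 5 are the modes (bimodal).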
136
Mode – Example (Multimodal)
Example:
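1 4 3 3 2 5 6 7 7 8 5
• Step 1: Unique values:
• 1 2 3 4 5 6 7 8
• Step 2: How many times do you see each one?
Value: 1 2 3 4 5 6 7 8
Count: 1 1 2 1 2 1 2 1
• Step 3: Which numbers are seen most frequently?
The values 3, 5 and 7 share the highest count (2).
So, 3, 5 and 7 are the modes (multimodal).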
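The three steps of the mode examples translate directly into Python. The helper `modes` is an illustrative name; it returns every value that shares the highest count, so it covers the unimodal, bimodal and multimodal cases.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)                 # Steps 1-2: unique values and counts
    highest = max(counts.values())           # Step 3: the highest count
    return sorted(v for v, c in counts.items() if c == highest)

unimodal = [1, 4, 3, 2, 5, 6, 7, 8, 5]
bimodal = [1, 4, 3, 3, 2, 5, 6, 7, 8, 5]
multimodal = [1, 4, 3, 3, 2, 5, 6, 7, 7, 8, 5]

print(modes(unimodal))
print(modes(bimodal))
print(modes(multimodal))
```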
137
Terms
Extreme values = Outliers
• Some values that are far from the others
For example:
• 1, 2, 1, 5, 10, 100, 5
• 100 is far from the other numbers, so we call it an extreme value or an outlier.
138
When you should use each measure
Mode:
You should use the mode if the data is qualitative (colour etc.) or if quantitative
(numbers) with a clearly defined mode (or bi-modal). It is not much use if the distribution
is fairly even
Median:
You should use this for quantitative data (numbers), when the data is skewed, i.e., when
the median, mean and mode are probably not equal, and when there might be extreme
values (outliers)
Mean:
This is for quantitative data (numbers) and uses all pieces of data, so it reflects every observation.
It should only be used if the data is fairly symmetrical (not skewed), because the
mean can be distorted by extreme values (outliers)
139
Measures of Variability
• Range
• Variance and Standard Deviation
140
Range
• The range is the largest number minus the smallest (including outliers)
• Example:
141
Variance
142
Standard deviation
143
Explanation of each term
144
Explanation of each term
145
Explanation of each term
146
Explanation of each term
147
Skewed
Distribution
148
Normal Distribution
149
Normal Distribution
150
Standardized Normal Distribution
• We transform the X unit (e.g. $,
people, etc.) of the values of
variable X into Z units.
151
A typical exercise - Examples
152
What we need to know for these exercises
1) Mean
2) Population’s Standard Deviation
153
Example for Case A
Find the probability that variable X < 200
1) We transform the X value to a Z-value (which we will call the Z-score from now on).
2) Because we have a negative z-score (the value is below the mean), we look up the positive value 1.92 in the Z-table and subtract the result from 1: P(Z < -1.92) = 1 – Φ(1.92).
154
Example for Case A (2)
Find the probability that variable X < 200
Let’s say we have a data set which is normally distributed with a mean of 150 and a standard deviation of 20. Find the probability that a value is less than 200:
1) Z = (200-150)/20 = 50/20 = 2.5
2) P(X<200) = P(Z<2.5) = Φ(2.5) = 0.9938
Steps:
1) We transform the X value to a Z-value (which we will call the Z-score from now on)
2) 2.5 is the value we look up in the Z-table
3) The value found is the probability that the initial case (X < 200) will happen
155
Example for Case B
Find the probability that variable X >= 100
P(X >= 100) where μ = 55 and σ = 20.
Z = (100 – 55) / 20 = 2.25
P(X >= 100) = P(Z >= 2.25) = 1 – Φ(2.25) = 1 – 0.9878 = 0.0122 = 1.22%
Steps:
1) We transform the X value to a Z-value (which we will call the Z-score from now on).
2) Because we have >= (greater than or equal to), we look up 2.25 in the Z-table and then subtract that value from 1.
156
Example for Case C
Find the probability that 50 < X < 100
P(50 < X < 100) where μ = 55 and σ = 20.
P(X < 100) = P(Z < 2.25) = Φ(2.25) = 0.9878 = 98.78%
P(X < 50) = P(Z < -0.25) = Φ(-0.25) = 0.4013 = 40.13% (Z-table for negative values)
The final result is:
P(50 < X < 100) = Φ(2.25) – Φ(-0.25) = 0.9878 – 0.4013 = 0.5865 = 58.65%
Steps:
1) In this case we need two cumulative probabilities: P(X < 100) and P(X < 50).
2) We transform each X value to a Z-value (which we will call the Z-score from now on).
3) We subtract the smaller cumulative probability from the larger one; what remains is the area between the two values.
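Cases B and C can be checked without a printed Z-table using Python's `statistics.NormalDist`. This is a sketch with illustrative variable names; μ = 55 and σ = 20 are the values from the examples.

```python
from statistics import NormalDist

x = NormalDist(mu=55, sigma=20)

# Z-scores for the two boundaries
z_50 = (50 - 55) / 20     # -0.25
z_100 = (100 - 55) / 20   #  2.25

# Case B: upper tail, P(X >= 100) = 1 - P(X < 100)
p_at_least_100 = 1 - x.cdf(100)

# Case C: between two values, P(50 < X < 100) = P(X < 100) - P(X < 50)
p_between = x.cdf(100) - x.cdf(50)

print(f"z for 50 = {z_50}, z for 100 = {z_100}")
print(f"P(X >= 100)      = {p_at_least_100:.4f}")
print(f"P(50 < X < 100)  = {p_between:.4f}")
```

For a "between" probability, both terms must be cumulative (less-than) probabilities; subtracting an upper-tail probability from a lower-tail one does not give the area between the two values.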
157
Correlation
✔ Many relationships between two measurement variables tend to fall close to a straight line.
✔ In other words, the two variables exhibit a linear relationship.
✔ It is also helpful to have a single number that will measure the strength of the linear
relationship between the two variables. This number is the correlation.
✔ The correlation is a single number that indicates how close the values fall to a straight line.
✔ Correlation quantifies both the strength and direction of the linear relationship between the
two measurement variables.
✔ The range of possible values for a correlation is from -1 to +1.
✔ Correlation describes the strength and direction of the linear association between variables.
It does not describe non-linear relationships.
✔ Correlation is heavily impacted by outliers!
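To make these points concrete, here is a minimal Python sketch of the Pearson correlation coefficient. The function name `pearson_r` and the data are made up for illustration. It shows that a perfectly linear relationship gives a correlation of 1, while a clear but non-linear (U-shaped) relationship can give a correlation of 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]                      # made-up values
linear = [2 * v + 1 for v in x]          # exact straight line
u_shaped = [(v - 3) ** 2 for v in x]     # strong but non-linear pattern

r_linear = pearson_r(x, linear)
r_u_shaped = pearson_r(x, u_shaped)
print(r_linear, r_u_shaped)
```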
158
Type of Association (or Relationship)
159
Positive correlation
161
No association
The following two questions were asked on a
survey of 220 Stat 100 students:
1. About how many hours do you typically
study each week?
2. About how many hours do you typically
exercise each week?
Finally, we notice that as the number of
hours spent exercising each week increases,
there is no clear pattern in the hours spent
studying: neither a visible increase nor a
visible decrease.
Consequently, we say that there is
essentially no association between the two
variables.
The scatterplot does not show any strong
association between exercise hours/week
and study hours/week. This lack of
association is supported by a correlation of
0.109.
162
Outliers & Correlation
Scatterplot of the relationship between the Infant
Mortality Rate and the Percent of Juveniles Not Enrolled
in School for each of the 50 states plus the District of
Columbia:
The correlation is 0.73 but looking
at the plot it can be observed that
for the 50 states alone the
relationship is not nearly as strong
as a 0.73 correlation would suggest.
Here, the District of Columbia
(identified by the X) is a clear
outlier in the scatter plot being
several standard deviations higher
than the other values for both the
explanatory (x) variable and the
response (y) variable. Without
Washington D.C. in the data, the
correlation drops to about 0.5.
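The effect of a single outlier can be reproduced with a few lines of Python. The numbers below are synthetic and chosen only for illustration; they are not the state-level data from the scatterplot. Six points with a weak relationship get one extra point that is far above the rest on both axes, and the correlation rises sharply.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Synthetic data: a weak positive relationship.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 2, 6, 3, 5]
r_without = pearson_r(xs, ys)

# Add one point that is extreme on both x and y.
r_with = pearson_r(xs + [20], ys + [20])
print(f"without outlier: {r_without:.2f}, with outlier: {r_with:.2f}")
```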
163
Causation vs Correlation
• Causation is when we have two measurement variables and one causes the other
to happen. So, we have a cause-and-effect relationship between those two variables.
• Correlation is when we have two measurement variables and one is somehow
related to the other. They simply have a relationship, but one event
doesn’t necessarily cause the other event to happen.
• Causation explicitly applies to cases where action A causes outcome B. Correlation,
on the other hand, is simply a relationship: action A relates to action B, but one
event doesn’t necessarily cause the other event to happen.
164
Causation vs correlation
Causation produces an association between the variables, but correlation alone
does not imply causation!
There are two main reasons why correlation isn’t causation. These problems are
important to identify for drawing sound scientific conclusions from research:
• The third variable problem means that a confounding variable affects both variables
to make them seem causally related when they are not. For example, ice cream sales
and violent crime rates are closely correlated, but they are not causally linked with
each other. Instead, hot temperatures, a third variable, affect both variables
separately.
• The directionality problem is when two variables correlate and might actually have a
causal relationship, but it’s impossible to conclude which variable causes changes in
the other. For example, vitamin D levels are correlated with depression, but it’s not
clear whether low vitamin D causes depression, or whether depression causes
reduced vitamin D intake.
• You’ll need to use an appropriate research design to distinguish between correlational
and causal relationships.
165
Regression
• Regression is a descriptive method used with two different numerical variables to find
the best straight line (which in fact is a mathematical equation) to fit the data points
when looking at a scatterplot.
• Regression is used to make predictions, where X is the
explanatory/predictor/independent variable, and Y is the
response/outcome/dependent variable.
166
Regression
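• y = a + bx where a is the intercept (the value of y when x = 0) and b is the slope
(the change in y for each one-unit increase in x).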
167
Regression
168
Regression
169
Regression -
Example
• Least squares finds the line that is
closer to all the data points than any
other possible line: it minimizes the
sum of the squared vertical distances
between the points and the line.
• Based on the plot, some of the points
lie above the line, while other points
lie below it. In fact, the total vertical
distance from the line to the points
above it is exactly equal to the total
vertical distance from the line to the
points that fall below it.
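A least-squares line can be computed directly from the data. The sketch below uses the standard closed-form formulas for the slope b and intercept a of y = a + bx; the function name `fit_line` and the data points are made up for illustration. It also checks the property described above: the residuals above and below the fitted line cancel out, so they sum to (numerically) zero.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

xs = [1, 2, 3, 4, 5]               # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
a, b = fit_line(xs, ys)

# Vertical distance of each point from the line (positive = above).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"y = {a:.2f} + {b:.2f}x, sum of residuals = {sum(residuals):.1e}")
```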
170
Statistical Distributions
where x is the random variable, f(x) is the function, and p or 1-p are the
probabilities of different outcomes of function f(x) for the variable x.
171
Types of Statistical Distributions
Discrete
• Refers to discrete variables, which take only specific, separate values
(e.g. 1, 4, 6) rather than every value within a range, or even text categories
(e.g. red, white, tall, slim, etc.)
Continuous
• Refers to continuous variables, which can take any value within a specific range.
For example, temperature is a continuous variable because, to reach 37 degrees
Celsius, it passes through every value in between, such as 34, 36 and 36.5
degrees Celsius.
172
Most important Distributions
• Bernoulli Distribution
• Binomial Distribution
• Geometric Distribution
• Uniform Distribution
• Normal Distribution
• Poisson Distribution
173
Binomial Distribution
174
There are two parameters in the distribution, the success probability p and the number of trials n. The PMF
is defined using the combination formula:
Binomial
Distribution
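The formula image did not survive on this slide, so as an illustration here is the standard binomial PMF, P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), written as a short Python sketch. The function name `binomial_pmf` and the example parameters (10 trials, p = 0.5) are our own choices.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5   # e.g. 10 tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(k, round(prob, 4))
```

The probabilities over k = 0..n add up to 1, and with p = 0.5 the distribution is symmetric around n/2.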
175
Binomial distribution
A binomial distribution graph where the probability of success does
not equal the probability of failure looks like:
Now, when the probability of success equals the probability of failure,
the graph of the binomial distribution looks like:
176
Bernoulli Distribution
• Bernoulli distribution is a discrete distribution. The assumptions of Bernoulli distribution
include:
• Only two outcomes (e.g. 0 and 1)
• Only one trial
• Bernoulli distribution describes a random variable that only contains two outcomes. For
example, when tossing a coin one time, you can only get “Head” or “Tail”. We can also
generalize it by defining the outcomes as “success” and “failure”.
• Examples of Bernoulli distribution:
• We toss a coin, so we have a ½ chance of getting “Head” and a ½ chance of getting
“Tail”. So there are two outcomes (“Head”, “Tail”) and each one has a
probability of ½.
• We roll a die: getting a six is “Success” and everything else is “Failure”.
So, we have two outcomes and the Bernoulli distribution can be used.
177
Bernoulli Distribution
• The probability mass function of a random variable x
that follows the Bernoulli Distribution is:
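The formula image is missing here; as an illustration, the standard Bernoulli PMF is P(X = x) = p^x · (1 − p)^(1 − x) for x in {0, 1}. A minimal Python sketch follows (the function name `bernoulli_pmf` is our own), using the die example, where "success" means rolling a six.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a single trial with success probability p."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 and 1")
    return p ** x * (1 - p) ** (1 - x)

p_six = 1 / 6                         # success: rolling a six
p_success = bernoulli_pmf(1, p_six)   # equals p
p_failure = bernoulli_pmf(0, p_six)   # equals 1 - p
print(p_success, p_failure)
```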
178
Bernoulli
Distribution
179
Geometric Distribution
180
Geometric distribution
• When the random variable x is the number of
failures before the first success, the PMF is:
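For this definition (x counts the failures before the first success), the standard geometric PMF is P(X = x) = (1 − p)^x · p. The Python sketch below is an illustration with a made-up success probability; `geometric_pmf` is our own name.

```python
def geometric_pmf(x, p):
    """P(exactly x failures occur before the first success)."""
    if x < 0:
        raise ValueError("x must be 0 or greater")
    return (1 - p) ** x * p

p = 0.5                                # e.g. "Head" on a fair coin
first_try = geometric_pmf(0, p)        # success immediately
after_two_fails = geometric_pmf(2, p)  # Tail, Tail, Head
total = sum(geometric_pmf(x, p) for x in range(60))
print(first_try, after_two_fails, total)
```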
181
Geometric
Distribution
182
Uniform distribution
• Uniform distribution models a random variable whose outcomes are equally likely to
happen. The outcomes can be discrete, like the outcomes of rolling a die,
or continuous, like the waiting time for a bus to arrive. Thus, the uniform distribution
can be discrete or continuous, depending on the random variable. The
assumptions are:
1) there are n outcomes (discrete), or a range in which the outcomes fall
(continuous);
2) all values in the outcome set or the range are equally likely to occur.
183
Uniform distribution
• The discrete uniform distribution is straightforward and
easy to calculate. For a continuous uniform distribution
on the interval [a, b], the probability
density function (PDF) is:
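The continuous uniform PDF on [a, b] has the standard form f(x) = 1 / (b − a) for a ≤ x ≤ b and 0 elsewhere. As a sketch, the Python below applies it to the bus example, assuming (our assumption, for illustration only) that the waiting time is uniform between 0 and 10 minutes; the function names are hypothetical.

```python
def uniform_pdf(x, a, b):
    """Density of a continuous uniform distribution on [a, b]."""
    if a >= b:
        raise ValueError("a must be smaller than b")
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b]."""
    lo, hi = max(lo, a), min(hi, b)    # clip the interval to [a, b]
    return max(hi - lo, 0.0) / (b - a)

density = uniform_pdf(4, a=0, b=10)        # same height everywhere in [0, 10]
p_wait = uniform_prob(0, 3, a=0, b=10)     # wait at most 3 minutes
print(density, p_wait)
```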
184
Uniform distribution
185
Normal Distribution
186
Probability function
187
Normal
distribution
188
Poisson distribution
189
• The assumptions of the Poisson distribution are:
1) any successful event should not influence the outcome of other
successful events (observing one car in the first second doesn’t affect the
chance of observing another car in the next second);
2) the probability of success p is the same across all intervals (there
is no difference between this hour and any other hour when observing cars
passing by);
3) the probability of success p in an interval goes to zero as the
interval gets smaller (if we are discussing how many cars will pass in a
millisecond, the probability is close to zero because the time is too short);
Poisson distribution
• The PMF of the Poisson distribution can be derived from the PMF of the
binomial distribution:
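The link between the two distributions can be checked numerically. The standard Poisson PMF is P(X = k) = λ^k · e^(−λ) / k!, and it is the limit of the binomial PMF when the number of trials n grows while n · p = λ stays fixed. The sketch below uses made-up numbers (λ = 3 cars per interval, split into 10,000 tiny sub-intervals); the function names are our own.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events) when lam events are expected per interval."""
    return lam ** k * exp(-lam) / factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

lam, k = 3.0, 2
n = 10_000                         # many sub-intervals, each with tiny p
exact = poisson_pmf(k, lam)
approx = binomial_pmf(k, n, lam / n)
print(f"Poisson: {exact:.5f}, binomial with n={n}: {approx:.5f}")
```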
191
Useful Content
192
Z-Table
for Positive
Z-Values
193
Z-Table
for Negative
Z-Values
194
Excel
Module 1
195
Week 1
• Intro
• Insert/Delete
• Format cells/Conditional Formatting
196
What is Excel?
197
Basic Understanding
198
Key Features
199
Quick access toolbar and ribbon menu
200
Mini Toolbar
201
Insert/Delete a new Column/Row
When you have some data and you want to insert or delete a row or column at a
specific cell, do the following:
• Right-click on the cell
• Select Insert or Delete
202
Workbook vs Worksheet
203
Shortcuts for Start/End of a
row/column
• If we have a table with data in a worksheet, then we might need to go to the first
(last) cell that is filled with data. In order to do this, we have the following shortcuts:
• Ctrl + Home (Start of the table – First Cell)
• Ctrl + End (End of the table – Last Cell)
• Ctrl + Up Arrow (First cell of the column)
• Ctrl + Down Arrow (Last cell of the column)
• Ctrl + Right (Last cell of the row)
• Ctrl + Left (First cell of the row)
• Cases:
• If we want to go to the very first cell of the table (usually A1),
just press Ctrl + Home
• If we want to go to the last cell of the table that is filled with data (bottom-right
corner), just press Ctrl + End
204
Questions
205
Questions
• You would like to add some frequently used buttons to the Quick Access
Toolbar. How can you do this?
• Drag-and-drop the button to the Quick Access Toolbar
• Right-click the button on the ribbon and select Add
• Double-click the button on the ribbon and select Move
• Left-click the button on the ribbon and select Add
• You want to add an average in the worksheet shown below. You want to
learn more about the average function in Excel and think that a video will
assist you. How can you access training from the workbook?
• Click help from the menu bar
• Click Featured help
• Click tell me what you want to do
• Click tell me more
206
Handy shortcuts - Windows
• Ctrl + C = Copy
• Ctrl + V = Paste
• Ctrl + X = Cut
• Ctrl + A = Select All
• Ctrl + Z = Undo the last edit
• Ctrl + Y = Redo the last edit (if you pressed Undo
before)
• Ctrl + B = Bold text
• Ctrl + I = Italic text
• Ctrl + U = Underlined text
• Ctrl + N = New Workbook
• Ctrl + F = Find text (and optionally replace it)
• Ctrl + S = Save
• Ctrl + D = Fill down (copy the cell above into the selected cell)
• Ctrl + P = Print
207
Notes
• Notes are useful when you want to give further explanation to a cell.
In order to create a note:
• Right-click on the cell where you want to create the note
• Select New Note
• Write your note (it is auto saved)
208
Workbook Statistics
209
Data Types and Formatting
• Date
• Text
• General
• Currency
210
Handy tip for understanding the data type
• Let’s say we create a column that contains several dates. To check whether
each date was entered in a format that Excel recognizes, pay attention to
the alignment:
• Do you see that some “dates” are aligned to the right side of the cell
(with blank space at the beginning of the cell) and some
others to the left side of the cell (with blank space at
the end of the cell)?
• In this case, the right-aligned cells are in the correct
format as far as the Date data type is concerned, while the
2nd and the 4th cells are not of data type Date, but of data type
Text!
• The same applies to numbers: when Excel recognizes a value as a
Number, it is aligned to the right side of the cell.
211
Format Cells
212
Format cells
213
Smart replacing
Drag the fill handle of one or two cells so that Excel automatically generates the values
you want.
Case 1: You need to generate the numbers from 1 to 100.
Solution 1:
• Type the numbers 1 and 2 in two consecutive cells
• Select both cells
• Drag the fill handle (the small square at the bottom-right corner of the selection) down until you reach row 100 (see on the left)
Case 2: You need to generate the number 1 multiple times.
Solution 2:
• Type the number you want (1 in our example)
• Select the cell
• Drag the fill handle down until you have generated as many values as you want
214
Conditional Formatting – Highlight
Cells Rules
Conditional Formatting enables you to highlight cells with a certain color,
depending on the cell’s value.
Steps
• Select a range of cells
• On the Home tab, in the Styles group, click Conditional Formatting
• Click highlight cells rules, e.g. Greater than
• Enter a value, e.g. 80, and select a formatting style.
If you change a cell value in the range where you have applied Conditional
Formatting, the cell is automatically colored, or not, based on the
condition you have applied.
215
Conditional Formatting -
Top/Bottom Rules
216
Conditional Formatting –
With Formulas
Use a formula to determine which cells to format. Formulas that apply conditional
formatting must evaluate to TRUE or FALSE.
Steps
• Select a range of cells
• On the Home tab, in the Styles group, click Conditional Formatting
• Click New Rule
• Select Use formula to determine which cells to format
• Enter the formula, e.g. =ISODD(A1) (always write the formula for the upper-left cell
in the selected range, because Excel automatically applies the formula to the rest of
the cells)
• Select a formatting style and press OK
217
Freeze a row or column heading in Excel
You can freeze the row and column headings. Therefore, no matter how
much you scroll you will still see these rows and columns. This trick is
rather simple to implement.
Steps
• Click on the View tab and select one of the following options:
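• Freeze Panes: Freezes the rows above and the columns to the left of the selected cell (select cell B2 to freeze both the top row and the first column)
• Freeze Top Row: Freezes the top row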
• Freeze First Column: Freezes first column
• Your first row and/or column will be visible as you scroll.
218
Clear Format
219
Sort Data
220
Filter Data
221
Remove Duplicate Data
Duplicate Data
Duplicate data occurs when the same record appears more than once and
this is not intended. For example:
- We gather data for each of our customers
Alexandra Ath, Mary Tze, Nick Gre, Mary Tze, etc.
Mary Tze appears twice, which is wrong because we
need each customer to appear only once.
222
Week 2
• Intro to Functions
• Count/CountIF
• Logical Operators
223
Remove Duplicate
Data (2)
In Data tab 🡪 Data Tools 🡪 Remove
Duplicates
224
Formulas/Functions
225
Formulas/Functions (2)
• In the parentheses, enter the cells that you want to sum:
1. Select the first cell and drag your cursor until you have selected all the cells you want.
2. Close the parentheses and then press Enter.
3. This is the result!
226
Formulas/Functions – Operator Precedence
1) Parentheses
2) Multiplication / Division
3) Addition / Subtraction
227
All Functions
• To see the functions available in Excel,
with detailed descriptions:
1. Go to the Formulas tab
2. Click Insert Function
3. Select a category
4. Choose a function and
press OK
228
Count and Sum Functions
• Count
• Countif/Countifs
• Sum/sumif/sumifs
229
COUNTIF -
Example
Use the COUNTIF function
in Excel to count cells that
are equal to a value, count
cells that are greater than or
equal to a value, etc.
• The COUNTIF function
below counts the number
of cells that are equal to
20.
230
COUNTIF –
Example 2
The COUNTIF function
below counts the number of
cells that are greater than or
equal to the value in cell C1.
231
COUNTIF – Text Tricks
232
COUNTIFS –
Example
To count cells between two
numbers, use the
COUNTIFS function (with
the letter S at the end).
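The counting logic of the three COUNTIF/COUNTIFS examples can be mirrored outside Excel. The Python sketch below is only an analogy with made-up values: the list `values` stands in for a worksheet range and `threshold` for a cell holding a number (like C1 in Example 2). It is meant to show what each function counts, not how Excel implements it.

```python
values = [10, 20, 20, 35, 7, 20, 42]   # made-up cell values

# COUNTIF with "equal to 20": count the cells whose value is 20.
count_equal = sum(1 for v in values if v == 20)

# COUNTIF with "greater than or equal to the value in another cell".
threshold = 20                          # plays the role of cell C1
count_at_least = sum(1 for v in values if v >= threshold)

# COUNTIFS with two conditions on the same range: between two numbers.
low, high = 10, 35
count_between = sum(1 for v in values if low <= v <= high)

print(count_equal, count_at_least, count_between)
```

COUNTIFS counts a cell only when every condition holds, which is why the last line combines both bounds in a single test.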
233
Logical Operators and Round
234
IF - Examples
235
AND/OR - Examples
OR AND
236
Nested IF
237
Week 3
• Functions (2)
• VLookUp/Index-Match
• Correlation
238
Most important Statistical Functions
• Average
• Averageif
• Median
• Mode
• Stdev
• Min/max
239
Generate Random values
240
VLOOKUP – Exact Match
241
VLOOKUP – Approximate Match
242
VLOOKUP –
First Match
• If the leftmost column of
the table contains
duplicates, the VLOOKUP
function matches the first
instance.
• Tip! the VLOOKUP
function is case-insensitive
so it looks up MIA or Mia
or mia or miA, etc. As a
result, the VLOOKUP
function returns the salary
of Mia Clark (first
instance).
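The "first match" behavior can be illustrated with a small Python analogy. The table below is made up (only the name Mia Clark comes from the slide), and `vlookup_first` is a hypothetical helper, not an Excel or library function: it scans the leftmost column from top to bottom, ignores case, and returns the requested column of the first matching row.

```python
def vlookup_first(key, rows, col_index):
    """Return column col_index (1-based) of the first row whose
    leftmost value equals key, ignoring upper/lower case."""
    for row in rows:
        if str(row[0]).lower() == str(key).lower():
            return row[col_index - 1]
    return None   # Excel would show #N/A when nothing matches

# Made-up table: first name, last name, salary.
employees = [
    ("Mia", "Clark", 61000),
    ("James", "Smith", 52000),
    ("Mia", "Reed", 48000),    # second "Mia" is never reached
]

salary = vlookup_first("MIA", employees, 3)
print(salary)
```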
243
Concatenate text
244
MATCH and INDEX
245
MATCH and INDEX (2)
246
Correlation
The correlation coefficient (a value
between -1 and +1) tells you how
strongly two variables are related to
each other. We can use the CORREL
function or the Analysis ToolPak
add-in (covered in a later slide) in
Excel to find the correlation
coefficient between two variables.
1st way:
Use the CORREL function.
2nd way:
Steps
• On the Data tab, in the Analysis
group, click Data Analysis
• Select Correlation and click OK
• For example, select the range A1:C6
as the Input Range
• Check Labels in first row
• Select cell A8 as the Output Range
247
Week 4
• Charts
• Linear Regression
• File Types
248
Data
Analysis
249
Analysis ToolPak
250
Create a chart
251
Create
Histogram
• First, enter the bin numbers (upper levels) in
the range C4:C8.
• On the Data tab, in the Analysis group, click
Data Analysis.
• Select Histogram and click OK
• Select the range A2:A19
• Click in the Bin Range box and select the
range C4:C8.
• Click the Output Range option button, click
in the Output Range box and select cell F3.
• Check Chart Output.
• Click OK
• Click the legend on the right side and press
Delete.
• Properly label your bins.
• To remove the space between the bars, right
click a bar, click Format Data Series and
change the Gap Width to 0%.
• To add borders, right click a bar, click Format
Data Series, click the Fill & Line icon, click
Border and select a color.
252
Chart
Line charts are used to display trends over time. Use a line chart if you have text labels, dates or a few numeric labels on the horizontal axis. Use a scatter plot (XY chart) to show scientific XY
data.
• On the Insert tab, in the Charts group, click the Line symbol.
• Tip! Only if you have numeric labels, empty cell A1 before you create the line chart. By doing this, Excel does not recognize the numbers in column A as a data series and automatically places
these numbers on the horizontal (category) axis. After creating the chart, you can enter the text Year into cell A1 if you like.
253
Create a Combination
Chart
A combination chart is a chart that combines two
or more chart types in a single chart.
Steps
Select the range A1:C13
On the Insert tab, in the Charts group, click the
Combo symbol.
Click Create Custom Combo Chart
For the Rainy Days series choose Clustered
Column as the chart type
For the Profit series, choose Line as the chart type
Plot the Profit series on the secondary axis
Click OK
254
Add Sparklines!
255
Linear Regression Analysis
Below you can find our data. The big question is: is
there a relation between Quantity Sold (Output) and
Price and Advertising (Input). In other words: can we
predict Quantity Sold if we know Price and
Advertising?
Steps
• On the Data tab, in the Analysis group, click Data Analysis.
• Select Regression and click OK
• Select the Y Range (A1:A8). This is the response variable (also called the
dependent variable).
• Select the X Range(B1:C8). These are the explanatory variables (also
called independent variables). These columns must be adjacent to each
other.
• Check Labels.
• Click in the Output Range box and select cell A11.
• Check Residuals.
• Click OK.
256
Linear Regression Analysis –
Summary Output
• R Square
R Square equals 0.962, which is a very good fit. 96% of the variation in Quantity Sold
is explained by the independent variables Price and Advertising. The closer R Square
is to 1, the better the regression line fits the data.
• Significance F and P-values
To check if your results are reliable (statistically significant), look at Significance F
(0.001). If this value is less than 0.05, you're OK. If Significance F is greater than
0.05, it's probably better to stop using this set of independent variables. Delete a
variable with a high P-value (greater than 0.05) and rerun the regression until
Significance F drops below 0.05.
Most or all P-values should be below 0.05. In our example this is the case
(0.000, 0.001 and 0.005).
• Coefficients
The regression line is: y = Quantity Sold = 8536.214 - 835.722 * Price + 0.592 *
Advertising. In other words, for each unit increase in Price, Quantity Sold decreases
by 835.722 units. For each unit increase in Advertising, Quantity Sold increases
by 0.592 units. This is valuable information.
You can also use these coefficients to do a forecast. For example, if Price equals $4
and Advertising equals $3000, you might be able to achieve a Quantity Sold of
8536.214 - 835.722 * 4 + 0.592 * 3000 ≈ 6970 (Excel's value, computed with unrounded
coefficients; the rounded coefficients shown here give about 6969).
• Residuals
The residuals show you how far away the actual data points are from the predicted
data points (using the equation). For example, the first data point equals 8500. Using
the equation, the predicted data point equals 8536.214 - 835.722 * 2 + 0.592 * 2800 ≈
8523.009 (Excel's value, computed with unrounded coefficients), giving a residual of
8500 - 8523.009 = -23.009.
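The forecast and residual calculations can be reproduced with a few lines of Python. This sketch plugs in the coefficients exactly as printed above, rounded to three decimals, so the results differ slightly from the figures Excel reports, presumably because Excel works with more decimal places. The function name `predict_quantity` is our own.

```python
INTERCEPT = 8536.214
COEF_PRICE = -835.722
COEF_ADVERTISING = 0.592

def predict_quantity(price, advertising):
    """Quantity Sold predicted by the fitted regression line."""
    return INTERCEPT + COEF_PRICE * price + COEF_ADVERTISING * advertising

# Forecast for Price = 4 and Advertising = 3000.
forecast = predict_quantity(4, 3000)

# Residual of the first data point: actual minus predicted.
actual_first = 8500
predicted_first = predict_quantity(2, 2800)
residual_first = actual_first - predicted_first

print(round(forecast, 3), round(predicted_first, 3), round(residual_first, 3))
```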
257
File types - CSV
A CSV (Comma Separated Values) file is a special type of file that you
can create or edit in Excel. Rather than storing information in columns,
CSV files store information separated by commas. When text and
numbers are saved in a CSV file, it's easy to move them from one
program to another.
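To show what "separated by commas" looks like in practice, here is a minimal Python sketch using the standard `csv` module. The product rows are made up, and an in-memory text buffer is used instead of a real file so that the example is self-contained.

```python
import csv
import io

rows = [
    ["Product", "Amount"],     # header row
    ["Apple", 1200],
    ["Banana", 850],
]

# Write the rows as CSV text (one line per row, values separated by commas).
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back; every value comes back as a string.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed)
```

Note that CSV stores plain text only: numbers come back as strings, and formatting, formulas and multiple sheets are not preserved.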
258
File types - Database
259
Week 5
• Pivot Tables
• Pivot Charts
• Tables
260
Pivot Tables
A pivot table allows you to extract the significance from a large, detailed
data set.
Our data set consists of 213 records and 6 fields: Order ID, Product,
Category, Amount, Date and Country.
261
Insert a Pivot Table
262
Drag fields
263
Drag Fields (2)
264
Change Summary
Calculation
265
Two-dimensional Pivot Table
266
Pivot Tables - Slicers
Use slicers in Excel to quickly and easily filter pivot tables.
Steps
• Click any cell inside the pivot table.
• On the Analyze tab, in the Filter group, click Insert Slicer.
• Check Country and click OK.
• Click United States to find out which products we export the most to the United
States.
Let’s insert a second slicer:
• Click any cell inside the pivot table.
• On the Analyze tab, in the Filter group, click Insert Slicer.
• Check Product and click OK.
• Click the Multi-Select button to select multiple products.
267
Pivot Chart
A pivot chart is the visual representation of a pivot table
in Excel. Pivot charts and pivot tables are connected with
each other. Below you can find a two-dimensional pivot
table.
268
Insert & Filter Pivot
Chart
Insert
• Click any cell inside the pivot table.
• On the Analyze tab, in the Tools group, click PivotChart.
• The Insert Chart dialog box appears, Click OK
Filter
• Use the standard filters (triangles next to Product and
Country). For example, use the Country filter to only show
the total amount of each product exported to the United
States.
• Remove the Country filter.
• Because we added the Category field to the Filters area, we
can filter this pivot chart (and pivot table) by Category. For
example, use the Category filter to only show the vegetables
exported to each country.
269
Tables - Insert
a Table
• Click any single cell inside the
data set.
• On the Insert tab, in the
Tables group, click Table.
• Excel automatically selects
the data for you. Check 'My
table has headers' and click
on OK.
270
Tables – Total Row
271
Quick Analysis Tool
272
Power Query
273
Guide for Power Query
https://www.howtoexcel.org/power-query/the-complete-guide-to-power-query/
274
References
275
SQL
Module 2
How to access SQL Server Management Studio
276
Open application – SQL Server Management
Studio
277
Interface
This is the screen that you are going to see.
278
Credentials
• Connect with:
SQL Server Authentication
• Server Type:
Database Engine
• Server name:
curiousiq-78d054-sqlsvr.database.windows.net/
• Username (one of the following):
sqluser1, sqluser2, …, sqluser50
• Password:
password!1
279
SQL
Module 2
280
Table of Contents
1.1 | Introduction to Data
1.2 | Database systems
1.3 | Relational model
1.4 | Normalisation
1.5 | Introduction to SQL
1.6 | Basic Query Structure
1.7 | Database Operations
1.8 | Aggregate Functions
1.9 | Window Functions
1.10 | Subqueries
1.11 | Database modification
281
Week 1
1. Intro
2. Relational Model
282
INTRODUCTION TO
DATA
283
Introduction to Data
284
Introduction to Data
1. Data Ingestion
Process of capturing the raw data. To process and analyse this
data, you must first store the data in a repository of some sort.
The repository could be a file store, a document database, or
even a relational database.
2. Data Transformation/Processing
After data is ingested into a data repository, we may want to
do some cleaning operations and remove any questionable or
invalid data, or perform some aggregations such as calculating
certain KPIs.
285
3. Data Querying
Data Classification
Structured data
Structured data is typically tabular data that is
represented by rows and columns in a
database.
286
Data Classification
Semi-structured data
Semi-structured data is information that
doesn't reside in a relational database but still
has some structure to it.
287
Data Classification
Semi-structured data
>Graph DB – Stores and queries information
about complex relationships.
A graph contains
-nodes (information about objects), and
-edges (information about the relationships
between objects).
288
Data Classification
Unstructured data
Unstructured data is data which is not
organized in any predefined manner.
289
History of Database Systems
▪ 1950s and early 1960s
• Data processing using magnetic tapes for storage
▪ Late 1960s and 1970s
• Hard disks allowed direct access to data
• Network and hierarchical data models in widespread use
• High-performance (for the era) transaction processing
▪ 1980s
• SQL becomes the industry standard
• Parallel and distributed database systems
▪ 1990s
• Large multi-terabyte data warehouses & emergence of Web commerce
▪ 2000s
• Big data storage systems (Google BigTable, Amazon, “NoSQL” systems)
• Big data analysis: beyond SQL
▪ 2010s
290
Latest trend: Big Data
The challenges of big data management result from the
expansion of all three properties.
Volume:
The quantity of generated and stored
data.
Variety:
The type and nature of the data. Big data
does not only draw from text but also
images, audio, video.
Velocity:
The speed at which the data is
291
Big Data
292
DATABASE SYSTEMS
293
Database Systems
294
Database Applications Examples 1/2
▪ Enterprise Information
• Sales: customers, products, purchases
• Accounting: payments, receipts, assets
• Human Resources: Information about employees, salaries, payroll taxes.
▪ Manufacturing: management of production, inventory, orders, supply chain.
▪ Banking and finance
• Customer information, accounts, loans, and banking transactions.
• Credit card transactions
• Finance: sales and purchases of financial instruments (e.g., stocks and bonds);
storing real-time market data
▪ Universities: registration, grades
295
Database Applications Examples 2/2
296
RELATIONAL MODEL
297
Relational Model
Rows
298
Characteristics of Relational Data
Relational data
>Data is stored in a table consisting of rows &
columns
>Row: Each row represents a single instance of
an entity
>Column: Define the properties of the entity
>Each column is defined by a datatype
>All rows have the same number of columns
299
Characteristics of Relational Data
Relational data
> Some columns are used to maintain
relationships between tables
> Model shows the structure of the entities
> Primary key uniquely identifies each row
> Foreign key references the primary key of
another table and is used to maintain
relationships between tables
300
Why don’t we have all the data in one big table?
Week 2
1. Relational Database
2. Basic query structure
302
A Sample Relational Database
303
SQL Query Language
304
Query Processing
305
History
▪ IBM Sequel language developed as part of System R project at the IBM San Jose
Research Laboratory
▪ Renamed Structured Query Language (SQL)
▪ SQL syntax is similar to the English language, which makes
it relatively easy to write, read, and interpret.
▪ Many RDBMSs use SQL (and variations of SQL) to access
the data in tables. Tables can have hundreds, thousands,
sometimes even millions of rows of data. These rows are
often called records.
▪ Tables can also have many columns of data. Columns are
labeled with a descriptive name (say, age for example) and
have a specific data type. Columns are often called fields.
306
Data Definition Language
307
Data Types in SQL
308
Create Table Construct
▪ Example:
create table instructor (
ID char(5),
name varchar(20),
dept_name varchar(20),
salary numeric(8,2))
309
Integrity Constraints in Create Table
310
And a Few More Relation Definitions
311
...and more
▪ create table course (
course_id varchar(8),
title varchar(50),
dept_name varchar(20),
credits numeric(2,0),
primary key (course_id),
foreign key (dept_name) references department);
312
Updates to tables
▪ INSERT
• insert into instructor values ('10211', 'Smith', 'Biology', 66000);
▪ DELETE
• Remove all tuples from the student relation
▪ delete from student
▪ DROP Table
• drop table r
▪ ALTER
• alter table r add A D
▪ where A is the name of the attribute to be added to relation r and D is the data type of A
• alter table r drop A
▪ where A is the name of an attribute of relation r
▪ Dropping of attributes not supported by many databases.
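The table operations above can be tried without a database server. The sketch below runs them from Python against an in-memory SQLite database, which is our substitution for the SQL Server instance used in the course, so small syntax details may differ (for example, SQLite accepts the optional `column` keyword in `alter table ... add column`). The added column `phone` is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# CREATE: the instructor table from the earlier slide.
cur.execute(
    "create table instructor (ID char(5), name varchar(20), "
    "dept_name varchar(20), salary numeric(8,2))"
)

# INSERT one row.
cur.execute("insert into instructor values ('10211', 'Smith', 'Biology', 66000)")

# ALTER: add a new attribute (hypothetical column 'phone').
cur.execute("alter table instructor add column phone varchar(15)")
columns = [row[1] for row in cur.execute("pragma table_info(instructor)")]

# DELETE without a where clause removes every row but keeps the table.
cur.execute("delete from instructor")
remaining = cur.execute("select count(*) from instructor").fetchone()[0]

# DROP removes the table itself.
cur.execute("drop table instructor")
conn.close()
print(columns, remaining)
```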
313
BASIC QUERY
STRUCTURE
314
Basic Query Structure
select column_name
from schema_name.table_name
where condition
315
The SELECT [...] FROM Clause 1/4
▪ The SELECT clause lists the attributes desired in the result of a query
▪ The FROM command is used to specify which table to
select data from.
▪ Example: find the names of all instructors:
select name
from instructor
▪ NOTE: SQL names are case insensitive (i.e., you may use upper- or
lower-case letters.)
• E.g., Name ≡ NAME ≡ name
• Some people use upper case wherever we use bold font.
316
The SELECT [...] FROM Clause 2/4
▪ SQL allows duplicates in relations as well as in query results.
▪ To force the elimination of duplicates, insert the keyword DISTINCT after select.
▪ Find the department names of all instructors, and remove duplicates
select distinct dept_name
from instructor
▪ The keyword all specifies that duplicates should not be removed.
select all dept_name
from instructor
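These queries can be run end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite replaces the course's SQL Server, and the four instructor rows are sample data made up for illustration. It shows that keywords and names are case-insensitive and how `distinct` and `all` differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table instructor (ID char(5), name varchar(20), "
    "dept_name varchar(20), salary numeric(8,2))"
)
conn.executemany(
    "insert into instructor values (?, ?, ?, ?)",
    [   # sample rows, made up for illustration
        ("10101", "Srinivasan", "Comp. Sci.", 65000),
        ("12121", "Wu", "Finance", 90000),
        ("45565", "Katz", "Comp. Sci.", 75000),
        ("22222", "Einstein", "Physics", 95000),
    ],
)

names = [r[0] for r in conn.execute("select name from instructor")]
# Same query in upper case: SQL keywords and names are case-insensitive.
names_upper = [r[0] for r in conn.execute("SELECT NAME FROM INSTRUCTOR")]

all_depts = [r[0] for r in conn.execute("select all dept_name from instructor")]
distinct_depts = [r[0] for r in conn.execute("select distinct dept_name from instructor")]
conn.close()

print(names)
print(all_depts)        # 'Comp. Sci.' appears twice
print(distinct_depts)   # duplicates removed
```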
317
The SELECT [...] FROM Clause 3/4
▪ An asterisk in the select clause denotes “all attributes”
select *
from instructor
▪ An attribute can be a literal with no from clause
select '437'
• Result is a table with one column and a single row with value “437”
• Can give the column a name using:
select '437' as FOO
▪ An attribute can be a literal with from clause
select 'A'
from instructor
• Result is a table with one column and N rows (number of tuples in the instructor
table), each row with value “A”
318
The SELECT [...] FROM Clause 4/4
▪ The select clause can contain arithmetic expressions involving the operations +, -, *,
and /, operating on constants or attributes of tuples.
• The query:
select ID, name, salary/12
from instructor
would return a relation with the ID and name of each instructor and the
value of the attribute salary divided by 12.
• Can rename “salary/12” using the as clause:
select ID, name, salary/12 as monthly_salary
from instructor
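The renamed arithmetic column can be seen in a runnable sketch. SQLite (via Python's `sqlite3`) is used here as a stand-in for SQL Server, and the two rows are made up. One SQLite-specific detail, labeled as such: whole-number salaries are stored as integers, so the example divides by 12.0 to avoid integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table instructor (ID char(5), name varchar(20), salary numeric(8,2))")
conn.executemany(
    "insert into instructor values (?, ?, ?)",
    [("10101", "Srinivasan", 65000), ("12121", "Wu", 90000)],   # made-up rows
)

# 12.0 (not 12) keeps the division from truncating in SQLite.
cursor = conn.execute(
    "select ID, name, salary / 12.0 as monthly_salary from instructor"
)
column_names = [d[0] for d in cursor.description]
monthly = {name: value for _id, name, value in cursor.fetchall()}
conn.close()

print(column_names)
print(monthly)
```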
319
The WHERE Clause
▪ The where clause specifies conditions that the result must satisfy
• Corresponds to the selection predicate of the relational algebra.
▪ To find all instructors in Comp. Sci. dept
select name
from instructor
where dept_name = 'Comp. Sci.'
▪ SQL allows the use of the logical connectives and, or, and not
▪ The operands of the logical connectives can be expressions involving the
comparison operators <, <=, >, >=, =, and <>.
▪ Comparisons can be applied to results of arithmetic expressions
▪ To find all instructors in Comp. Sci. dept with salary > 70000
select name
from instructor
where dept_name = 'Comp. Sci.' and salary > 70000
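The last query above can be tried out as follows. This is a minimal sketch with Python's sqlite3 module; the three instructors and their salaries are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table instructor (name text, dept_name text, salary real)")
conn.executemany(
    "insert into instructor values (?, ?, ?)",
    [
        ("Katz", "Comp. Sci.", 75000),        # matches both conditions
        ("Srinivasan", "Comp. Sci.", 65000),  # right department, salary too low
        ("Mozart", "Music", 80000),           # salary high enough, wrong department
    ],
)

query = """
    select name
    from instructor
    where dept_name = 'Comp. Sci.' and salary > 70000
"""
names = [row[0] for row in conn.execute(query)]
print(names)  # ['Katz']
```

Only rows for which both sides of the and are true survive the filter.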
320
The Rename Operation
▪ SQL allows renaming relations and attributes using the as clause:
old-name as new-name
▪ Find the names of all instructors who have a higher salary than
some instructor in 'Comp. Sci.'.
select distinct T.name
from instructor as T, instructor as S
where T.salary > S.salary and S.dept_name = 'Comp. Sci.'
321
Week 3
1. Joins
2. Set Operations
3. Null Values
322
DATABASE
OPERATIONS
323
SQL JOINS
▪ A JOIN clause is used to combine rows from two or
more tables, based on a related column between them
324
INNER JOIN OPERATION
▪ The INNER JOIN combines records from 2 tables and
returns only the rows that have a match in both tables.
▪ The plain JOIN keyword in SQL is equivalent to INNER
JOIN
325
LEFT JOIN
▪ The SQL LEFT JOIN returns all rows from the left table,
even if there are no matches in the right table.
▪ This means that if the join condition (the ON or USING clause) matches 0 (zero)
records in the right table, the join will still return a row in
the result, but with NULL in each column from the right
table.
326
RIGHT JOIN
▪ The SQL RIGHT JOIN mirrors the LEFT JOIN: it returns all rows from the
right table, even if there are no matches in the left table.
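The join types can be compared side by side on two tiny tables. This is a minimal sketch using Python's sqlite3 module with hypothetical rows. Because older SQLite versions do not support RIGHT JOIN, the right join is shown here as a LEFT JOIN with the two tables swapped, which returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table department (dept_name text, building text);
    create table instructor (name text, dept_name text);
    insert into department values ('Comp. Sci.', 'Taylor'), ('Music', 'Packard');
    -- 'Physics' has no row in department, 'Music' has no instructor
    insert into instructor values ('Katz', 'Comp. Sci.'), ('Gold', 'Physics');
""")

# INNER JOIN: only instructors whose department exists in both tables.
inner = conn.execute("""
    select i.name, d.building
    from instructor as i
    inner join department as d on i.dept_name = d.dept_name
    order by i.name
""").fetchall()

# LEFT JOIN: every instructor, with NULL (None) where no department matches.
left = conn.execute("""
    select i.name, d.building
    from instructor as i
    left join department as d on i.dept_name = d.dept_name
    order by i.name
""").fetchall()

# Equivalent of "instructor RIGHT JOIN department": every department is kept.
right_equivalent = conn.execute("""
    select d.dept_name, i.name
    from department as d
    left join instructor as i on i.dept_name = d.dept_name
    order by d.dept_name
""").fetchall()

print(inner)
print(left)
print(right_equivalent)
```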
327
String Operations
▪ SQL includes a string-matching operator for comparisons on character strings. The operator
like uses patterns that are described using two special characters:
• percent ( % ). The % character matches any substring.
• underscore ( _ ). The _ character matches any single character.
▪ Find the names of all instructors whose name includes the substring “dar”.
select name
from instructor
where name like '%dar%'
▪ Find the names of all instructors whose name starts with “Ma”
select name
from instructor
where name like 'Ma%'
▪ Find the names of all instructors whose name starts with “Mar” followed by exactly one
more character, which can be anything
select name
from instructor
where name like 'Mar_'
▪ Match the string “100%”
like '100\%' escape '\'
where backslash (\) is used as the escape character.
328
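The four like patterns from the String Operations slide can be tested against a handful of names. This is a minimal sketch using Python's sqlite3 module; the names are hypothetical, and note that SQLite's like is case-insensitive for ASCII letters, which other databases may not be.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table instructor (name text)")
conn.executemany(
    "insert into instructor values (?)",
    [("Mark",), ("Mary",), ("Martin",), ("Sardar",), ("100%",), ("1000",)],
)

def names_like(pattern):
    """Return the names matching a LIKE pattern, with backslash as escape character."""
    query = "select name from instructor where name like ? escape '\\' order by name"
    return [row[0] for row in conn.execute(query, (pattern,))]

contains_dar = names_like("%dar%")      # "dar" anywhere in the name
starts_with_ma = names_like("Ma%")      # "Ma" followed by anything
mar_plus_one = names_like("Mar_")       # "Mar" plus exactly one character
literal_percent = names_like("100\\%")  # escaped %, so it matches a real percent sign

print(contains_dar, starts_with_ma, mar_plus_one, literal_percent)
```

Without the escape, the pattern 100% would also match “1000”, because % would act as a wildcard.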
Ordering the Display of Tuples
329
Where Clause Predicates
330
Set Operations
331
Set Operations (Cont.)
332
Null Values
▪ It is possible for tuples to have a null value, denoted by null, for some of their
attributes
▪ null signifies an unknown value or that a value does not exist.
▪ The result of any arithmetic expression involving null is null
• Example: 5 + null returns null
▪ The predicate is null can be used to check for null values.
• Example: Find all instructors whose salary is null.
select name
from instructor
where salary is null
▪ The predicate is not null succeeds if the value on which it is applied is not null.
333
Null Values (Cont.)
▪ SQL treats as unknown the result of any comparison involving a null value (other
than predicates is null and is not null).
• Example: 5 < null, null <> null, and null = null all evaluate to unknown
▪ The predicate in a where clause can involve Boolean operations (and, or, not); thus
the definitions of the Boolean operations need to be extended to deal with the value
unknown.
• and : (true and unknown) = unknown,
(false and unknown) = false,
(unknown and unknown) = unknown
• or: (unknown or true) = true,
(unknown or false) = unknown,
(unknown or unknown) = unknown
• not: (not unknown) = unknown
▪ The result of a where clause predicate is treated as false if it evaluates to unknown
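The three-valued logic above can be observed directly. This is a minimal sketch using Python's sqlite3 module, where SQL null is returned as Python None, true as 1 and false as 0; the two-row table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def evaluate(expression):
    """Return the value SQLite computes for a scalar SQL expression."""
    return conn.execute(f"select {expression}").fetchone()[0]

arithmetic = evaluate("5 + null")     # None: arithmetic with null gives null
comparison = evaluate("null = null")  # None: the comparison is unknown
false_and = evaluate("0 and null")    # 0: (false and unknown) = false
true_or = evaluate("1 or null")       # 1: (unknown or true) = true
true_and = evaluate("1 and null")     # None: (true and unknown) = unknown

# In a where clause, unknown is treated as false, so "= null" never matches.
conn.execute("create table instructor (name text, salary real)")
conn.executemany("insert into instructor values (?, ?)",
                 [("Katz", 75000), ("Gold", None)])
equals_null = conn.execute(
    "select count(*) from instructor where salary = null").fetchone()[0]
is_null = conn.execute(
    "select name from instructor where salary is null").fetchall()

print(arithmetic, comparison, false_and, true_or, true_and, equals_null, is_null)
```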
334
Week 4
1. Avg
2. Min/Max
3. Sum/Count
335
AGGREGATE
FUNCTIONS
336
Aggregate Functions
▪ These functions operate on the multiset of values of a column of a relation, and return
a value
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
337
Aggregate Functions Examples
338
Aggregate Functions – Group By
339
Aggregation (Cont.)
340
Aggregate Functions – Having Clause
▪ Find the names and average salaries of all departments whose average salary is
greater than 42000.
▪ Note: predicates in the having clause are applied after the formation of groups
whereas predicates in the where clause are applied before forming groups.
▪ Also, an aggregate function cannot be used in the where clause, so the having clause is
the place to filter on the result of an aggregate function.
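The query described on this slide can be sketched as follows, using Python's sqlite3 module. The salaries are hypothetical sample values chosen so that one department falls below the 42000 threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table instructor (name text, dept_name text, salary real)")
conn.executemany(
    "insert into instructor values (?, ?, ?)",
    [
        ("Katz", "Comp. Sci.", 75000),
        ("Brandt", "Comp. Sci.", 92000),
        ("Mozart", "Music", 40000),
        ("El Said", "History", 60000),
        ("Califieri", "History", 62000),
    ],
)

# group by forms one group per department; having then filters whole groups.
query = """
    select dept_name, avg(salary) as avg_salary
    from instructor
    group by dept_name
    having avg(salary) > 42000
    order by dept_name
"""
departments = conn.execute(query).fetchall()
print(departments)  # Music (average 40000) is filtered out
```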
341
Week 5
1. Set comparison
2. Exists
3. With
342
SUBQUERIES
343
Nested Subqueries
▪ SQL provides a mechanism for the nesting of subqueries. A subquery is a
select-from-where expression that is nested within another query.
▪ The nesting can be done in the following SQL query
select column_name
from table_name
where condition
as follows:
• From clause: table_name can be replaced by any valid subquery
• Where clause: condition can be replaced with an expression of the form:
B <operation> (subquery)
B is an attribute and <operation> is defined later.
• Select clause:
column_name can be replaced by a subquery that generates a single value.
344
Set Membership
345
Set Membership (Cont.)
▪ Find the total number of (distinct) students who have taken course sections
taught by the instructor with ID 10101
346
Set Comparison – “some” Clause
▪ Same query using > some clause
select name
from instructor
where salary > some (select salary
from instructor
where dept_name = 'Biology');
347
Set Comparison – “all” Clause
▪ Find the names of all instructors whose salary is greater than the salary of all
instructors in the Biology department.
select name
from instructor
where salary > all (select salary
from instructor
where dept_name = 'Biology');
348
Use of “exists” Clause
▪ Yet another way of specifying the query “Find all courses taught in both the Fall 2017 semester and in
the Spring 2018 semester”
select course_id
from section as S
where semester = 'Fall' and year = 2017 and
exists (select *
from section as T
where semester = 'Spring' and year= 2018
and S.course_id = T.course_id);
▪ We can also apply not exists. For example, “Find all courses taught in the Fall 2017
semester and not in the Spring 2018 semester”
select course_id
from section as S
where semester = 'Fall' and year = 2017 and
not exists (select *
from section as T
where semester = 'Spring' and year= 2018
and S.course_id = T.course_id);
▪ Correlation name – variable S in the outer query
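Both variants can be run against the same data to see how exists and not exists differ. This is a minimal sketch using Python's sqlite3 module; the section rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table section (course_id text, semester text, year integer)")
conn.executemany(
    "insert into section values (?, ?, ?)",
    [
        ("CS-101", "Fall", 2017),
        ("CS-101", "Spring", 2018),   # CS-101 runs in both semesters
        ("BIO-101", "Fall", 2017),    # BIO-101 runs only in Fall 2017
        ("MU-199", "Spring", 2018),   # MU-199 runs only in Spring 2018
    ],
)

# The subquery refers to S.course_id from the outer query (a correlated subquery).
template = """
    select course_id
    from section as S
    where semester = 'Fall' and year = 2017 and
          {test} (select *
                  from section as T
                  where semester = 'Spring' and year = 2018
                    and S.course_id = T.course_id)
"""
in_both = [row[0] for row in conn.execute(template.format(test="exists"))]
fall_only = [row[0] for row in conn.execute(template.format(test="not exists"))]

print(in_both, fall_only)
```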
349
Test for Absence of Duplicates
▪ The unique construct tests whether a subquery has any duplicate tuples in its result.
▪ The unique construct evaluates to “true” if a given subquery contains no duplicates.
▪ Find all courses that were offered at most once in 2017
select T.course_id
from course as T
where unique ( select R.course_id
from section as R
where T.course_id= R.course_id
and R.year = 2017);
350
Subqueries in the From Clause
351
Subqueries in the From Clause
352
WITH Clause
▪ The WITH clause provides a way of defining a temporary relation whose definition
is available only to the query in which the with clause occurs.
▪ Find all departments with the maximum budget
with max_budget as
(select max(budget) as value
from department
)
select department.dept_name
from department, max_budget
where department.budget = max_budget.value;
353
Complex Queries using With Clause
▪ Find all departments where the total salary is greater than the average of the total
salary at all departments
with dept_total (dept_name, value) as
(select dept_name, sum(salary)
from instructor
group by dept_name),
dept_total_avg(value) as
(select avg(value)
from dept_total)
select dept_name
from dept_total, dept_total_avg
where dept_total.value > dept_total_avg.value;
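The query above can be executed as written in any database that supports the with clause. This is a minimal sketch using Python's sqlite3 module with hypothetical salaries: the department totals are 167000, 40000 and 122000, so the average total is about 109667.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table instructor (name text, dept_name text, salary real)")
conn.executemany(
    "insert into instructor values (?, ?, ?)",
    [
        ("Katz", "Comp. Sci.", 75000),
        ("Brandt", "Comp. Sci.", 92000),
        ("Mozart", "Music", 40000),
        ("El Said", "History", 60000),
        ("Califieri", "History", 62000),
    ],
)

# Two temporary relations: per-department totals, then the average of those totals.
query = """
    with dept_total (dept_name, value) as
        (select dept_name, sum(salary)
         from instructor
         group by dept_name),
    dept_total_avg (value) as
        (select avg(value)
         from dept_total)
    select dept_name
    from dept_total, dept_total_avg
    where dept_total.value > dept_total_avg.value
    order by dept_name
"""
above_average = [row[0] for row in conn.execute(query)]
print(above_average)
```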
354
Scalar Subquery
355
WINDOW / ANALYTIC
FUNCTIONS
356
Window / Analytic Functions
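▪ Analytic functions calculate an aggregate value based on a
group of rows. Unlike aggregate functions used with group by, they
return a result for every row instead of one row per group.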
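The contrast between a group by aggregate and a window function can be sketched with Python's sqlite3 module. This assumes the bundled SQLite is version 3.25 or newer (the first release with window functions); the orders table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (sales_agent_id integer, order_amount integer)")
conn.executemany(
    "insert into orders values (?, ?)",
    [(1, 100), (1, 250), (2, 80), (1, 50), (2, 120)],
)

# Aggregate with group by: one output row per agent.
per_agent = conn.execute("""
    select sales_agent_id, sum(order_amount)
    from orders
    group by sales_agent_id
    order by sales_agent_id
""").fetchall()

# Window functions: every order row is kept, with the agent total and the
# previous (next larger) amount of the same agent attached to it.
per_order = conn.execute("""
    select sales_agent_id,
           order_amount,
           sum(order_amount) over (partition by sales_agent_id) as agent_total,
           lag(order_amount, 1) over (
               partition by sales_agent_id
               order by order_amount desc
           ) as previous_amount
    from orders
    order by sales_agent_id, order_amount desc
""").fetchall()

print(per_agent)
for row in per_order:
    print(row)
```

The first row of each partition has no previous row, so lag returns null there.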
Query example
SELECT ord_date, sales_agent,
SUM (ord_amount) OVER (
PARTITION BY agent_code
358
LEAD() and LAG() example
Query
SELECT Sales_Agent_ID, Order_Amount, LAG(Order_Amount, 1) OVER (
PARTITION BY Sales_Agent_ID
ORDER BY Order_Amount DESC
) Last_Amount
FROM orders
ORDER BY Sales_Agent_ID
Results
359
CUME_DIST() example
Query
SELECT month(sales_date), Sales_Agent_ID, Order_Amount, CUME_DIST() OVER(
PARTITION BY month(sales_date)
ORDER BY Order_Amount
)
FROM orders
Results
360
DATABASE
MODIFICATION
361
Modification of the Database
362
Deletion
▪ Delete all tuples in the instructor relation for those instructors associated with a
department located in the Watson building.
delete from instructor
where dept_name in (select dept_name
from department
where building = 'Watson');
363
Deletion (Cont.)
▪ Delete all instructors whose salary is less than the average salary of instructors
364
Insertion
▪ or equivalently
insert into course (course_id, title, dept_name, credits)
values ('CS-437', 'Database Systems', 'Comp. Sci.', 4);
365
Insertion (Cont.)
▪ Make each student in the Music department who has earned more than 144 credit
hours an instructor in the Music department with a salary of $18,000.
insert into instructor
select ID, name, dept_name, 18000
from student
where dept_name = 'Music' and total_cred > 144;
▪ The select from where statement is evaluated fully before any of its results are
inserted into the relation.
Otherwise queries like
insert into table1 select * from table1
would cause problems
366
Updates
367
Updates (Cont.)
▪ Increase salaries of instructors whose salary is over $100,000 by 3%, and all others
by 5%
• Write two update statements:
update instructor
set salary = salary * 1.03
where salary > 100000;
update instructor
set salary = salary * 1.05
where salary <= 100000;
• The order is important
• Can be done better using the case statement (next slide)
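Why the order matters can be seen by running the two statements in both orders. This is a minimal sketch using Python's sqlite3 module with two hypothetical instructors; the case version is one possible way to write the single-statement update, not necessarily the one shown on the next slide.

```python
import sqlite3

def run_updates(statements):
    """Apply the statements to a fresh instructor table and return the salaries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table instructor (name text, salary real)")
    conn.executemany("insert into instructor values (?, ?)",
                     [("Einstein", 98000), ("Brandt", 120000)])
    for statement in statements:
        conn.execute(statement)
    rows = conn.execute("select name, salary from instructor").fetchall()
    return {name: round(salary, 2) for name, salary in rows}

raise_high = "update instructor set salary = salary * 1.03 where salary > 100000"
raise_low = "update instructor set salary = salary * 1.05 where salary <= 100000"
raise_case = """
    update instructor
    set salary = case
                     when salary <= 100000 then salary * 1.05
                     else salary * 1.03
                 end
"""

correct_order = run_updates([raise_high, raise_low])
wrong_order = run_updates([raise_low, raise_high])  # Einstein is raised twice
single_case = run_updates([raise_case])

print(correct_order)
print(wrong_order)
print(single_case)
```

In the wrong order, the 5% raise pushes Einstein above $100,000, so the second statement raises the same salary again. The case version evaluates each row once and avoids the problem.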
368
Case Statement for Conditional Updates
369
Updates with Scalar Subqueries
370
NORMALISATION
371
Characteristics of Relational Data
Elements of normalisation
> Primary/Foreign keys define the relationships
> Data is retrieved by joining tables in queries
Benefits of normalisation
> Reduce storage
> Avoid data duplication
> Improve data quality
372
Normalisation
The process of structuring a relational database in accordance
with a series of so-called normal forms in order to reduce data
redundancy and improve data integrity. In simple terms, it is the
process of splitting an entity into more than one table.
374
Normalisation - 1NF
▪ Each table cell should contain a single value.
▪ Each record needs to be unique.
Not normalised
Normalised
Source
375
Normalisation - 2NF
▪ Be in 1NF
▪ Every non candidate-key attribute must depend on the whole
candidate key, not just part of it.
▪ In the example, all of the attributes that are not part of the candidate key
depend on Title, but only Price also depends on Format, so the table is split
to conform to 2NF and remove the duplication.
Not normalised
Normalised
376
Normalisation - 3NF
The Book table still has a transitive functional dependency
({Nationality} is dependent on {Author}, which is dependent on
{Title}). A similar violation exists for genre ({Genre Name} is
dependent on {Genre ID}, which is dependent on {Title}).
Normalised
377
Python
Introduction to Programming
Module 3
378
What is Python?
379
Python Download and Installation
380
Python Basics
Data Types
381
Python Basics
• Strings
• "strings can be between double quotes"
• 'strings can also be between single quotes'
• Numbers
• Integers: 1, 2, 3454, -5
• Floats: 1.3, 5.0, 2343.456
• The 4 basic data structures
• List: ["this", "is", "a", "list", 2, 5, True, 'Hello', 42.0]
• Tuple: (4, 50, 'test')
• Dictionary: {'John': 22, 'Jack': 27}
• Set: {1, 10, 2, 7}
• Boolean: True, False
382
Python operators
https://www.w3schools.com/python/python_operators.asp
383
Python Libraries
With the or operator, as soon as one operand is True, the result is True
• True or False = True
• True or True = True
• False or False = False
• False or True = True
385
Lists
386
Lists
387
List Methods
https://www.w3schools.com/python/python_lists_methods.asp
388
Dictionaries
• cardict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
• Dictionaries are ordered (Python 3.7+), changeable and have key:value
pairs
• They don’t allow duplicates (items with the same key)
389
Dictionaries
390
Dictionary Methods
391
Loops
392
For Loop
We start from 1 and loop until we get to 20 adding 2 in each loop. Pay
attention that i doesn’t take the value 21 as it is outside the range.
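The loop from the slide is not reproduced in the text, so the following is a minimal sketch of a loop that fits the description, assuming it was written with range(1, 20, 2).

```python
odd_numbers = []

# range(start, stop, step): start at 1, add 2 each time, stop before reaching 20.
for i in range(1, 20, 2):
    odd_numbers.append(i)

print(odd_numbers)  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

The last value is 19, because the next candidate, 21, lies beyond the stop value.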
393
While Loop
The above does exactly the same as the for loop but it has a different
structure.
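The while version is also missing from the text. A minimal sketch, assuming the same start, limit and step as described for the for loop:

```python
odd_numbers = []

i = 1              # the counter is initialised by hand
while i < 20:      # the condition is checked before every pass
    odd_numbers.append(i)
    i += 2         # the counter is updated by hand

print(odd_numbers)  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

Forgetting the `i += 2` line would make the condition stay true forever, which is the usual pitfall of while loops.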
394
List Comprehensions
list1 and list2 have the same elements but were created with different
methods
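The two lists from the slide are not shown in the text. A minimal sketch of the idea, with the squares of 0 to 4 as an assumed example:

```python
# Method 1: build the list step by step with a for loop.
list1 = []
for n in range(5):
    list1.append(n * n)

# Method 2: the same list as a list comprehension, in a single expression.
list2 = [n * n for n in range(5)]

print(list1)           # [0, 1, 4, 9, 16]
print(list1 == list2)  # True
```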
395
Conditional Logic
396
Conditional Logic
397
Conditional Logic
Opposite operators
398
Exception Handling
When an error occurs, or exception as we call it, Python will normally stop
and generate an error message.
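A try/except block lets the program handle the exception instead of stopping. This is a minimal sketch; the function name safe_divide is made up for illustration.

```python
def safe_divide(a, b):
    """Return a / b, or None when b is zero."""
    try:
        return a / b
    except ZeroDivisionError as error:
        # Without this handler Python would stop here with an error message.
        print(f"Cannot divide {a} by {b}: {error}")
        return None

ok_result = safe_divide(6, 3)
failed_result = safe_divide(1, 0)
print(ok_result, failed_result)
```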
399
Functions
In the above function we simply add two numbers. The function takes two
arguments (a, b), adds them together and returns their sum
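The function itself is not reproduced in the text. A minimal sketch matching the description, with the name add chosen for illustration:

```python
def add(a, b):
    """Return the sum of the two arguments."""
    return a + b

total = add(2, 3)
print(total)  # 5
```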
400
Lambda Functions
401
Python
Pandas & Matplotlib Libraries
Module 3 & Module 4
402
What is Pandas?
Source: https://pandas.pydata.org/
403
Data Structures: Pandas
404
Series
405
Series
406
Series
A Series is a one-dimensional labeled array
407
Declaring a Series
▪ The Series() constructor takes as an argument an array containing the
values
>>> s = pd.Series([12,-4,7,9])
>>> s
0 12
1 -4
2 7
3 9
dtype: int64
409
Series
Include the index option, assigning an array of strings containing the labels
>>> s = pd.Series([12,-4,7,9],
index=['a','b','c','d'])
>>> s
a 12
b -4
c 7
d 9
dtype: int64
410
Series
▪ You can access the two components of the Series through its attributes
▪ index & values
>>> s.values
array([12, -4, 7, 9], dtype=int64)
>>> s.index
Index([u'a', u'b', u'c', u'd'], dtype='object')
>>> s[['b','c']]
b -4
c 7
dtype: int64
411
Series: Selecting Elements
▪ individual elements, specifying the key
>>> s[2]
7
>>> s[['b','c']]
b -4
c 7
412
NumPy to Series
Define new Series starting with NumPy arrays or existing Series
>>> arr = np.array([1,2,3,4])
>>> s3 = pd.Series(arr)
>>> s3
0 1
1 2
2 3
3 4
dtype: int32
▪ Note: values contained within the NumPy array or the original
Series are not copied
▪ But passed by reference
▪ the object is inserted dynamically within the new Series object
▪ If it changes, for example its internal element varies in value, the change also
appears in the new Series (recent pandas versions may copy the data instead)
413
Filtering Value
414
Operations and Mathematical function
▪ operators (+, -, *, /) or mathematical functions that are applicable to NumPy
arrays can also be applied to Series objects
>>> s / 2
a 6.0
b -2.0
c 3.5
d 4.5
dtype: float64
▪ NumPy math functions: call the function through np and pass the
Series instance as the argument
>>> np.log(s)
a 2.484907
b NaN
c 1.945910
d 2.197225
dtype: float64
415
Operations: Unique
>>> serd = pd.Series([1,0,2,1,2,3],
index=['white','white','blue','green','green','yellow'])
>>> serd
white 1
white 0
blue 2
green 1
green 2
yellow 3
dtype: int64
>>> serd.unique()
array([1, 0, 2, 3], dtype=int64)
416
Operations: Value count
417
Membership
▪ isin() evaluates membership
▪ Boolean values returned, useful during the filtering of data within a
series or in a column of a DataFrame
>>> serd.isin([0,3])
white False
white True
blue False
green False
green False
yellow True
dtype: bool
▪ Where is it stored?
>>> serd[serd.isin([0,3])]
white 0
yellow 3
dtype: int64
418
NaN value
419
…NaN value
>>> s2 = pd.Series([5,-3,np.nan,14])
>>> s2
0 5
1 -3
2 NaN
3 14
► isnull() and notnull() functions are useful to identify the indexes without a value
>>> s2.isnull( )
0 False
1 False
2 True
3 False
dtype: bool
>>> s2.notnull( )
0 True
1 True
2 False
3 True
dtype: bool
420
…NaN value
These functions are useful as conditions inside a filter
>>> s2 = pd.Series([5,-3,np.nan,14])
>>> s2[s2.notnull( )]
0 5
1 -3
3 14
dtype: float64
>>> s2[s2.isnull( )]
2 NaN
dtype: float64
421
Series as Dictionaries
▪ Alternative way to see a Series: think of them as dict (dictionary)
▪ You can create a series from a dict previously defined
422
Operations on Series
423
Dataframes
424
Dataframes
▪ Dataframes are two-dimensional arrays
▪ Columns can have different types (int, float, string)
425
Dataframes
We can define a dataframe by passing a dictionary
426
Dataframes
Another way of defining a dataframe
427
Dataframes: Access
Access the elements of a column
428
Dataframes: Access with loc and iloc
loc: Access elements by their index label
iloc: Access elements by their integer position (only takes integers)
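The difference between the two accessors is easiest to see on a frame whose index labels are not integers. This is a minimal sketch assuming pandas is installed; the frame and its labels are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame(
    {"color": ["blue", "green", "yellow"], "price": [1.2, 1.0, 0.6]},
    index=["a", "b", "c"],  # string labels, so labels and positions differ
)

by_label = frame.loc["b", "price"]   # row labelled "b", column "price"
by_position = frame.iloc[1, 1]       # second row, second column

# Slices behave differently: loc includes the end label, iloc excludes the end position.
label_slice = frame.loc["a":"b"]     # rows "a" and "b"
position_slice = frame.iloc[0:1]     # row at position 0 only

print(by_label, by_position, len(label_slice), len(position_slice))
```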
429
Dataframe
frame = pd.DataFrame(data)
>>> frame
color object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6
3 red paper 0.9
4 white mug 1.7
• A data frame has two index arrays
• 1st (rows): similar function to the index array in Series. Each label is
associated with all the values in the row
• 2nd (columns): contains a series of labels, each associated with a column
430
Dataframe
► create quickly a matrix of values you can use
► np.arange(16).reshape((4,4)) generates 4x4 matrix of increasing numbers from 0-15
>>> frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
... index=['red','blue','yellow','white'],
... columns=['ball','pen','pencil','paper'])
>>> frame3
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
431
Selecting Elements
>>> frame.columns
Index([u'color', u'object', u'price'], dtype='object')
can get the entire set of data: use the values attribute
>>> frame.values
array([['blue', 'ball', 1.2],
['green', 'pen', 1.0],
['yellow', 'pencil', 0.6],
['red', 'paper', 0.9],
['white', 'mug', 1.7]], dtype=object)
432
Assigning values
>>> frame['new'] = [3.0,1.3,2.2,0.8,1.1]
433
Assigning values
>>> ser = pd.Series(np.arange(5))
>>> ser
0 0
1 1
2 2
3 3
4 4
dtype: int32
435
Membership check
>>> frame[frame.isin([1.0,'pen'])]
color object price
0 NaN NaN NaN
1 NaN pen 1
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
436
Deleting a column
to delete an entire column with all its contents, use the del command
437
Filtering
You can filter a dataframe by applying a condition,
e.g. to get all values smaller than a certain number (e.g. 12)
438
Transpose a dataframe
>>> frame2.T
2011 2012 2013
blue 17 27 18
red NaN 22 33
white 13 22 16
439
Reading & Writing Data
440
Reading CSV or Text data
441
read_table
▪ CSV files hold tabular data in which the values in each row are separated by commas.
▪ CSV files are considered text files, so you can also use the read_table() function, specifying the delimiter
>>> pd.read_table('ch05_01.csv',sep=',')
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
442
read_csv
>>> pd.read_csv('ch05_02.csv')
443
Specify column names
444
Writing CSV data to a file
#save all
>>> frame2.to_csv('ch05_07.csv')
ball,pen,pencil,paper
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15
445
Example
The data set will consist of 5 baby names and the number
of births recorded for that year (1880).
446
Create a data frame
• use the pandas library to export this data set into a csv file
• df will be a DataFrame object
• holds the contents of the BabyDataSet
• format similar to a sql table or an excel spreadsheet
447
Save / load a dataframe
• Save to disk
• Indexes are not saved: leftmost column
• Header is not saved
448
Observing/Checking data
449
A Quick Look at the Data
▪ csvframe.head(3)
▪ csvframe.tail(Number)
450
A view of the memory requirements
csvframe.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
white 5 non-null int64
red 5 non-null int64
blue 5 non-null int64
green 5 non-null int64
animal 5 non-null object
dtypes: int64(4), object(1)
memory usage: 550.0 bytes
451
JSON: Read a file
import json
from pandas import json_normalize
file = open('books.json','r')
text = file.read()
text = json.loads(text)
books.json:
[{'writer': 'Mark Ross',
'nationality': 'USA',
'books': [{'title': 'XML Cookbook', 'price': 23.56},
{'title': 'Python Fundamentals', 'price': 50.7},
{'title': 'The NumPy library', 'price': 12.3}]},
>>>data.columns
Index(['books', 'nationality', 'writer'], dtype='object')
454
Basic Analysis Operations
455
Basic Analysis Operations
•loading
•assembling
•merging
•concatenating
•combining
•reshaping (pivoting)
•removing
456
Merge
457
Merging
>>> import numpy as np
>>> import pandas as pd
>>> frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'], 'price':
[12.33,11.44,33.21,13.23,33.62]})
>>> frame1
id price
0 ball 12.33
1 pencil 11.44
2 pen 33.21
3 mug 13.23
4 ashtray 33.62
>>> frame2 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'], 'color':
['white','red','red','black']})
>>> frame2
color id
0 white pencil
1 red pencil
2 red ball
3 black pen
458
Merging…
applying the merge( ) function to the two DataFrame objects
>>> pd.merge(frame1,frame2)
id price color
0 ball 12.33 red
1 pencil 11.44 white
2 pencil 11.44 red
3 pen 33.21 black
If the key column does not have the same name in both tables, then:
pd.merge(frame1, frame3,left_on='id',right_on='id2')
459
Concatenate: Create 1st data frame
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
460
Concatenate: create 2nd Data frame
raw_data = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
461
Create 3rd Data frame
raw_data = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
462
Join 2 data frames
Original data
463
Removing Duplicates
▪ Duplicate rows might be present in a DataFrame for various reasons
▪ First, create a simple DataFrame with some duplicate rows
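The frame from the slide is not shown in the text. A minimal sketch of the idea, assuming pandas is installed and using a hypothetical five-row frame in which the last two rows repeat earlier ones:

```python
import pandas as pd

dframe = pd.DataFrame({
    "color": ["white", "white", "red", "red", "white"],
    "value": [2, 1, 3, 3, 2],
})

# duplicated() flags every row that repeats an earlier row in all columns.
mask = dframe.duplicated()

# drop_duplicates() keeps only the first occurrence of each row.
unique_rows = dframe.drop_duplicates()

print(mask.tolist())  # [False, False, False, True, True]
print(unique_rows)
```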
465
Binning
466
Detecting & Filtering Outliers
>>> randframe = pd.DataFrame(np.random.randn(1000,3))
>>> randframe.describe()
0 1 2
count 1000.000000 1000.000000 1000.000000
mean 0.021609 -0.022926 -0.019577
std 1.045777 0.998493 1.056961
min -2.981600 -2.828229 -3.735046
25% -0.675005 -0.729834 -0.737677
467
Outlier detection
► E.g. treat as outliers the values whose absolute value is greater than three times the standard deviation std()
>>> randframe.std()
0 1.045777
1 0.998493
2 1.056961
dtype: float64
► Filter all the values of the DataFrame by comparing each one with the standard
deviation of its own column. Use the any() function to keep every row that contains at least one outlier
► axis=1 in any(): the check runs across the columns of each row
>>> randframe[(np.abs(randframe) > (3*randframe.std())).any(axis=1)]
0 1 2
69 -0.442411 -1.099404 3.206832
576 -0.154413 -1.108671 3.458472
907 2.296649 1.129156 -3.735046
468
Group By
469
Group by
► Its internal mechanism is a process called SPLIT-APPLY-COMBINE
► Splitting: division of the dataset into groups
► Applying: application of a function on each group
► Combining: combination of all the results obtained from the different groups
► Example:
► Splitting: the data contained within a data structure, such as a Series or a
DataFrame, are divided into several groups, according to a given criterion, which is
often linked to indexes or just certain values in a column.
► In SQL, values contained in this column are reported as keys
► If you are working with two-dimensional objects, e.g. a DataFrame, the grouping
criterion may be applied either to the rows (axis = 0) or to the columns (axis = 1)
► Applying: a function is applied to each group and produces a new, single value specific to that
group
► Combining: collects all the results obtained from each group & combine them to a
new object
470
Example
>>> frame
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
471
Example…
► Calculate the average of the price1 column using the group labels listed in the column color
► E.g. access the price1 column and call the groupby() function with the column color
>>> group = frame['price1'].groupby(frame['color'])
>>> group
<pandas.core.groupby.SeriesGroupBy object at 0x00000000098A2A20>
>>> group.groups
{'white': [0L], 'green': [2L, 4L], 'red': [1L, 3L]}
>>> group.mean()
color
green 2.025
red 2.380
white 5.560
Name: price1, dtype: float64
>>> group.sum()
color
green 4.05
red 4.76
white 5.56
Name: price1, dtype: float64
472
Hierarchical grouping
▪ You have seen how to group the data according to the values of a column as a key choice.
▪ This can be extended to multiple columns, i.e., grouping by multiple keys hierarchically
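Grouping by one key and by two keys can be compared on a small frame. This is a minimal sketch assuming pandas is installed; the column names follow the earlier example, but the rows and prices are hypothetical round numbers.

```python
import pandas as pd

frame = pd.DataFrame({
    "color": ["white", "red", "green", "red", "green"],
    "object": ["pen", "pencil", "pencil", "pencil", "pen"],
    "price": [5.0, 4.0, 1.0, 2.0, 3.0],
})

# One key: split by color, apply mean, combine into a Series indexed by color.
by_color = frame.groupby("color")["price"].mean()

# Two keys: the result has a hierarchical (two-level) index of color and object.
by_color_object = frame.groupby(["color", "object"])["price"].sum()

print(by_color)
print(by_color_object)
```

Both red rows are pencils, so they end up in the same (red, pencil) group, while the two green rows are split into separate groups.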
473
Aggregations
474
Aggregations
Aggregations are functions that can be applied efficiently to dataframes
They can be applied to whole columns or groups that come from groupby
475
Aggregations
476
Aggregations
We can directly create a new column from our existing dataframe using the agg
command
477
Operations: Count vs Sum
▪ In this example we can’t use the count method we saw before to find how many
planets were discovered per method, because the number column can hold a value
greater than one
▪ So we use sum
478
Operations
479
Operations
480
Operations
Slicing
481
Operations
Get the shape of the array
Info
482
Operations
Another piece of information we can get about our data is with dtypes
483
Operations
Describe is a very useful operation that gives a general idea for all the numerical
data
484
Operations
We can also sort by a certain column using the sort_values method
485
Data Cleaning: Replace
Replace is a pandas function that finds a value and replaces it with another
486
Data Cleaning: Replace
It also accepts regular expressions as an input
487
Visualizations
488
Visualizations
- Histogram
- Pie Chart
- Bar Graph
- Boxplot
489
Visualizations: Basic Components in
matplotlib library
490
Visualizations: histograms
491
Visualizations: Pie Charts
492
Visualizations: Bar Plots
493
Visualizations: Box Plot
source
494
Visualizations: Box Plot
source
495
Data Visualization
Module 4
496
Data Visualization
497
Examples
498
Variable Types
499
Four levels of
measurement
500
Basic Plots
Bar Plot
Box Plot
501
Scatter Plot
502
Scatter Plot
Applications
503
Line Plot
504
Histogram
Parts of Histogram:
Title: The title describes the information
included in the histogram
X-axis: The X-axis shows intervals covering the
scale of values that the measurements fall
under. These intervals are also called bins.
Y-axis: The Y-axis shows the number of times
that the values occurred for each interval on
the X-axis
Bars: The height of the bar shows the number
of times that the values occurred within the
interval, while the width of the bar shows the
interval that is covered.
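How bins and bar heights relate can be sketched in plain Python, without a plotting library. The values, the bin width of 2.5 and the range 0 to 10 are all assumptions chosen for illustration.

```python
values = [1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 7.4, 9.9]

low = 0.0        # left edge of the first bin
bin_width = 2.5  # width of every bar on the X-axis
n_bins = 4       # bins: [0, 2.5), [2.5, 5), [5, 7.5), [7.5, 10]

counts = [0] * n_bins
for v in values:
    # Index of the bin the value falls into; the maximum goes into the last bin.
    index = min(int((v - low) // bin_width), n_bins - 1)
    counts[index] += 1

# Text version of the histogram: one row per bin, bar length = frequency.
for k, count in enumerate(counts):
    start = low + k * bin_width
    print(f"{start:4.1f} to {start + bin_width:4.1f} | {'#' * count}")
```

Each value is counted in exactly one bin, so the bar heights add up to the number of measurements.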
505
Histogram
Applications
• Histograms are a very common type of plot for data such as height and weight,
stock prices, waiting time for a customer, etc., which are continuous in nature.
• Histograms are good for showing general distributional features of dataset variables. You can see
roughly where the peaks of the distribution are, whether the distribution is skewed or symmetric,
and if there are any outliers.
506
Bar plot
Bar charts are one of the most common types of graphs and are used to show data associated
with the categorical variables.
Bar graphs are used to compare things between different groups or to track changes over time.
507
Types of Bar Plot
508
Bar Plot vs Histogram
509
Pie Plot
510