Lecture 01-05 Data, Central Tendency PDF
of Data
Science
Harish Sharma
Asst. Professor, AIML, SCSE, MUJ
• Data is a collection of facts in a raw or unorganized form such
as numbers or characters.
Introduction
• Data Science is a multidisciplinary field that
combines various techniques, methods, and
tools to extract knowledge and insights
from structured and unstructured data.
• Data come in two broad kinds: structured and unstructured.
Elements of Structured Data
• There are two basic types of structured data: numeric and categorical.
• Numeric data comes in two forms: continuous, such as wind speed or time
duration, and discrete, such as the count of the occurrence of an event.
• Categorical data takes only a fixed set of values, such as a type of TV
screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.).
• Binary data is an important special case of categorical data that takes on
only one of two values, such as 0/1, yes/no, or true/false.
• Another useful type of categorical data is ordinal data, in which the
categories are ordered; an example of this is a numerical rating
(1, 2, 3, 4, or 5).
Rectangular vs. Non-rectangular Data
• Data present themselves in many forms, but at a basic level, all data can
be categorized into two structures: rectangular data and non-rectangular
data.
• Rectangular data are shaped like a rectangle where every value corresponds
to some row and column. Most data frames store rectangular data.
• Non-rectangular data are not neatly arranged in rows and columns. Instead,
they are often a collection of separate data structures where there is some
similarity among members of the same data structure. Usually non-rectangular
data are stored in lists.
Data Frames and Indexes
• Traditional database tables have one or more columns designated
as an index, essentially a row number.
• In Python, with the pandas library, the basic rectangular data
structure is a DataFrame object.
• By default, an automatic integer index is created for a
DataFrame based on the order of the rows.
• In pandas, it is also possible to set multilevel/hierarchical indexes
to improve the efficiency of certain operations.
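The index behaviour described above can be sketched in pandas. This is a minimal illustration, not part of the original slides: the column names and values are invented, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "year": [2020, 2021, 2020, 2021],
    "wind_speed": [12.5, 11.0, 20.1, 19.4],
})

# By default pandas creates an integer index based on row order.
default_index = list(df.index)

# A multilevel (hierarchical) index built from two columns.
indexed = df.set_index(["state", "year"]).sort_index()

# Label-based lookup on the two-level index.
value = indexed.loc[("Alaska", 2021), "wind_speed"]

print(default_index)          # integer positions of the rows
print(indexed.index.nlevels)  # number of index levels
print(value)
```

Sorting the index after `set_index` is a common habit because lookups on a sorted multilevel index tend to be faster; whether it matters depends on the size of the data.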
[Diagram: summary measures covered: mean, median, mode, range, variance, standard deviation, coefficient of variation]
Measures of Central Tendency
• A measure of central tendency is a descriptive statistic that describes
the average, or typical, value of a set of scores.
• There are three common measures of central tendency:
  • the mean
  • the median
  • the mode
The Mean
$$\bar{X} = \frac{\sum fX}{N}$$
where: $\sum fX$ = the sum of each score multiplied by its frequency
       $N$ = the total number of scores
[Figure: two dot plots, one on a 0-10 axis with Mean = 5 and one on a 0-14 axis with Mean = 6]
When To Use the Mean
• You should use the mean when
  • the data are interval or ratio scaled
    • Many people will use the mean with ordinally scaled data too
  • and the data are not skewed
• The mean is preferred because it is sensitive to every score
  • If you change one score in the data set, the mean will change
Calculating the Mean
• Calculate the mean of the following data:
1 5 4 3 2
• Sum the scores ($\sum X$):
1 + 5 + 4 + 3 + 2 = 15
• Divide the sum ($\sum X = 15$) by the number of scores (N = 5):
15 / 5 = 3
• Mean = $\bar{X}$ = 3
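The calculation above can be checked with a few lines of Python. This is a small sketch using only the standard library; the variable names are illustrative and not from the slides.

```python
from statistics import mean

scores = [1, 5, 4, 3, 2]  # data from the worked example

total = sum(scores)       # sum of the scores
n = len(scores)           # number of scores
manual_mean = total / n   # divide the sum by the number of scores

# statistics.mean is the standard-library equivalent.
library_mean = mean(scores)

print(total, n, manual_mean, library_mean)
```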
Calculating the Mean for Grouped Data
• Find the mean of the following data:

SCORE   NUMBER OF STUDENTS
10      3
9       10
8       9
7       8
6       10
5       2

• Mean = [3(10) + 10(9) + 9(8) + 8(7) + 10(6) + 2(5)] / 42 = 318 / 42 = 7.57
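The frequency-weighted mean for this table can be reproduced as follows. This is a sketch only; the dictionary layout (score mapped to number of students) is a choice made for this example.

```python
# Score -> number of students, copied from the table.
score_counts = {10: 3, 9: 10, 8: 9, 7: 8, 6: 10, 5: 2}

n_students = sum(score_counts.values())
weighted_sum = sum(score * freq for score, freq in score_counts.items())
grouped_mean = weighted_sum / n_students

print(n_students, weighted_sum, round(grouped_mean, 2))
```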
The Median
[Figure: two dot plots, one on a 0-10 axis and one on a 0-14 axis; Median = 5 in both]
How To Calculate the Median
• Conceptually, it is easy to calculate the median
  • There are many minor problems that can occur; it is best to let a
  computer do it
• Sort the data from highest to lowest
• Find the score in the middle
  • middle = (n + 1) / 2
• If n, the number of scores, is even, the median is the average of the
middle two scores
Calculating the Median for Grouped Data
$$\text{Median} = l + \frac{N/2 - cf}{f} \times h$$
• To use this formula, first determine the median class. The median class is
the class whose less-than-type cumulative frequency is just more than N / 2;
• l = lower limit of the median class;
• cf = less-than-type cumulative frequency of the pre-median class;
• f = frequency of the median class;
• h = class width.
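A minimal sketch of the grouped-median formula follows. The function name `grouped_median` and the class table are hypothetical, invented for illustration, and the sketch assumes all classes share the same width.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = l + ((N/2 - cf) / f) * h for equal-width classes."""
    n = sum(freqs)
    half = n / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        # Median class: first class whose cumulative frequency exceeds N/2.
        if cumulative + f > half:
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("no median class found; check the frequencies")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (not from the slides).
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

result = grouped_median(lower_limits, freqs, width=10)
print(round(result, 2))
```

Here N = 40, so N / 2 = 20; the cumulative frequencies are 5, 13, 25, which makes 20-30 the median class with l = 20, cf = 13, f = 12 and h = 10.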
When To Use the Median
• The median is often used when the distribution of scores is either
positively or negatively skewed
  • The few really large scores (positively skewed) or really small scores
  (negatively skewed) will not overly influence the median
Median Example
• What is the median of the following scores:
10 8 14 15 7 3 3 8 12 10 9
• Sort the scores:
15 14 12 10 10 9 8 8 7 3 3
• Determine the middle score:
middle = (n + 1) / 2 = (11 + 1) / 2 = 6
• Middle score = median = 9
Median Example
• What is the median of the following scores:
24 18 19 42 16 12
• Sort the scores:
42 24 19 18 16 12
• Determine the middle score:
middle = (n + 1) / 2 = (6 + 1) / 2 = 3.5
• Median = average of 3rd and 4th scores:
(19 + 18) / 2 = 18.5
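Both median examples can be checked in Python. The helper `median_by_position` is a hypothetical name for a sketch that follows the sort-and-find-the-middle procedure; `statistics.median` from the standard library is used as a cross-check.

```python
from statistics import median


def median_by_position(scores):
    """Sort highest to lowest, then locate position (n + 1) / 2."""
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    middle = (n + 1) / 2  # 1-based position; ends in .5 when n is even
    if n % 2 == 1:
        return ordered[int(middle) - 1]
    # Even n: average the two scores on either side of the middle position.
    return (ordered[int(middle) - 1] + ordered[int(middle)]) / 2


odd_scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]
even_scores = [24, 18, 19, 42, 16, 12]

odd_result = median_by_position(odd_scores)
even_result = median_by_position(even_scores)

print(odd_result, median(odd_scores))
print(even_result, median(even_scores))
```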
The Mode
The mode is the score that occurs most frequently
in a set of data
Not Affected by Extreme Values
There May Not be a Mode
There May be Several Modes
Used for Either Numerical or Categorical Data
[Figure: two dot plots, one on a 0-14 axis with Mode = 9 and one on a 0-6 axis with no mode]
Calculating the Mode for Grouped Data
$$\text{Mode} = l + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h$$
To use this formula, first determine the modal class. The modal class is the
class that has the maximum frequency;
l = lower limit of the modal class;
$f_m$ = maximum frequency;
$f_1$ = frequency of the pre-modal class;
$f_2$ = frequency of the post-modal class;
h = class width.
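The grouped-mode formula can be sketched the same way. The function name `grouped_mode` and the class table are hypothetical. Treating a missing neighbouring class as frequency 0 is an assumption of this sketch, not something the slides state.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = l + ((fm - f1) / (2*fm - f1 - f2)) * h for equal-width classes."""
    i = freqs.index(max(freqs))  # modal class: first class with maximum frequency
    fm = freqs[i]
    # Assumption: a missing neighbouring class counts as frequency 0.
    f1 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0
    denominator = 2 * fm - f1 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when 2*fm - f1 - f2 is zero")
    return lower_limits[i] + (fm - f1) / denominator * width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (not from the slides).
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

result = grouped_mode(lower_limits, freqs, width=10)
print(round(result, 2))
```

With these frequencies the modal class is 20-30, so l = 20, fm = 12, f1 = 8, f2 = 10 and h = 10.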
When To Use the Mode
• The mode is not a very useful measure of central tendency
  • It is insensitive to large changes in the data set
    • That is, two data sets that are very different from each other can
    have the same mode
• The mode is primarily used with nominally scaled data
  • It is the only measure of central tendency that is appropriate for
  nominally scaled data
Calculate Mean, Median & Mode
No. of workers (freq)   50   80   30   20   50   20
Problem 2: Weekly demand for marine fish (in kg) (x) for 100 families is
given below. Calculate Mean, Median and Mode.

X                        1    2    3    4    5    Total
No. of Families (freq)   20   50   20   5    5    100
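One way to check an answer to Problem 2 is to expand the frequency table into individual observations and use the standard library. This is a sketch; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Weekly demand in kg -> number of families, copied from the table.
demand_counts = {1: 20, 2: 50, 3: 20, 4: 5, 5: 5}

# Expand the table into one observation per family.
observations = [x for x, freq in demand_counts.items() for _ in range(freq)]

problem_mean = mean(observations)
problem_median = median(observations)
problem_mode = mode(observations)

print(len(observations), problem_mean, problem_median, problem_mode)
```

Expanding the table is simple but memory-hungry for large frequencies; the weighted-sum approach avoids building the full list.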
Relation Between
Mean, Median & Mode
• In symmetrical
distributions, the median
and mean are equal
• For normal distributions,
mean = median = mode
• In positively skewed
distributions, the mean is
greater than the median
• In negatively skewed
distributions, the mean is
smaller than the median
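These relations can be illustrated with three small data sets. The values are invented for this sketch and are not from the slides.

```python
from statistics import mean, median

# Hypothetical data sets chosen to show each case.
symmetric = [1, 2, 3, 4, 5]
positively_skewed = [1, 2, 2, 3, 3, 4, 20]        # one very large score
negatively_skewed = [1, 17, 18, 18, 19, 19, 20]   # one very small score

summary = {
    name: (mean(data), median(data))
    for name, data in [
        ("symmetric", symmetric),
        ("positively_skewed", positively_skewed),
        ("negatively_skewed", negatively_skewed),
    ]
}

for name, (m, med) in summary.items():
    print(name, "mean =", m, "median =", med)
```

The single extreme score pulls the mean toward the tail while the median stays put.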
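Variance
• For the Population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
• For the Sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\mu$ is the population mean and $\bar{x}$ is the sample mean.
For the Population: use N in the denominator. For the Sample: use n - 1 in
the denominator.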
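The difference between the two denominators can be seen on a small data set. This sketch reuses the scores 1, 5, 4, 3, 2 from the mean calculation and cross-checks the manual results against `statistics.pvariance` and `statistics.variance`.

```python
from statistics import pvariance, variance

scores = [1, 5, 4, 3, 2]
n = len(scores)
mean_score = sum(scores) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((x - mean_score) ** 2 for x in scores)

population_variance = squared_deviations / n    # divide by N
sample_variance = squared_deviations / (n - 1)  # divide by n - 1

print(squared_deviations, population_variance, sample_variance)
print(pvariance(scores), variance(scores))
```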
Standard Deviation
• For the Sample:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Coefficient of Variation
$$CV = \frac{SD}{\bar{X}} \times 100\%$$
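A short sketch of the coefficient of variation, again using the scores 1, 5, 4, 3, 2. It assumes the sample standard deviation (n - 1 denominator) is the SD intended; the slides do not say which one to use.

```python
from statistics import mean, stdev

scores = [1, 5, 4, 3, 2]

sd = stdev(scores)                     # sample standard deviation (n - 1 denominator)
cv_percent = sd / mean(scores) * 100   # CV = SD / mean * 100%

print(round(sd, 3), round(cv_percent, 1))
```

Because the CV is a ratio, it has no units, which is what allows data sets measured on different scales to be compared.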
Comparing Coefficient of Variation
Coefficient of Variation:
Stock A: CV = 10%
Stock B: CV = 5%
Shape of Curve
Describes How Data Are Distributed
Measures of Shape:
• Symmetric or skewed
Find the Mean, Median, Mode, Variance, SD & CV
Example B: 2, 5, 1, 5, 1, 2
Example C: 5, 7, 9, 1, 7, 5, 0, 4
• Exam marks for 60 students (marked out of 65)

Marks                 Frequency   Percent
0 but less than 10    4           6.7
10 but less than 20   9           15.0
20 but less than 30   17          28.3
30 but less than 40   15          25.0
40 but less than 50   9           15.0
50 but less than 60   5           8.3
60 or over            1           1.7
Total                 60          100.0