Lecture 01-05 Data, Central Tendency PDF

This document explains data, central tendency, and PDF.

Uploaded by

agam taneja

Foundation of Data Science
Harish Sharma
Asst. Professor, AIML, SCSE, MUJ
Introduction
• Data is a collection of facts in raw or unorganized form, such as numbers or characters.
• Data Science is a multidisciplinary field that combines various techniques, methods, and tools to extract knowledge and insights from structured and unstructured data.
• It encompasses a wide range of activities involving data collection, data cleaning, data analysis, data visualization, and the creation of predictive models and algorithms.
Goal
• The primary goal of Data Science is to turn raw data into actionable information that can be used to make informed decisions, solve complex problems, and drive business or research outcomes.
• It involves the application of statistical analysis, machine learning, and computational techniques to gain valuable insights and patterns from large datasets.
Key components
• Data Collection: Gathering and sourcing relevant data from various sources, such as databases, websites, sensors, or APIs.
• Data Cleaning and Preprocessing: Preparing the data by handling missing values, removing noise, and transforming it into a consistent format suitable for analysis.
• Exploratory Data Analysis (EDA): Conducting initial data exploration to understand the distribution, relationships, and patterns in the data.
• Data Visualization: Creating visual
representations of data to help
understand trends, outliers, and
patterns, which aids in communication
and decision-making.
• Statistical Analysis: Applying statistical
techniques to infer meaningful insights
from the data and validate
hypotheses.
• Machine Learning: Utilizing algorithms
and models to build predictive and
descriptive models, such as regression,
classification, clustering, and
recommendation systems.
• Deep Learning: A subfield of machine
learning that focuses on using artificial
neural networks to handle complex tasks
like image recognition, natural language
processing, and speech recognition.
• Big Data: Managing and processing large-
scale datasets that traditional data
processing methods cannot handle
effectively.
• Data Ethics and Privacy: Ensuring that
data is handled responsibly, and individual
privacy is respected in the data-driven
processes.
Data Collection
• Data collection is the process of collecting, measuring and analyzing different types of information using a set of standard validated techniques.
• There are two main methods of data collection:
  • Primary Data Collection
  • Secondary Data Collection
Primary Data
• Primary data refers to data collected from first-hand experience directly from the main source. It refers to data that has never been used in the past. The data gathered by primary data collection methods are generally regarded as the best kind of data in research.
• The methods of collecting primary data can be further divided into quantitative data collection methods (which deal with factors that can be counted) and qualitative data collection methods (which deal with factors that are not necessarily numerical in nature).
• Here are some of the most common primary data collection methods:
  • Interviews
  • Observations
  • Surveys and Questionnaires
  • Focus Groups
Secondary Data
• Secondary data refers to data that has already been collected by someone else. It is much less expensive and easier to collect than primary data.
• Here are some of the most common secondary data collection methods:
  • Internet
  • Government Archives
  • Libraries
Structured/Unstructured Data
• There are several types of data within the world of big data. Here's a guide to structured and unstructured data.
• When it comes to data, files can come in many different forms. There are two main types of data: structured and unstructured.

Structured data vs. unstructured data
Main characteristics
• Structured data: searchable; usually text format; quantitative
• Unstructured data: difficult to search; many data formats; qualitative
Storage
• Structured data: relational databases; data warehouses
• Unstructured data: data lakes; non-relational databases; data warehouses; NoSQL databases; applications
Used for
• Structured data: inventory control; CRM systems; ERP systems
• Unstructured data: presentation or word processing software; tools for viewing or editing media
Examples
• Structured data: dates, phone numbers, bank account numbers, product SKUs
• Unstructured data: emails, songs, videos, photos, reports, presentations
Elements of Structured Data
• There are two basic types of structured data: numeric and categorical.
• Numeric data comes in two forms: continuous, such as wind speed or time duration, and discrete, such as the count of the occurrence of an event.
• Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false.
• Another useful type of categorical data is ordinal data, in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
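As a rough illustration of these data types, the sketch below stores one column of each kind in a pandas DataFrame. The column names and values are invented for the example and are not from the lecture.

```python
import pandas as pd

# Hypothetical records, invented for illustration only.
df = pd.DataFrame({
    "wind_speed": [3.2, 7.5, 5.1],            # numeric, continuous
    "event_count": [0, 2, 1],                 # numeric, discrete
    "screen_type": ["LCD", "LED", "plasma"],  # categorical
    "is_on_sale": [True, False, True],        # binary
    "rating": [5, 3, 4],                      # ordinal (converted below)
})

# Categorical: a fixed set of values with no inherent order.
df["screen_type"] = pd.Categorical(
    df["screen_type"], categories=["plasma", "LCD", "LED"]
)

# Ordinal: categorical values whose order is meaningful.
df["rating"] = pd.Categorical(
    df["rating"], categories=[1, 2, 3, 4, 5], ordered=True
)

print(df.dtypes)
print(df["rating"].min(), df["rating"].max())
```

Marking the rating column as ordered is what allows comparisons such as minimum and maximum; an unordered categorical column does not support them.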
Rectangular vs. Non-rectangular Data
• Data present themselves in many forms, but at a basic level, all data can be categorized into two structures: rectangular data and non-rectangular data.
• Rectangular data are shaped like a rectangle where every value corresponds to some row and column. Most data frames store rectangular data.
• Non-rectangular data are not neatly arranged in rows and columns. Instead, they are often a collection of separate data structures where there is some similarity among members of the same data structure. Usually non-rectangular data are stored in lists.
Data Frames and Indexes
• Traditional database tables have one or more columns designated as an index, essentially a row number.
• In Python, with the pandas library, the basic rectangular data structure is a DataFrame object.
• By default, an automatic integer index is created for a DataFrame based on the order of the rows.
• In pandas, it is also possible to set multilevel/hierarchical indexes to improve the efficiency of certain operations.

import numpy as np
import pandas as pd

# General form: s = pd.Series(data, index=index)
s = pd.Series(np.random.randn(5))    # default integer index 0..4
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])    # explicit labels
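The points about the default integer index and hierarchical indexes can be sketched as follows. The table contents are hypothetical and serve only to show the two kinds of index.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "year": [2022, 2023, 2022, 2023],
    "value": [1.5, 1.7, 0.9, 1.1],
})
print(df.index)    # the automatic integer index based on row order

# A hierarchical (multilevel) index built from two columns.
indexed = df.set_index(["state", "year"])

# Look up one row by its pair of index labels.
print(indexed.loc[("Alaska", 2023), "value"])
```

With the two-level index in place, a row is addressed by its (state, year) pair instead of by its position.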
Nonrectangular Data Structures
• There are other data structures besides rectangular data.
• Time series data records successive measurements of the same variable. It is the raw material for statistical forecasting methods.
• Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures.
• Graph (or network) data structures are used to represent physical, social, and abstract relationships.
Estimates of Location
• Variables with measured or count data might have thousands of distinct values.
• A basic step in exploring your data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).
Summary of Measures
• Central Tendency: Mean, Median, Mode
• Quartile
• Variation: Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
• A measure of central tendency is a descriptive statistic that describes the average, or typical, value of a set of scores.
• There are three common measures of central tendency:
  • the mean
  • the median
  • the mode
The Mean

The mean is:
• the arithmetic average of all the scores, $(\sum X)/N$
• the number, $m$, that makes $\sum (X - m)$ equal to 0
• the number, $m$, that makes $\sum (X - m)^2$ a minimum

The mean of a population is represented by the Greek letter $\mu$; the mean of a sample is represented by $\bar{X}$.
Calculating the Mean for Grouped Data

$\bar{X} = \frac{\sum fX}{N}$

where: $fX$ = a score multiplied by its frequency

Mean affected by extreme values

[Figure: two dot plots. Axis 0 to 10: Mean = 5. Axis extended to 14 by extreme values: Mean = 6.]
When To Use the Mean
• You should use the mean when
  • the data are interval or ratio scaled
    • Many people will use the mean with ordinally scaled data too
  • and the data are not skewed
• The mean is preferred because it is sensitive to every score
  • If you change one score in the data set, the mean will change
Calculating the Mean
• Calculate the mean of the following data:
1 5 4 3 2
• Sum the scores ($\sum X$):
1 + 5 + 4 + 3 + 2 = 15
• Divide the sum ($\sum X = 15$) by the number of scores ($N = 5$):
15 / 5 = 3
• Mean = $\bar{X}$ = 3
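The calculation can be checked with a short Python sketch that uses the same five scores. It also checks two properties of the mean: deviations from the mean sum to zero, and the sum of squared deviations is smallest at the mean. The helper name `sum_sq` is made up for this example.

```python
from statistics import mean

scores = [1, 5, 4, 3, 2]    # the five scores from the worked example
m = mean(scores)            # sum of the scores divided by N
print(m)                    # 3

# Deviations from the mean sum to zero.
print(sum(x - m for x in scores))    # 0

# The sum of squared deviations is smallest when taken about the mean.
def sum_sq(c):
    return sum((x - c) ** 2 for x in scores)

print(sum_sq(m), sum_sq(m - 0.5), sum_sq(m + 0.5))    # 10 11.25 11.25
```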
Calculating the Mean for Grouped Data
• Find the mean of the following data:

SCORE    NUMBER OF STUDENTS
10       3
9        10
8        9
7        8
6        10
5        2

• Mean = [3(10) + 10(9) + 9(8) + 8(7) + 10(6) + 2(5)] / 42 = 318 / 42 = 7.57
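The grouped-data mean for this table can be reproduced in a few lines of Python. This is only a sketch of the arithmetic; the dictionary layout is a choice made for the example.

```python
# Score -> number of students, taken from the table in this example.
freq = {10: 3, 9: 10, 8: 9, 7: 8, 6: 10, 5: 2}

n = sum(freq.values())                                 # N = 42
total = sum(score * f for score, f in freq.items())    # sum of f*X = 318
grouped_mean = total / n

print(n, total, round(grouped_mean, 2))    # 42 318 7.57
```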
The Median
• The median is simply another name for the 50th percentile
• It is the score in the middle; half of the scores are larger than the median and half of the scores are smaller than the median
• Not affected by extreme values

[Figure: two dot plots. Axis 0 to 10: Median = 5. Axis extended to 14 by extreme values: Median = 5.]
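The contrast between the mean and the median can be sketched in Python. The values below are assumed, not taken from the slides: they were chosen only so that the results match the labels in the two figures (mean 5 then 6, median 5 in both).

```python
from statistics import mean, median

# Assumed values, chosen only so the results match the figure labels.
base = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 14]    # largest value replaced by an extreme one

print(mean(base), median(base))                    # 5 5
print(mean(with_outlier), median(with_outlier))    # 6 5
```

Replacing one score with an extreme value moves the mean but leaves the middle score, and therefore the median, unchanged.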
How To Calculate the Median
• Conceptually, it is easy to calculate the median
  • There are many minor problems that can occur; it is best to let a computer do it
• Sort the data from highest to lowest
• Find the score in the middle
  • middle = (n + 1) / 2
  • If n, the number of scores, is even, the median is the average of the middle two scores
Calculating the Median for Grouped Data

$\text{Median} = l + \frac{N/2 - cf}{f} \times h$

• To use this formula, first determine the median class. The median class is the class whose less-than-type cumulative frequency is just more than $N/2$;
• $l$ = lower limit of the median class;
• $cf$ = less-than-type cumulative frequency of the class before the median class;
• $f$ = frequency of the median class;
• $h$ = class width.
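A minimal Python sketch of this formula follows, assuming equal-width classes given by their lower limits. The function name `grouped_median` and the frequency table are hypothetical and not from the lecture.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = l + ((N/2 - cf) / f) * h for equal-width classes."""
    n = sum(freqs)
    cum = 0    # cumulative frequency of the classes before the current one
    for l, f in zip(lower_limits, freqs):
        if cum + f > n / 2:    # first class whose cumulative frequency exceeds N/2
            return l + (n / 2 - cum) / f * width
        cum += f
    raise ValueError("no median class found; check the frequencies")

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (not from the lecture).
lowers = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

print(round(grouped_median(lowers, freqs, 10), 2))    # 25.83
```

Here N = 40, so N/2 = 20; the cumulative frequencies are 5, 13, 25, which makes 20-30 the median class with l = 20, cf = 13, f = 12 and h = 10.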
When To Use the Median
• The median is often used when the distribution of scores is either positively or negatively skewed
  • The few really large scores (positively skewed) or really small scores (negatively skewed) will not overly influence the median
Median Example
• What is the median of the following scores:
10 8 14 15 7 3 3 8 12 10 9
• Sort the scores:
15 14 12 10 10 9 8 8 7 3 3
• Determine the middle score:
middle = (n + 1) / 2 = (11 + 1) / 2 = 6
• Middle score = median = 9
Median Example
• What is the median of the following scores:
24 18 19 42 16 12
• Sort the scores:
42 24 19 18 16 12
• Determine the middle score:
middle = (n + 1) / 2 = (6 + 1) / 2 = 3.5
• Median = average of 3rd and 4th scores:
(19 + 18) / 2 = 18.5
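Both median examples can be verified in Python. The sketch uses the standard library's `statistics.median` and a hand-written version of the sort-and-pick-the-middle rule; `median_by_hand` is a name invented for this example.

```python
from statistics import median

odd = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]    # 11 scores
even = [24, 18, 19, 42, 16, 12]                 # 6 scores

print(median(odd))     # 9
print(median(even))    # 18.5

# The same rule by hand: sort, then take the middle score(s).
def median_by_hand(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

print(median_by_hand(odd), median_by_hand(even))    # 9 18.5
```

The hand-written version sorts from lowest to highest; the direction of the sort does not change which score is in the middle.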
The Mode
• The mode is the score that occurs most frequently in a set of data
• Not Affected by Extreme Values
• There May Not be a Mode
• There May be Several Modes
• Used for Either Numerical or Categorical Data

[Figure: two dot plots. Axis 0 to 14: Mode = 9. Axis 0 to 6: No Mode.]
Calculating the Mode for Grouped Data

$\text{Mode} = l + \left( \frac{f_m - f_1}{2 f_m - f_1 - f_2} \right) \times h$

To use this formula, first determine the modal class. The modal class is the class that has the maximum frequency;
• $l$ = lower limit of the modal class;
• $f_m$ = maximum frequency;
• $f_1$ = frequency of the class before the modal class;
• $f_2$ = frequency of the class after the modal class;
• $h$ = class width.
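A minimal Python sketch of the grouped-data mode formula follows, assuming equal-width classes. The function name `grouped_mode` and the frequencies are hypothetical. Treating a missing neighbouring class as frequency 0 and picking the first class when several tie for the maximum are assumptions of this sketch, not rules stated in the lecture.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = l + ((fm - f1) / (2*fm - f1 - f2)) * h for equal-width classes."""
    i = freqs.index(max(freqs))                       # modal class (first one if tied)
    fm = freqs[i]
    f1 = freqs[i - 1] if i > 0 else 0                 # class before the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0    # class after the modal class
    denom = 2 * fm - f1 - f2
    if denom == 0:
        raise ValueError("formula undefined: both neighbours equal the modal frequency")
    return lower_limits[i] + (fm - f1) / denom * width

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (not from the lecture).
lowers = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

print(round(grouped_mode(lowers, freqs, 10), 2))    # 26.67
```

The modal class is 20-30, with fm = 12, f1 = 8 and f2 = 10, so the mode is 20 + (4 / 6) × 10.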
When To Use the Mode
• The mode is not a very useful measure of central tendency
  • It is insensitive to large changes in the data set
  • That is, two data sets that are very different from each other can have the same mode
• The mode is primarily used with nominally scaled data
  • It is the only measure of central tendency that is appropriate for nominally scaled data
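These points about the mode can be illustrated with Python's standard `statistics` module (`multimode` requires Python 3.8 or later). All data below are made up for the example.

```python
from statistics import mode, multimode

# Nominal (categorical) data: the mode is the only usable measure of centre.
colors = ["red", "blue", "red", "green", "blue", "red"]    # hypothetical data
print(mode(colors))    # red

print(multimode([1, 2, 2, 3, 3]))    # [2, 3], several modes
print(multimode([1, 2, 3]))          # [1, 2, 3], every value ties: no useful mode

# Two very different data sets can share the same mode.
a = [1, 1, 2, 3]
b = [1, 1, 50, 900]
print(mode(a) == mode(b))    # True
```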
Calculate Mean, Median & Mode

Problem 1: Wages (in Rs) paid to workers of an organization are given below. Calculate Mean, Median and Mode.

Wages (C.I.)             40-60   60-80   80-100   100-120   120-140   140-160
No. of workers (freq)    50      80      30       20        50        20

Problem 2: Weekly demand for marine fish (in kg) (x) for 100 families is given below. Calculate Mean, Median and Mode.

X                         1    2    3    4   5   Total
No. of families (freq)    20   50   20   5   5   100
Relation Between
Mean, Median & Mode
• In symmetrical
distributions, the median
and mean are equal
• For normal distributions,
mean = median = mode
• In positively skewed
distributions, the mean is
greater than the median

• In negatively skewed distributions, the mean is smaller than the median
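The relation between the mean and the median can be checked on small data sets. The three lists below are hypothetical, chosen to be symmetric, positively skewed and negatively skewed respectively.

```python
from statistics import mean, median

# Hypothetical data sets, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]         # a few very large scores
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]    # a few very small scores

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, "mean =", mean(data), "median =", median(data))
```

In the positively skewed list the single large score pulls the mean above the median; in the negatively skewed list the single small score pulls it below.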
Variance
• Important Measure of Variation
• Shows Variation About the Mean
• For the Population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
• For the Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

For the Population: use $N$ in the denominator. For the Sample: use $n - 1$ in the denominator.
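The two denominators can be compared directly in Python. The data values are hypothetical; the hand calculation is checked against `pvariance` and `variance` from the standard `statistics` module.

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical values

mu = sum(data) / len(data)               # mean = 5.0
ss = sum((x - mu) ** 2 for x in data)    # sum of squared deviations = 32.0

pop_var = ss / len(data)             # population variance: divide by N
samp_var = ss / (len(data) - 1)      # sample variance: divide by n - 1
print(pop_var, round(samp_var, 3))   # 4.0 4.571

# The standard library applies the same two denominators.
print(pvariance(data), round(variance(data), 3))
```

The sample variance is always the larger of the two for the same data, because it divides the same sum of squares by a smaller number.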
Standard Deviation
• Important Measure of Variation
• Shows Variation About the Mean
• For the Population: $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$
• For the Sample: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
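Since the standard deviation is the square root of the variance, it can be sketched the same way. The data values are hypothetical; `pstdev` and `stdev` are the standard-library counterparts of the population and sample formulas.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical values

mu = sum(data) / len(data)
ss = sum((x - mu) ** 2 for x in data)    # sum of squared deviations

sigma = sqrt(ss / len(data))        # population SD: divide by N
s = sqrt(ss / (len(data) - 1))      # sample SD: divide by n - 1
print(sigma, round(s, 3))           # 2.0 2.138

# Cross-check against the standard library.
print(pstdev(data), round(stdev(data), 3))
```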
Coefficient of Variation
• Measure of Relative Variation
• Always a %
• Shows Variation Relative to Mean
• Used to Compare 2 or More Groups
• Formula (for Sample): $CV = \left( \frac{SD}{\bar{X}} \right) \times 100\%$
Comparing Coefficient of Variation

• Stock A: Average Price last year = $50

• Standard Deviation = $5
• Stock B: Average Price last year = $100
• Standard Deviation = $5

Coefficient of Variation:
Stock A: CV = 10%
Stock B: CV = 5%
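The stock comparison can be reproduced with a small helper. The function name `coefficient_of_variation` is invented for this sketch; the prices and standard deviations are the ones given in the example.

```python
def coefficient_of_variation(sd, mean):
    """CV = (SD / mean) * 100, returned as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean

cv_a = coefficient_of_variation(5, 50)     # Stock A
cv_b = coefficient_of_variation(5, 100)    # Stock B
print(cv_a, cv_b)    # 10.0 5.0
```

Both stocks have the same standard deviation, but Stock A varies twice as much relative to its average price.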
Shape of Curve
• Describes How Data Are Distributed
• Measures of Shape: Symmetric or skewed

[Figure: three distribution curves. Left-Skewed: Mean < Median < Mode. Symmetric: Mean = Median = Mode. Right-Skewed: Mode < Median < Mean.]
Find the Variance, SD & CV
• 1. Five test scores for Calculus I are 95, 83, 92, 81, 75.
• 2. Consider this dataset showing the retirement age of 11 people, in whole years: 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• 3. Here are a bunch of 10 point quizzes from MAT117: 9, 6, 7, 10, 9, 4, 9, 2, 9, 10, 7, 7, 5, 6, 7
• 4. 11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165
Find the Variance, SD & CV

Class Interval    Frequency
2 -< 4            3
4 -< 6            18
6 -< 8            9
8 -< 10           7
Find the Mean, Median, Mode, Variance, SD & CV

Example A: 3, 10, 8, 8, 7, 8, 10, 3, 3, 3

Example B: 2, 5, 1, 5, 1, 2

Example C: 5, 7, 9, 1, 7, 5, 0, 4
• Exam marks for 60 students (marked out of 65)
• mean = 30.3, sd = 14.46

Group Frequency Table

Marks                  Frequency   Percent
0 but less than 10     4           6.7
10 but less than 20    9           15.0
20 but less than 30    17          28.3
30 but less than 40    15          25.0
40 but less than 50    9           15.0
50 but less than 60    5           8.3
60 or over             1           1.7
Total                  60          100.0
