4-Demonstrate the Descriptive Statistics for a sample data like mean, median, variance and correlation etc.,-16-12-2024

Uploaded by

yash2004kaushik


Descriptive Statistics for Sample Data: Mean, Median, Variance and Correlation

In the mind of a computer, a data set is any collection of data. It can be anything from an
array to a complete database.

Example of an array:

[99,86,87,88,111,86,103,87,94,78,77,85,86]

Example of a database:

Carname   Color   Age   Speed   AutoPass
BMW       red     5     99      Y
Volvo     black   7     86      Y
VW        gray    8     87      N
VW        white   7     88      Y
Ford      white   2     111     Y
VW        white   17    86      Y
Tesla     red     2     103     Y
BMW       black   9     87      Y
Volvo     gray    4     94      N
Ford      white   11    78      N
Toyota    gray    12    77      N
VW        white   9     85      N
Toyota    blue    6     86      Y

By looking at the array, we can guess that the average value is probably around 80 or 90, and
we are also able to determine the highest value and the lowest value, but what else can we
do?

And by looking at the database we can see that the most popular color is white, and the oldest
car is 17 years, but what if we could predict if a car had an AutoPass, just by looking at the
other values?
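The two observations above can also be checked in code. The following is a minimal sketch, assuming pandas is installed; the variable names (cars, most_common_color, oldest_age) are illustrative and not part of the original text.

```python
import pandas as pd

# The car table from above, rebuilt as a pandas DataFrame.
cars = pd.DataFrame({
    "Carname": ["BMW", "Volvo", "VW", "VW", "Ford", "VW", "Tesla",
                "BMW", "Volvo", "Ford", "Toyota", "VW", "Toyota"],
    "Color": ["red", "black", "gray", "white", "white", "white", "red",
              "black", "gray", "white", "gray", "white", "blue"],
    "Age": [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6],
    "Speed": [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86],
    "AutoPass": ["Y", "Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "Y"],
})

# Count how often each color occurs and take the most frequent one.
most_common_color = cars["Color"].value_counts().idxmax()

# The oldest car is the maximum of the Age column.
oldest_age = cars["Age"].max()

print(most_common_color, oldest_age)
```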

That is what Data mining is for! Analyzing data and predicting the outcome!

In Data mining it is common to work with very large data sets. In this tutorial we will try to
make it as easy as possible to understand the different concepts of data mining, and we will
work with small easy-to-understand data sets.

Data Types

To analyze data, it is important to know what type of data we are dealing with.

We can split the data types into three main categories:

• Numerical
• Categorical
• Ordinal

Numerical data are numbers, and can be split into two numerical categories:

• Discrete Data - numbers that are limited to integers. Example: The number of cars passing by.
• Continuous Data - numbers that can take any value within a range, including fractions. Example: The price of an item, or the size of an item.

Categorical data are values that cannot be measured up against each other. Example: a color
value, or any yes/no values.

Ordinal data are like categorical data, but can be measured up against each other. Example:
school grades where A is better than B and so on.

Knowing the data type of your data source tells you which technique to use when analyzing
it.
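As a rough illustration of the three categories, the sketch below stores one small example of each kind in pandas. The values are made up for illustration, and representing ordinal data with an ordered Categorical is one possible choice, not something the original text prescribes.

```python
import pandas as pd

# Discrete numerical data: whole-number counts (hypothetical values).
cars_passing = pd.Series([3, 7, 2, 5])

# Continuous numerical data: any value within a range (hypothetical prices).
prices = pd.Series([19.99, 4.50, 102.75])

# Categorical data: labels with no natural order.
colors = pd.Series(["red", "white", "gray"], dtype="category")

# Ordinal data: labels with an explicit order, here C < B < A.
grades = pd.Series(pd.Categorical(["B", "A", "C", "B"],
                                  categories=["C", "B", "A"],
                                  ordered=True))

# max() is meaningful for grades only because the categories are ordered.
best_grade = grades.max()
colors_are_ordered = colors.cat.ordered

print(cars_passing.dtype, prices.dtype, best_grade, colors_are_ordered)
```

Asking for the maximum of the unordered colors column would raise an error, which mirrors the point that categorical values cannot be measured up against each other.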

Data mining - Mean Median Mode

What can we learn from looking at a group of numbers?

In Data mining (and in mathematics) there are often three values that interest us:

• Mean - The average value
• Median - The midpoint value
• Mode - The most common value

Example: We have registered the speed of 13 cars:

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

What is the average, the middle, or the most common speed value?

Mean

The mean value is the average value.

To calculate the mean, find the sum of all values, and divide the sum by the number of
values:

(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
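The same calculation can be written in plain Python without any library. This is a small sketch; the names total, count and mean_speed are chosen here for illustration.

```python
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

total = sum(speed)          # sum of all values
count = len(speed)          # number of values
mean_speed = total / count  # the mean is the sum divided by the count

print(round(mean_speed, 2))
```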

Example

Use the NumPy mean() method to find the average speed:

import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.mean(speed)

print(x)

Median

The median value is the value in the middle, after you have sorted all the values:

77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111

It is important that the numbers are sorted before you can find the median.

The NumPy module has a method for this:

Example

Use the NumPy median() method to find the middle value:

import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

If there are two numbers in the middle, divide the sum of those numbers by two.

77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103

(86 + 87) / 2 = 86.5
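Both cases, an odd and an even number of values, can be sketched in a few lines of plain Python. The function name median_manual is hypothetical and is only meant to mirror the rule described above.

```python
def median_manual(values):
    """Return the median: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value.
        return ordered[middle]
    # Even count: average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


speed_13 = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
speed_12 = [99, 86, 87, 88, 86, 103, 87, 94, 78, 77, 85, 86]

print(median_manual(speed_13))
print(median_manual(speed_12))
```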

Example

Using the NumPy module:

import numpy

speed = [99,86,87,88,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

Mode

The mode is the value that appears most often:

99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86

Here the mode is 86, which appears three times.

Example

Use the SciPy mode() method to find the number that appears the most:

from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = stats.mode(speed)

print(x)
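If SciPy is not available, the mode can also be found with the standard library. The sketch below uses collections.Counter as an alternative to the SciPy call; it is not the approach used in the original example.

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Count the occurrences of every value.
counts = Counter(speed)

# most_common(1) returns a list with the single most frequent (value, count) pair.
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```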

Data mining - Standard Deviation

Standard deviation is a number that describes how spread out the values are.

A low standard deviation means that most of the numbers are close to the mean (average)
value.

A high standard deviation means that the values are spread out over a wider range.

Example: This time we have registered the speed of 7 cars:

speed = [86,87,88,86,87,85,86]

The standard deviation is:

0.9

Meaning that most of the values are within the range of 0.9 from the mean value, which is
86.4.

Let us do the same with a selection of numbers with a wider range:

speed = [32,111,138,28,59,77,97]

The standard deviation is:

37.85

Meaning that most of the values are within the range of 37.85 from the mean value, which is
77.4.

As you can see, a higher standard deviation indicates that the values are spread out over a
wider range.
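The claim that most values lie within one standard deviation of the mean can be checked directly. The following sketch counts how many values fall inside that range for both lists; the helper name share_within_one_std is an assumption made for this illustration.

```python
import numpy


def share_within_one_std(values):
    """Return (how many values lie within one std of the mean, total count)."""
    data = numpy.array(values)
    mean = data.mean()
    std = data.std()
    within = numpy.abs(data - mean) <= std
    return int(within.sum()), len(data)


narrow = share_within_one_std([86, 87, 88, 86, 87, 85, 86])
wide = share_within_one_std([32, 111, 138, 28, 59, 77, 97])

print(narrow, wide)
```

In both lists more than half of the values fall inside the range, even though the range itself is far wider for the second list.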

The NumPy module has a method to calculate the standard deviation:

Example

Use the NumPy std() method to find the standard deviation:

import numpy

speed = [86,87,88,86,87,85,86]

x = numpy.std(speed)

print(x)

Example
import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)

Variance

Variance is another number that indicates how spread out the values are.

In fact, if you take the square root of the variance, you get the standard deviation!

Or the other way around, if you multiply the standard deviation by itself, you get the
variance!

To calculate the variance you have to do as follows:

1. Find the mean:

(32+111+138+28+59+77+97) / 7 = 77.4

2. For each value: find the difference from the mean:

32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6

3. For each difference: find the square value:

(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16

4. The variance is the average of these squared differences:

(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 ≈ 1432.25
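The four steps above can be sketched in plain Python. Note that this sketch keeps the mean unrounded, so the last digits can differ slightly from the hand calculation, which uses 77.4.

```python
import math

speed = [32, 111, 138, 28, 59, 77, 97]

# Step 1: find the mean.
mean = sum(speed) / len(speed)

# Step 2: difference of each value from the mean.
differences = [value - mean for value in speed]

# Step 3: square each difference.
squared = [d ** 2 for d in differences]

# Step 4: the variance is the average of the squared differences.
variance = sum(squared) / len(squared)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))
```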

Luckily, NumPy has a method to calculate the variance:

Example

Use the NumPy var() method to find the variance:

import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.var(speed)

print(x)
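One detail worth knowing: by default numpy.var() and numpy.std() divide by the number of values (the population formula), which is what the steps above do. When the data is a sample from a larger population, it is common to divide by n - 1 instead, which NumPy supports through the ddof parameter. The sketch below shows both; the variable names are illustrative.

```python
import numpy

speed = [32, 111, 138, 28, 59, 77, 97]

# Default (ddof=0): divide by n, as in the manual steps.
population_variance = numpy.var(speed)

# ddof=1: divide by n - 1, the usual sample variance.
sample_variance = numpy.var(speed, ddof=1)
sample_std = numpy.std(speed, ddof=1)

print(population_variance, sample_variance, sample_std)
```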

Standard Deviation

As we have learned, the standard deviation is the square root of the variance:

√1432.25 ≈ 37.85

Or, as in the example from before, use NumPy to calculate the standard deviation:

Example

Use the NumPy std() method to find the standard deviation:

import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)

Symbols

Standard Deviation is often represented by the symbol Sigma: σ

Variance is often represented by the symbol Sigma Squared: σ²
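Written with these symbols, the steps used earlier for the variance can be summarized as follows. The notation N (number of values), x_i (the individual values) and μ (the mean) is introduced here for illustration and does not appear in the original text.

```latex
\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}
\]
```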

Data mining - Percentiles

Percentiles are used in statistics to give you a number below which a given percentage of
the values fall.

Example: Let's say we have an array of the ages of all the people that live on a street.

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or
younger.
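One way to see where 43 comes from is to sort the ages and look up the position 75% of the way through the list. The sketch below does this with linear interpolation between neighbouring values, which is assumed here to match NumPy's default behaviour; the function name percentile_linear is hypothetical.

```python
import math


def percentile_linear(values, q):
    """Percentile q (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    # Position of the percentile in the sorted list (may be fractional).
    rank = (q / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

p75 = percentile_linear(ages, 75)
p90 = percentile_linear(ages, 90)

# How many people are at or below the 75th percentile?
at_or_below_p75 = sum(age <= p75 for age in ages)

print(p75, p90, at_or_below_p75, len(ages))
```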

The NumPy module has a method for finding the specified percentile:

Example

Use the NumPy percentile() method to find the percentiles:

import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 75)

print(x)

Example

What is the age that 90% of the people are younger than?

import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 90)

print(x)

Correlation

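Correlation is a statistical measure of the extent to which two variables are linearly
related to each other. The NumPy corrcoef() function returns a correlation matrix; its
off-diagonal entries hold the correlation coefficient between x and y.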

import numpy as np

x = np.array([1,3,5,7,8,9, 10, 15])

y = np.array([10, 20, 30, 40, 50, 60, 70, 80])

print(np.corrcoef(x, y))
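To make the matrix output less opaque, the coefficient can also be computed by hand. The sketch below uses the Pearson formula (the sum of the products of the deviations, divided by the square root of the product of the summed squared deviations) and compares it with the off-diagonal entry of np.corrcoef(). The names dx, dy, r_manual and r_numpy are illustrative.

```python
import numpy as np

x = np.array([1, 3, 5, 7, 8, 9, 10, 15])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Deviations of each value from its own mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson correlation coefficient computed by hand.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient taken from the correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.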

Try

1. Create an array of 6 zeros

2. Create an array of 6 ones
3. Create an array of integers from 1 to 99
4. Create an array of 5 random integers between 0 and 9
5. Create an array of all the odd integers ranging from 1 to 99
6. Create a 2X2 matrix filled with values from 1 to 4
7. Create a 3X3 matrix filled with values from 9 to 17
8. Find the mean of a randomly generated array of size 30
9. Find the variance of a randomly generated array of size 25
10. Find the correlation coefficient of two randomly generated arrays x and y of size 5

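Generate random numbers using NumPy

import numpy as np

# 5 uniform random floats in the range [0, 1)
np.random.rand(5)

# 3x3 matrix of uniform random floats in the range [0, 1)
np.random.rand(3, 3)

# 5 random integers from 1 to 999 (the upper bound is exclusive)
np.random.randint(1, 1000, size=5)

# 10 draws from a normal distribution with mean 0 and standard deviation 1
np.random.normal(0, 1, 10)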
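The calls above return different numbers on every run. When results need to be reproducible, one option is NumPy's Generator interface with a fixed seed. This is a sketch of an alternative to the functions shown above, and the seed value 42 is arbitrary.

```python
import numpy as np

# A generator with a fixed seed produces the same sequence on every run.
rng = np.random.default_rng(seed=42)

uniform_values = rng.random(5)            # 5 floats in [0, 1)
matrix = rng.random((3, 3))               # 3x3 matrix of floats in [0, 1)
integers = rng.integers(1, 1000, size=5)  # 5 integers from 1 to 999
normal_values = rng.normal(0, 1, 10)      # 10 draws, mean 0, std 1

# A second generator with the same seed repeats the first draw exactly.
repeat = np.random.default_rng(seed=42).random(5)

print(uniform_values)
print(repeat)
```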

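import pandas as pd

# Load a CSV file into a DataFrame (the path is specific to the author's environment)
df = pd.read_csv('/content/sales_data_ex1.csv')

df.shape       # (number of rows, number of columns)
df.columns     # column names
df.info()      # column types and non-null counts
df.count()     # non-null values per column
df.head(3)     # first 3 rows
df.tail(3)     # last 3 rows
df.describe()  # summary statistics for the numeric columns

# Build a DataFrame from a dictionary, with custom row labels
df1 = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                    'num_wings': [2, 0, 0, 0],
                    'num_specimen_seen': [10, 2, 1, 8]},
                   index=['falcon', 'dog', 'spider', 'fish'])
df1

# Build a DataFrame from random data; the seed makes the output repeatable
np.random.seed(1)
df2 = pd.DataFrame({"C": np.random.randint(low=1, high=100, size=10),
                    "D": np.random.normal(0.0, 1.0, size=10)})
df2

# Load the Iris data set from scikit-learn into a DataFrame
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
data.target_names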
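The describe() method ties back to the statistics covered earlier. The sketch below applies it to the 13 car speeds from the mean, median and mode section instead of a CSV file, so it runs without external data. One assumption worth stating: pandas computes the standard deviation with n - 1 in the denominator by default, while NumPy divides by n, so the two numbers differ unless ddof is set explicitly.

```python
import numpy as np
import pandas as pd

speed = pd.Series([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86],
                  name="speed")

# count, mean, std, min, 25%, 50% (the median), 75% and max in one call.
summary = speed.describe()

# pandas default: sample standard deviation (ddof=1).
pandas_std = speed.std()

# NumPy default: population standard deviation (ddof=0).
numpy_std = np.std(speed.to_numpy())

print(summary)
print(pandas_std, numpy_std)
```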
