4-Demonstrate the Descriptive Statistics for a sample data like mean, median, variance and correlation etc.,-16-12-2024
Example of an array:
[99,86,87,88,111,86,103,87,94,78,77,85,86]
Example of a database:
By looking at the array, we can guess that the average value is probably around 80 or 90, and
we are also able to determine the highest value and the lowest value, but what else can we
do?
And by looking at the database we can see that the most popular color is white, and the oldest
car is 17 years, but what if we could predict if a car had an AutoPass, just by looking at the
other values?
That is what Data mining is for! Analyzing data and predicting the outcome!
In Data mining it is common to work with very large data sets. In this tutorial we will try to
make it as easy as possible to understand the different concepts of data mining, and we will
work with small easy-to-understand data sets.
Data Types
To analyze data, it is important to know what type of data we are dealing with.
We can split the data types into three main categories:
Numerical
Categorical
Ordinal
Numerical data are numbers, and can be split into two numerical categories:
Discrete Data
- numbers that are limited to integers. Example: The number of cars passing by.
Continuous Data
- numbers that can take any value within a range, not only whole numbers. Example: The
price of an item, or the size of an item.
Categorical data are values that cannot be measured up against each other. Example: a color
value, or any yes/no values.
Ordinal data are like categorical data, but can be measured up against each other. Example:
school grades where A is better than B and so on.
By knowing the data type of your data source, you will know which technique to use
when analyzing it.
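As a minimal sketch of the distinction between these data types, the snippet below uses made-up sample values (all names and values are hypothetical, chosen only for illustration). It shows that ordinal values can be ranked, whereas plain categorical values cannot.

```python
# Hypothetical sample values, one per data type
cars_passing = 12          # numerical, discrete: a whole-number count
item_price = 19.99         # numerical, continuous: any value within a range
color = "white"            # categorical: no meaningful order
grades = ["B", "A", "C"]   # ordinal: categories with an order

# Ordinal values can be ranked. Here A is better than B, and B better than C.
grade_rank = {"A": 1, "B": 2, "C": 3}
sorted_grades = sorted(grades, key=grade_rank.get)

print(sorted_grades)  # ['A', 'B', 'C']
```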
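In Data mining (and in mathematics) there are often three values that interest us:
Mean - the average value
Median - the midpoint value
Mode - the most common value
Example: consider the following 13 speed values:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
What is the average, the middle, or the most common speed value?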
Mean
To calculate the mean, find the sum of all values, and divide the sum by the number of
values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
Example
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
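For comparison, here is a small sketch of the same calculation written out by hand with Python built-ins, following the formula above (sum of all values divided by the number of values). The variable names are illustrative only.

```python
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

total = sum(speed)    # sum of all values
count = len(speed)    # number of values
mean = total / count  # mean = sum / count

print(round(mean, 2))  # 89.77
```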
Median
The median value is the value in the middle, after you have sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
Here the middle (seventh) value is 87. The values must be sorted before you can find the
median.
Example
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
If there are two numbers in the middle, divide the sum of those numbers by two.
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103
(86 + 87) / 2 = 86.5
Example
import numpy
speed = [99,86,87,88,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
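The two cases (one middle value, or two) can be sketched as a small hand-written function. This is an illustrative helper, not how NumPy implements its median internally.

```python
def median(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)  # the values must be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
even_speed = [99, 86, 87, 88, 86, 103, 87, 94, 78, 77, 85, 86]

print(median(odd_speed))   # 87
print(median(even_speed))  # 86.5
```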
Mode
The Mode value is the value that appears most often:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86
The value 86 appears three times, more than any other value, so the mode is 86.
Example
Use the SciPy mode() method to find the number that appears the most:
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
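If SciPy is not available, the mode can also be found by counting occurrences. The sketch below uses `collections.Counter` from the Python standard library. It is an alternative to the SciPy call, not a reproduction of it.

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

counts = Counter(speed)                    # maps each value to how often it occurs
value, frequency = counts.most_common(1)[0]  # the most frequent value and its count

print(value, frequency)  # 86 3
```

Like the SciPy result, this gives both the mode and the number of times it occurs.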
What is Standard Deviation?
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average)
value.
A high standard deviation means that the values are spread out over a wider range.
Example: consider the following 7 speed values:
speed = [86,87,88,86,87,85,86]
The standard deviation is: 0.9
Meaning that most of the values are within the range of 0.9 from the mean value, which is
86.4.
Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
The standard deviation is: 37.85
Meaning that most of the values are within the range of 37.85 from the mean value, which is
77.4.
As you can see, a higher standard deviation indicates that the values are spread out over a
wider range.
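To make "most of the values are within one standard deviation of the mean" concrete, the following sketch counts how many values in each list actually fall inside that range. The helper name is hypothetical, and it uses the population standard deviation from the standard library.

```python
import statistics

def count_within_one_std(values):
    """Return (number of values within one std of the mean, total count)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    inside = [v for v in values if abs(v - mean) <= std]
    return len(inside), len(values)

print(count_within_one_std([86, 87, 88, 86, 87, 85, 86]))    # (5, 7)
print(count_within_one_std([32, 111, 138, 28, 59, 77, 97]))  # (4, 7)
```

In both lists more than half of the values lie within one standard deviation of the mean, even though the width of that range differs greatly.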
Example
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
Example
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation!
Or the other way around, if you multiply the standard deviation by itself, you get the
variance!
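To calculate the variance, follow these steps:
1. Find the mean:
(32+111+138+28+59+77+97) / 7 = 77.4
2. For each value, find the difference from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6
3. For each difference, find the square value:
(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16
4. The variance is the average of these squared differences:
(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 ≈ 1432.25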
Example
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
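The hand calculation above can be sketched step by step in plain Python. Because the mean is not rounded to 77.4 here, the last digit can differ slightly from the hand calculation.

```python
speed = [32, 111, 138, 28, 59, 77, 97]

mean = sum(speed) / len(speed)               # the mean
differences = [v - mean for v in speed]      # difference of each value from the mean
squared = [d ** 2 for d in differences]      # square of each difference
variance = sum(squared) / len(squared)       # average of the squared differences

print(round(variance, 2))  # 1432.24
```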
Standard Deviation
As we have learned, the formula to find the standard deviation is the square root of the
variance:
√1432.25 = 37.85
Or, as in the example from before, use NumPy to calculate the standard deviation:
Example
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
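One detail worth knowing: by default `numpy.std` divides by the number of values (the population standard deviation), which matches the hand calculation above. When the data is a sample from a larger population, a common choice is to divide by n - 1 instead, which NumPy exposes through the `ddof` parameter. A brief sketch:

```python
import numpy

speed = [32, 111, 138, 28, 59, 77, 97]

population_std = numpy.std(speed)      # divides by n (ddof=0 is the default)
sample_std = numpy.std(speed, ddof=1)  # divides by n - 1

print(round(float(population_std), 2))  # 37.85
print(sample_std > population_std)      # True
```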
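Symbols
Standard deviation is often represented by the symbol sigma: σ
Variance is often represented by the symbol sigma squared: σ²
Percentiles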
A percentile is a value below which a given percentage of the values fall.
Example: Let's say we have an array of the ages of all the people who live on a street.
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or
younger.
The NumPy module has a method for finding the specified percentile:
Example
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 75)
print(x)
Example
What is the age that 90% of the people are younger than?
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 90)
print(x)
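To show where these numbers come from, here is a hand-written sketch of a percentile calculation using linear interpolation between the two nearest sorted values, which is the default behaviour of `numpy.percentile`. The function below is an illustrative helper, not NumPy's actual implementation.

```python
def percentile(values, percent):
    """Percentile by linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    rank = (percent / 100) * (len(ordered) - 1)  # fractional position in the sorted list
    lower = int(rank)                            # index just below (or at) the position
    upper = min(lower + 1, len(ordered) - 1)     # index just above, clamped to the end
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

p75 = percentile(ages, 75)
p90 = percentile(ages, 90)
print(p75, p90)
```

With 21 ages, the 75th percentile lands exactly on the 16th sorted value (43) and the 90th on the 19th sorted value (61), so no interpolation is needed for this data.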
Correlation
Correlation is a statistical measure of the extent to which two variables are linearly
related to each other.
import numpy as np
# x and y must be two equal-length sequences of numbers
print(np.corrcoef(x, y))
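As a worked illustration, the sketch below computes the Pearson correlation coefficient by hand for two small made-up lists. The data and variable names are hypothetical and chosen so that the relationship is exactly linear. `np.corrcoef` returns a 2x2 matrix whose off-diagonal entries hold this same coefficient.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of the deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]    # hypothetical data
score = [3, 5, 7, 9, 11]   # exactly 2 * hours + 1, a perfect linear relationship

print(pearson(hours, score))        # 1.0
print(pearson(hours, score[::-1]))  # -1.0
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.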
Try
Generate random numbers using NumPy
import numpy as np
np.random.rand(5)
np.random.rand(3, 3)
np.random.normal(0, 1, 10)
import pandas as pa
df=pa.read_csv('/content/sales_data_ex1.csv')
df.shape
df.columns
df.info()
df.count()
df.head(3)
df.tail(3)
df.describe()
df1 = pa.DataFrame({'num_legs': [2, 4, 8, 0]})
df1
np.random.seed(1)
size=10),
size=10)
})
df2
from sklearn.datasets import load_iris
data = load_iris()
# pandas was imported above under the alias pa
df = pa.DataFrame(data=data.data, columns=data.feature_names)
data.target_names