DSBDAL_Assignment no 3
Experiment:3
Title :
Basic Statistics - Measures of Central Tendencies and Variance
Perform the following operations on any open source dataset (e.g. data.csv)
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation)
for a dataset (age, income etc.) with numeric variables grouped by one of the qualitative
(categorical) variables. For example, if your categorical variable is age groups and
quantitative variable is income, then provide summary statistics of income grouped by
the age groups. Create a list that contains a numeric value for each response to the
categorical variable.
2. Write a Python program to display some basic statistical details like percentile, mean,
standard deviation etc. of the species ‘Iris-setosa’, ‘Iris-versicolor’ and
‘Iris-virginica’ of the iris.csv dataset.
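Task 1 can be sketched with the Python standard library alone. The records below, the age-group labels and the helper name summarize_by_group are hypothetical and serve only as an illustration; a real solution would read the rows from the chosen dataset.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical (age_group, income) records; replace with rows from your dataset.
records = [
    ("18-30", 25000), ("18-30", 31000), ("18-30", 28000),
    ("31-45", 48000), ("31-45", 52000), ("31-45", 62000),
    ("46-60", 58000), ("46-60", 75000), ("46-60", 68000),
]


def summarize_by_group(rows):
    """Return {group: summary dict} for an iterable of (group, value) pairs."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)  # one list of numeric values per category
    return {
        group: {
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
            "std": stdev(values),  # sample standard deviation (divides by N-1)
        }
        for group, values in groups.items()
    }


summary = summarize_by_group(records)
for group, stats in summary.items():
    print(group, stats)
```

If pandas is available, the same table is usually obtained with df.groupby("age_group")["income"].agg(["mean", "median", "min", "max", "std"]), where df and the column names are again assumptions.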
Prerequisites:
Fundamentals of the R programming language or Python
Objectives :
To learn how to display summary statistics for each feature available in
the dataset
Theory:
How to Find the Mean, Median, Mode, Range, and Standard Deviation
Simplify comparisons of sets of numbers, especially large sets of numbers, by calculating the
center values using the mean, mode and median. Use the ranges and standard deviations of the
sets to examine the variability of the data.
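As a quick illustration of the three center values, the sketch below uses Python's standard statistics module on the seven-value example set that this section works through. The variable name data is an arbitrary choice.

```python
from statistics import mean, median, mode

data = [20, 24, 25, 36, 25, 22, 23]  # example set used in this section

print(mean(data))    # arithmetic average of the values
print(median(data))  # middle value of the sorted data: 24
print(mode(data))    # most frequent value: 25
```

The median (24) sits slightly below the mean because the single large value 36 pulls the mean upward.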
Calculating Mean
The mean identifies the average value of the set of numbers. For example, consider the
data set containing the values 20, 24, 25, 36, 25, 22, 23.
Formula
To find the mean, use the formula: Mean equals the sum of the numbers in the data set
divided by the number of values in the data set. In mathematical terms: Mean=(sum of all
terms)÷(how many terms or values in the set).
Adding Data Set
Add the numbers in the example data set: 20+24+25+36+25+22+23=175.
Finding Divisor
Divide by the number of data points in the set. This set has seven values so divide by 7.
Finding Mean
Insert the values into the formula to calculate the mean. The mean equals the sum of the
values (175) divided by the number of data points (7). Since 175÷7=25, the mean of this
data set equals 25. Not all mean values will equal a whole number.
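The three steps above (add, count, divide) can be sketched in Python as follows. The variable names are illustrative only.

```python
data = [20, 24, 25, 36, 25, 22, 23]  # example data set from the text

total = sum(data)            # adding the data set: 175
count = len(data)            # finding the divisor: 7
manual_mean = total / count  # finding the mean

print(total, count)  # 175 7
print(manual_mean)   # 25.0
```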
Calculating Range
Range shows the mathematical distance between the lowest and highest values in the data
set. Range measures the variability of the data set. A wide range indicates greater variability
in the data, or perhaps a single outlier far from the rest of the data. Outliers may skew, or
shift, the mean value enough to impact data analysis.
Finding Range
To calculate range, subtract the lowest value from the highest value. Since 36-20=16, the
range equals 16.
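The same subtraction, written as a minimal Python sketch on the example data set:

```python
data = [20, 24, 25, 36, 25, 22, 23]  # example data set from the text

lowest = min(data)    # 20
highest = max(data)   # 36
data_range = highest - lowest

print(data_range)  # 16
```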
Calculating Standard Deviation
Formula
To find the standard deviation, square the difference between each data point and the
mean, add all the squares [∑(x-µ)²], divide that sum by one less than the number of
values (N-1), and finally take the square root of the quotient:
$s = \sqrt{\sum (x - \mu)^2 / (N - 1)}$.
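Mathematically, start by calculating the mean, which is 25 for the example data set. Then
subtract the mean from each value and square the result, giving 25, 1, 0, 121, 0, 9 and 4.
These squared differences add up to 160.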
Divide the sum of the squared differences by one less than the number of data points. The
example data set has 7 values, so N-1 equals 7-1=6. The sum of the squared differences,
160, divided by 6 equals approximately 26.6667.
Standard Deviation
Calculate the standard deviation by taking the square root of the result of the division by
N-1. In the example, the square root of 26.6667 is approximately 5.164, so the standard
deviation equals approximately 5.164.
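The calculation can be checked step by step with a short Python sketch. The variable names are illustrative, and the last line compares the manual result with the standard library's statistics.stdev, which also divides by N-1.

```python
from math import sqrt
from statistics import stdev

data = [20, 24, 25, 36, 25, 22, 23]  # example data set from the text

n = len(data)
mu = sum(data) / n                             # mean: 25.0
squared_diffs = [(x - mu) ** 2 for x in data]  # (x - mu)^2 for each value
sum_sq = sum(squared_diffs)                    # 160.0
sample_variance = sum_sq / (n - 1)             # about 26.6667
sample_std = sqrt(sample_variance)             # about 5.164

print(sum_sq)                  # 160.0
print(round(sample_std, 3))    # 5.164
print(round(stdev(data), 3))   # 5.164
```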
Evaluating Standard Deviation
Standard deviation helps evaluate data. Numbers that fall within one standard deviation
of the mean are typical of the data set. Numbers that fall more than two standard
deviations from the mean are extreme values or outliers. In the example set, the value 36
lies 11 above the mean, which is more than two standard deviations (2 × 5.164 = 10.328),
so 36 is an outlier. Outliers may represent erroneous data or may suggest unforeseen
circumstances and should be carefully considered when interpreting data.
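The rule of thumb above can be sketched in Python as follows. The two-standard-deviation cutoff is the one stated in the text; the list names are illustrative.

```python
from statistics import mean, stdev

data = [20, 24, 25, 36, 25, 22, 23]  # example data set from the text

mu = mean(data)   # 25
s = stdev(data)   # about 5.164 (sample standard deviation)

# Values within one standard deviation of the mean.
within_one = [x for x in data if abs(x - mu) <= s]
# Values more than two standard deviations from the mean.
outliers = [x for x in data if abs(x - mu) > 2 * s]

print(within_one)  # [20, 24, 25, 25, 22, 23]
print(outliers)    # [36]
```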
Input :
Structured Dataset : Iris Dataset
File: iris.csv
Output :
1. Display Dataset Details.
2. Calculate the Min, Max, Mean, Variance and Percentiles for given probabilities, and
display specific values using quantile.
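A minimal sketch of the second output item, using only the standard library and the example data set from the Theory section rather than iris.csv. The "inclusive" method is an assumption made here so that the cut points use linear interpolation between the minimum and maximum; other quantile definitions give slightly different values.

```python
from statistics import mean, quantiles, variance

values = [20, 24, 25, 36, 25, 22, 23]  # example set from the Theory section

summary = {
    "min": min(values),
    "max": max(values),
    "mean": mean(values),
    "variance": variance(values),  # sample variance, divides by N-1
}

# Quartiles: the 25th, 50th and 75th percentiles.
quartiles = quantiles(values, n=4, method="inclusive")
# 99 cut points; index k-1 holds the k-th percentile.
percentiles = quantiles(values, n=100, method="inclusive")
p90 = percentiles[89]

print(summary)
print(quartiles)  # [22.5, 24.0, 25.0]
print(p90)
```

For iris.csv loaded into a pandas DataFrame, describe() and quantile() on the numeric columns report comparable figures, assuming pandas is installed.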
Conclusion:
Hence, we have studied how to load a dataset into a dataframe, compare distributions and
identify outliers.
Questions:
1. What is data visualization?
2. How do you calculate the min, max, range and standard deviation?
3. What is a dataset?