Box Plot Data-Aggregation To Normalization DJB Notes 25-04-2024

A boxplot, or box-and-whisker plot, graphically summarizes the central tendency, dispersion, and skewness of a dataset using boxes representing quartiles and outliers. It divides the dataset into quartiles with a median line and whiskers extending to the minimum and maximum non-outlier values. Boxplots are useful for comparing distributions and identifying outliers.

Uploaded by

xogoj12262
Copyright
© All Rights Reserved

Box Plots

● A boxplot, also known as a box-and-whisker plot, is a graphical
representation of the distribution of a dataset. It summarizes the central
tendency, dispersion, and skewness of the data in a concise manner.
Here's a breakdown of the components of a boxplot:
● Median (Q2): The middle value of the dataset, also known as the second
quartile (Q2). It divides the data into two halves, with 50% of the data
points falling below it and 50% above it.
● Quartiles (Q1 and Q3): Together with the median, the quartiles divide
the sorted dataset into four equal parts. Q1, the first quartile, is the
median of the lower half of the data. Q3, the third quartile, is the
median of the upper half of the data.
● Interquartile Range (IQR): The IQR is the difference between the third
quartile and the first quartile, IQR = Q3 - Q1. It spans the middle 50%
of the data and corresponds to the length of the box.

● Whiskers: The whiskers extend from the edges of the box to the
most extreme data points that still lie within 1.5 times the IQR of
the box, that is, no lower than Q1 - 1.5 × IQR and no higher than
Q3 + 1.5 × IQR. They represent the range of the data, excluding
outliers.
● Outliers: Data points that fall outside the whiskers are considered
outliers and are plotted individually as points. They represent data
values that are significantly different from the rest of the dataset.
● Boxplots are particularly useful for comparing distributions between
different groups or variables and identifying potential outliers. They
provide a visual summary of the data's spread, skewness, and central
tendency in a single plot, making them a valuable tool in exploratory
data analysis and statistical analysis.
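The components described above can be computed by hand. The following is a minimal sketch, assuming NumPy is available; the sample values are made up for illustration. Note that np.percentile interpolates linearly by default, so on some datasets its quartiles can differ slightly from the "median of each half" description.

```python
import numpy as np

# Hypothetical sample; 40 is included as an obvious outlier
values = np.array([2, 4, 5, 7, 8, 9, 10, 12, 40])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these limits are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme points that are still inside the fences
whisker_low, whisker_high = inside.min(), inside.max()

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```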

import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV file into a Pandas DataFrame
# (raw string so the backslashes in the Windows path are not read as escapes)
df = pd.read_csv(r'D:\DATA_SCIENCE\Sample_CSV_files\height_weight.csv')

# Create box plots for height and weight
plt.figure(figsize=(10, 6))

# Box plot for height (left panel)
plt.subplot(1, 2, 1)
plt.boxplot(df['height'])
plt.title('Box Plot of Height')
plt.ylabel('Height')

# Box plot for weight (right panel)
plt.subplot(1, 2, 2)
plt.boxplot(df['weight'])
plt.title('Box Plot of Weight')
plt.ylabel('Weight')

plt.tight_layout()
plt.show()
Data aggregation and grouping with Pandas
● In Pandas, data aggregation and grouping are fundamental
operations for data analysis. Here's a basic overview of how to
perform these tasks:
● Grouping Data: The groupby() method is used to split the data
into groups based on one or more keys, such as a column name.
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
        'Score': [85, 92, 78, 90, 88]}
df = pd.DataFrame(data)

# Grouping by 'Name'
grouped = df.groupby('Name')

# Displaying groups: each iteration yields the group key and its rows
for name, group in grouped:
    print(name)
    print(group)
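Besides iterating, a grouped object can be inspected directly. The sketch below rebuilds the same example DataFrame so that it runs on its own, then uses size() to count rows per group and get_group() to pull out a single group.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
                   'Score': [85, 92, 78, 90, 88]})
grouped = df.groupby('Name')

# Number of rows in each group
sizes = grouped.size()
print(sizes)

# Extract one group as its own DataFrame
alice = grouped.get_group('Alice')
print(alice)
```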
Aggregating Data:

● After grouping, you can apply aggregate functions like
sum, mean, count, etc. using agg().
# Aggregating data
agg_result = grouped.agg({'Score': 'mean'})  # Mean score for each person
print(agg_result)
Applying Multiple Aggregations:

● You can apply multiple aggregation functions
simultaneously.
# Applying multiple aggregations
agg_result = grouped.agg({'Score': ['mean', 'sum', 'count']})
print(agg_result)
Custom Aggregation Functions:

● You can define custom aggregation functions.
# Custom aggregation function: range (max - min) of each group's scores
def custom_agg(series):
    return series.max() - series.min()

agg_result = grouped.agg({'Score': custom_agg})
print(agg_result)
Flattening MultiIndex:

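● You can flatten the MultiIndex columns that result from
applying multiple aggregations.
# Flattening MultiIndex
# (recompute the multi-aggregation result so the columns are a MultiIndex)
agg_result = grouped.agg({'Score': ['mean', 'sum', 'count']})
agg_result.columns = ['_'.join(col).strip() for col in
                      agg_result.columns.values]
print(agg_result)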
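As an alternative to flattening afterwards, pandas also supports named aggregation, which produces flat column names directly. This is a minimal sketch on the same example data; the output column names (score_mean, score_sum, score_count) are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
                   'Score': [85, 92, 78, 90, 88]})

# Named aggregation: output_column=(input_column, aggregation)
summary = df.groupby('Name').agg(
    score_mean=('Score', 'mean'),
    score_sum=('Score', 'sum'),
    score_count=('Score', 'count'),
)
print(summary)
```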
Data transformation and normalization
● In SciPy, you can perform data transformation and
normalization using various functions available in the
scipy.stats module. Here's a basic overview of how to do
data transformation and normalization using SciPy:
● Data Transformation: Transformation involves changing the
scale or distribution of your data. Common transformations
include log transformation, square root transformation, etc.
You can use the scipy.stats.boxcox function for the Box-Cox
power transformation, which requires strictly positive input data.
from scipy import stats

# Example data
data = [1, 2, 3, 4, 5]

# Perform Box-Cox transformation
transformed_data, lambda_value = stats.boxcox(data)

print("Transformed data:", transformed_data)
print("Lambda value:", lambda_value)
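To see what the transformation does, the sketch below recomputes it by hand from the standard Box-Cox definition and then inverts it with scipy.special.inv_boxcox. It assumes NumPy is installed alongside SciPy and reuses the same example data.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Box-Cox requires strictly positive values
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# SciPy estimates lambda by maximum likelihood when it is not supplied
transformed, lam = stats.boxcox(data)

# Box-Cox definition: (x**lam - 1) / lam for lam != 0, log(x) for lam == 0
if lam != 0:
    manual = (data**lam - 1) / lam
else:
    manual = np.log(data)

# The transformation can be undone once lambda is known
recovered = inv_boxcox(transformed, lam)

print("Matches manual formula:", np.allclose(transformed, manual))
print("Recovered original data:", np.allclose(recovered, data))
```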
Data Normalization:

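● Normalization rescales your data to a common scale. Min-max
normalization maps the values to a fixed range, usually between 0
and 1. Z-score normalization (standardization) instead rescales the
values to have mean 0 and standard deviation 1, without bounding
their range. The scipy.stats.zscore function is commonly used for
z-score normalization.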
from scipy import stats

# Example data
data = [1, 2, 3, 4, 5]

# Perform z-score normalization
normalized_data = stats.zscore(data)

print("Normalized data:", normalized_data)
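To make the difference between rescaling to the range 0 to 1 and computing z-scores concrete, here is a minimal sketch on the same example data, assuming NumPy is available. The min-max formula is written out by hand rather than taken from a SciPy function.

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5], dtype=float)

# Min-max normalization: rescales the values to the range [0, 1]
min_max = (data - data.min()) / (data.max() - data.min())

# Z-score by hand, using the population standard deviation (ddof=0),
# which is also the default used by scipy.stats.zscore
z_manual = (data - data.mean()) / data.std()
z_scipy = stats.zscore(data)

print("Min-max:", min_max)
print("Z-score:", z_scipy)
```

The min-max result always lies between 0 and 1, while the z-scores are centred on 0 and can be negative or greater than 1.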
