EDA_Module_3-1

This document is a course module on Exploratory Data Analytics, focusing on the use of the groupby() function in Pandas for data analysis. It explains the Split-Apply-Combine approach, how to group datasets by multiple columns, and perform various aggregations such as sum, mean, max, and min. Additionally, it covers the creation of pivot tables and cross-tabulations for summarizing and analyzing data.

Uploaded by dimpalrajput511
Copyright © All Rights Reserved

Course: Exploratory Data Analytics
Course Code: 23CSE422

Mrs. Vinutha K,
Assistant Professor
Department of CS & E
Module-3: Grouping and Correlation
Grouping Datasets - Understanding groupby()
What is groupby() in Pandas?
• The groupby() function groups rows based on the values in one or more columns.
• It follows the Split-Apply-Combine approach, illustrated here with a small fruit-sales table that has "Fruit", "Store", and "Quantity" columns:
• Split: Divide the data into groups.
• Using groupby() to split the data by "Fruit" creates two groups: one for "Apple" and one for "Banana".
• Apply: Perform a calculation (sum, mean, etc.) on each group.
• Applying the sum() function to the "Quantity" column in each group gives the total quantity of apples sold and the total quantity of bananas sold.
• Combine: Merge the results into a new Series or DataFrame.
• The results (total apples and total bananas) are combined into a new table indexed by "Fruit".
Code:

import pandas as pd

# Small fruit-sales table: one row per (fruit, store) sale
data = {'Fruit': ['Apple', 'Banana', 'Apple', 'Banana'],
        'Store': ['A', 'B', 'B', 'A'],
        'Quantity': [10, 15, 20, 5]}
df = pd.DataFrame(data)

# Group by 'Fruit' and calculate the sum of 'Quantity'
grouped_data = df.groupby('Fruit')['Quantity'].sum()
print(grouped_data)
# Fruit
# Apple     30
# Banana    20
# Name: Quantity, dtype: int64
Groupby Mechanics
• When we analyze data using pandas, we often need to divide it into groups based on certain criteria.
• "Groupby mechanics" refers to how we organize the dataset into these groups so that we can perform operations on each of them.
• The groupby() method involves two main steps:
1. Splitting the data into groups based on some criteria.
2. Applying a function to each group independently.
• groupby() is most useful on a dataset that has both categorical columns to group by (e.g., Color) and numerical columns to aggregate (e.g., Price).
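The two steps can be sketched on a small table. The toy dataset below is hypothetical (the column names follow the slides, the values are invented for illustration), so treat it as a minimal sketch rather than the module's own example.

```python
import pandas as pd

# Hypothetical toy dataset; values are invented for illustration
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Doll", "Plane"],
    "Color": ["Red", "Silver", "Red", "Blue", "Red", "Blue"],
    "Material": ["Plastic", "Metal", "Plastic", "Metal", "Plastic", "Metal"],
    "Price": [10.0, 25.0, 5.0, 30.0, 15.0, 20.0],
})

# Step 1 (split): iterating over a GroupBy object yields
# (group key, sub-DataFrame) pairs, one per distinct color
groups = toys.groupby("Color")
for color, group in groups:
    print(color, "->", list(group["Toy Name"]))

# Step 2 (apply): run a function on each group independently;
# pandas then combines the per-group results into one Series
avg_price = groups["Price"].mean()
print(avg_price)
```

Iterating over the groups is mainly useful for inspection; in everyday use the aggregation call alone is enough.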
Selecting a subset of columns

• When you use groupby(), you're not limited to just one category.
You can group your data based on multiple columns.
• This allows for more detailed analysis.
• After grouping by multiple columns, you can then perform
calculations (like average, sum, etc.) on the resulting groups.
Simple Example:
Imagine you have a table of information about toys, with columns such as "Toy Name", "Color", "Material", and "Price".
Let's say you want to find the average price of toys, not just by color, but by color and material.
1. Group by Multiple Columns: You would group the data by both "Color" and "Material." This
creates groups like:
• Red & Plastic
• Blue & Metal
• Silver & Metal
2. Calculate the Average Price: Then, you calculate the average "Price" for each of these combined
groups.
This gives you a more refined analysis than just looking at the average price by color or by material
alone.
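A minimal sketch of grouping by two columns, assuming a hypothetical toy table (the values are invented; only the column names come from the slides):

```python
import pandas as pd

# Hypothetical toy dataset; values are invented for illustration
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Doll", "Plane"],
    "Color": ["Red", "Silver", "Red", "Blue", "Red", "Blue"],
    "Material": ["Plastic", "Metal", "Plastic", "Metal", "Plastic", "Metal"],
    "Price": [10.0, 25.0, 5.0, 30.0, 15.0, 20.0],
})

# Passing a list of column names groups by every distinct
# (Color, Material) combination that occurs in the data
avg_by_color_material = toys.groupby(["Color", "Material"])["Price"].mean()
print(avg_by_color_material)

# The result is indexed by a MultiIndex; select one group with a tuple
print(avg_by_color_material[("Red", "Plastic")])
```

Only combinations that actually appear in the data produce a group, which is why this table yields three groups rather than every possible color and material pairing.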
Max and min
Instead of the automobile dataset used in the textbook, we'll use a toy dataset as an example.
Concept: After grouping data, we often want to find the highest (maximum) or lowest (minimum) value within each group.
Method:
• Use groupby() to create groups (e.g., by "Color").
• Apply the max() function to get the maximum value in each group.
• Apply the min() function to get the minimum value in each group.
Example:
• Imagine a toy dataset with columns like "Toy Name", "Color", "Material", and "Price".
• If we group by "Color", then using max() on "Price" gives the price of the most expensive toy of each color.
• If we group by "Color", then using min() on "Price" gives the price of the least expensive toy of each color.
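The method above might look like the following on a hypothetical toy table (invented values). Note that max() and min() return the extreme prices; the idxmax() step at the end is an addition, not from the slides, that recovers which toy has the highest price.

```python
import pandas as pd

# Hypothetical toy dataset; values are invented for illustration
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Doll", "Plane"],
    "Color": ["Red", "Silver", "Red", "Blue", "Red", "Blue"],
    "Material": ["Plastic", "Metal", "Plastic", "Metal", "Plastic", "Metal"],
    "Price": [10.0, 25.0, 5.0, 30.0, 15.0, 20.0],
})

by_color = toys.groupby("Color")["Price"]

# Highest and lowest price within each color group
max_price = by_color.max()
min_price = by_color.min()
print(max_price)
print(min_price)

# idxmax() returns the row label of the maximum in each group,
# which lets us look up the full row of the most expensive toy
most_expensive = toys.loc[by_color.idxmax(), ["Color", "Toy Name", "Price"]]
print(most_expensive)
```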
Mean

Finding the Mean: The mean() method is used to calculate the average of numerical columns
within each group.
Mean of a Specific Group: You can also find the mean of each column for a specific group by
using get_group() and then applying the mean() method.
Counting Records: The count() method can be used to count the number of records in each
group.
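The three operations can be sketched together, again on a hypothetical toy table with invented values. The numeric_only=True argument is an assumption made here so that text columns are skipped when averaging.

```python
import pandas as pd

# Hypothetical toy dataset; values are invented for illustration
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Doll", "Plane"],
    "Color": ["Red", "Silver", "Red", "Blue", "Red", "Blue"],
    "Material": ["Plastic", "Metal", "Plastic", "Metal", "Plastic", "Metal"],
    "Price": [10.0, 25.0, 5.0, 30.0, 15.0, 20.0],
})

by_color = toys.groupby("Color")

# Mean of every numerical column within each group
mean_by_color = by_color.mean(numeric_only=True)
print(mean_by_color)

# Mean for one specific group: pull the group out, then average it
red_mean = by_color.get_group("Red").mean(numeric_only=True)
print(red_mean)

# Number of non-missing records per column in each group
counts = by_color.count()
print(counts)
```

count() skips missing values, so columns with gaps can show different counts within the same group; size() is the alternative when the plain number of rows per group is wanted.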
Data aggregation
• Data aggregation involves applying a mathematical operation to a dataset or a subset
of it.
• In pandas, the aggregate() function (or its shorthand agg()) is used to perform these
aggregations across columns.
Common aggregation functions include:
sum(): Calculates the sum of values.
min(): Finds the minimum value.
max(): Finds the maximum value.
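As a sketch of agg() with the functions listed above, applied first to a whole column and then per group (the toy table is hypothetical and its values are invented):

```python
import pandas as pd

# Hypothetical toy dataset; values are invented for illustration
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Doll", "Plane"],
    "Color": ["Red", "Silver", "Red", "Blue", "Red", "Blue"],
    "Price": [10.0, 25.0, 5.0, 30.0, 15.0, 20.0],
})

# Aggregating the whole column: one value per function
overall = toys["Price"].agg(["sum", "min", "max"])
print(overall)

# Aggregating a subset at a time: one row per color,
# one column per aggregation function
summary = toys.groupby("Color")["Price"].agg(["sum", "min", "max"])
print(summary)
```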
Group-wise operations
1. Grouping:
The groupby operation is like sorting your data into categories. For example, you might want to
group the data by "Fruit" and "Origin." This would create groups like:
• Apple, Local
• Apple, Imported
• Banana, Local
• Banana, Imported
• Orange, Imported
In our fruit example, the grouping columns are "Fruit" and "Origin".
2. Aggregation:
Once you have these groups, you can perform calculations (aggregations) on the data within each group.
Examples of aggregations:
• min: Find the minimum value (e.g., the minimum price of apples from a specific origin).
• max: Find the maximum value (e.g., the maximum quantity of bananas sold from a specific origin).
• mean: Calculate the average value (e.g., the average price of oranges).
• sum: Calculate the total of the values (e.g., the total quantity of apples sold from a specific origin).
• std: Calculate the standard deviation (e.g., how spread out the prices of bananas are).
In our fruit example, we can find the min/max/mean/sum of the price per unit or of the quantity sold.
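The grouping and aggregation steps could be combined as follows. The fruit_df below is a hypothetical stand-in for the module's fruit dataset: the column names and the five (Fruit, Origin) groups follow the slides, but every value is invented.

```python
import pandas as pd

# Hypothetical fruit dataset; values are invented for illustration
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana",
              "Orange", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported",
               "Imported", "Imported"],
    "Price per Unit": [1.0, 1.5, 1.25, 0.5, 0.75, 1.0, 1.0, 1.5],
    "Quantity Sold": [100, 80, 120, 200, 150, 50, 90, 110],
})

# 1. Grouping: one group per (Fruit, Origin) pair present in the data
grouped = fruit_df.groupby(["Fruit", "Origin"])

# 2. Aggregation: several statistics of the price within each group.
# std is NaN for groups that contain a single row.
price_stats = grouped["Price per Unit"].agg(["min", "max", "mean", "std"])
print(price_stats)

# Total quantity sold per group
total_qty = grouped["Quantity Sold"].sum()
print(total_qty)
```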
Renaming grouped aggregation columns

Renaming the aggregated columns makes the resulting DataFrame much more readable
and understandable.
Applying it to the Fruit Example:
Let's say we want to find the:
Maximum price of each fruit from each origin (and name it "Max Price").
Minimum price of each fruit from each origin (and name it "Min Price").
Average price of each fruit from each origin (and name it "Average Price").
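One way to get these three renamed columns is to aggregate and then call rename(); pandas named aggregation is an alternative when identifier-style names are acceptable. Both are shown as a sketch on a hypothetical fruit_df with invented values; the names max_price, min_price, and average_price are illustrative.

```python
import pandas as pd

# Hypothetical fruit dataset; values are invented for illustration
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana",
              "Orange", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported",
               "Imported", "Imported"],
    "Price per Unit": [1.0, 1.5, 1.25, 0.5, 0.75, 1.0, 1.0, 1.5],
})

# Option 1: aggregate, then rename the generated columns
price_summary = (
    fruit_df.groupby(["Fruit", "Origin"])["Price per Unit"]
    .agg(["max", "min", "mean"])
    .rename(columns={"max": "Max Price",
                     "min": "Min Price",
                     "mean": "Average Price"})
)
print(price_summary)

# Option 2: named aggregation, output_name=(source column, function)
named = fruit_df.groupby(["Fruit", "Origin"]).agg(
    max_price=("Price per Unit", "max"),
    min_price=("Price per Unit", "min"),
    average_price=("Price per Unit", "mean"),
)
print(named)
```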
Group-wise transformations
Aggregation vs. Transformation:
Aggregation (like we did before with agg()) reduces the data within each group to a single
summary value (e.g., the average price).
Transformation, on the other hand, preserves the original size of the data. It applies a function within each group and returns a result with the same number of rows as the original data, aligned to the original index.
Purpose:
• Transformations are used when you want to modify or calculate values based on group information without changing the shape of your DataFrame.
• A common use case is to standardize data within groups or to calculate group-specific statistics and add them as new columns.
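Both use cases can be sketched with transform(). The fruit_df here is hypothetical (invented values), and the new column names "Group Avg Price" and "Price Z-Score" are illustrative choices, not names from the module.

```python
import pandas as pd

# Hypothetical fruit dataset; values are invented for illustration
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana",
              "Orange", "Orange"],
    "Price per Unit": [1.0, 1.5, 1.25, 0.5, 0.75, 1.0, 1.0, 1.5],
})

prices = fruit_df.groupby("Fruit")["Price per Unit"]

# Group statistic added as a new column: every row receives
# the mean price of its own fruit, so the row count is unchanged
fruit_df["Group Avg Price"] = prices.transform("mean")

# Standardizing within groups: subtract the group mean and divide
# by the group (sample) standard deviation
fruit_df["Price Z-Score"] = prices.transform(lambda s: (s - s.mean()) / s.std())

print(fruit_df)
```

By contrast, prices.mean() would return only three rows, one per fruit, which is the aggregation behaviour described earlier.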
Pivot tables and cross-tabulations

• Pivot tables are a powerful tool for summarizing and analyzing data.
• They allow you to reshape data into a table format, where rows and columns
represent different categories.
• They are useful for aggregating data based on multiple dimensions.
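To make the reshaping concrete, here is a minimal sketch on a hypothetical fruit_df (column names follow the slides, values are invented). It places "Fruit" on the rows and "Origin" on the columns, and compares the result with the equivalent groupby() followed by unstack().

```python
import pandas as pd

# Hypothetical fruit dataset; values are invented for illustration
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana",
              "Orange", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported",
               "Imported", "Imported"],
    "Price per Unit": [1.0, 1.5, 1.25, 0.5, 0.75, 1.0, 1.0, 1.5],
    "Quantity Sold": [100, 80, 120, 200, 150, 50, 90, 110],
})

# Rows are fruits, columns are origins, cells are total quantity sold
pivot = pd.pivot_table(fruit_df, values="Quantity Sold",
                       index="Fruit", columns="Origin", aggfunc="sum")
print(pivot)

# The same numbers via groupby: aggregate by both keys, then move
# the "Origin" level of the index into the columns
same = fruit_df.groupby(["Fruit", "Origin"])["Quantity Sold"].sum().unstack()
print(same)
```

Combinations that never occur, such as local oranges in this table, show up as NaN unless fill_value is supplied.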
Fruit Example: Applying Pivot Tables
Let's use our fruit DataFrame and demonstrate some pivot tables:
1. Simple Pivot Table (Average Price by Fruit):

import pandas as pd

# Assuming fruit_df is your DataFrame.
# aggfunc defaults to "mean", so this gives the average price per fruit.
pivot_simple = pd.pivot_table(fruit_df, index=["Fruit"],
                              values=["Price per Unit"])
print(pivot_simple)
2. Pivot Table with Multiple Indexes (Average Price by Fruit and Origin):

pivot_multiple_index = pd.pivot_table(fruit_df,
                                      index=["Fruit", "Origin"],
                                      values=["Price per Unit"])
print(pivot_multiple_index)
3. Pivot Table with Columns and Aggregation:

pivot_columns_agg = pd.pivot_table(fruit_df, values="Price per Unit",
                                   index="Fruit", columns="Origin",
                                   aggfunc="mean", fill_value=0)

print(pivot_columns_agg)

4. Pivot Table with Multiple Aggregations:

pivot_multiple_agg = pd.pivot_table(fruit_df,
                                    values=["Price per Unit", "Quantity Sold"],
                                    index=["Fruit", "Origin"],
                                    aggfunc={"Price per Unit": "mean",
                                             "Quantity Sold": ["min", "max"]},
                                    fill_value=0)

print(pivot_multiple_agg)
Cross-tabulations

• Crosstabs are a way to summarize categorical data (data that falls into categories).
• They show the frequency distribution of data across two or more variables.
• They are useful for understanding the relationships between different categorical
variables.
Fruit Example:
1. Counting Fruits by Color and Origin:
2. Adding Margins (Totals):
3. Using Multiple Index or Column Variables:
4. Renaming Index and Column Names:
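The code for these four cases did not survive in this copy of the slides, so the following is a sketch of one plausible reading using pd.crosstab(). The fruit_df is hypothetical (invented values), and the labels "Total", "Fruit Color", and "Fruit Origin" are illustrative choices.

```python
import pandas as pd

# Hypothetical fruit dataset; values are invented for illustration
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana",
              "Orange", "Orange"],
    "Color": ["Red", "Green", "Red", "Yellow", "Yellow", "Green",
              "Orange", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported",
               "Imported", "Imported"],
})

# 1. Frequency of each (Color, Origin) combination
ct = pd.crosstab(fruit_df["Color"], fruit_df["Origin"])
print(ct)

# 2. Same table with row and column totals added
ct_margins = pd.crosstab(fruit_df["Color"], fruit_df["Origin"],
                         margins=True, margins_name="Total")
print(ct_margins)

# 3. Several row variables: pass a list of Series as the index
ct_multi = pd.crosstab([fruit_df["Fruit"], fruit_df["Color"]],
                       fruit_df["Origin"])
print(ct_multi)

# 4. Custom names for the row and column axes
ct_named = pd.crosstab(fruit_df["Color"], fruit_df["Origin"],
                       rownames=["Fruit Color"], colnames=["Fruit Origin"])
print(ct_named)
```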
