EDA_Module_3-1
Analytics
Course Code: 23CSE422
Mrs. Vinutha K,
Assistant Professor
Department of CS & E
Module-3:
Grouping and Correlation
Grouping Datasets - Understanding groupby()
What is groupby() in Pandas?
• The groupby() function groups rows based on the values in one or more columns.
import pandas as pd

data = {'Fruit': ['Apple', 'Banana', 'Apple', 'Banana'],
        'Store': ['A', 'B', 'B', 'A'],
        'Quantity': [10, 15, 20, 5]}
df = pd.DataFrame(data)

# Group by 'Fruit' and calculate the sum of 'Quantity'
grouped_data = df.groupby('Fruit')['Quantity'].sum()
print(grouped_data)
Groupby Mechanics
• When we analyze data using pandas, we often need to divide it into groups based on
certain criteria.
• "Groupby mechanics" refers to how we organize our dataset into these groups so we
can perform operations or changes.
• The groupby() method involves two main steps:
1. Splitting the data into groups based on some criteria.
2. Applying a function to each group independently.
• pandas then combines the per-group results into a single output.
• groupby() is most useful on a dataset that has both categorical columns to group by (e.g., Color)
and numerical columns to summarize (e.g., Price).
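The split and apply steps can be sketched on a small toy table. This is only an illustrative sketch: the rows below are invented, and the column names "Toy Name," "Color," "Material," and "Price" are taken from the toy example used later in this module.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Plane"],
    "Color": ["Red", "Blue", "Red", "Silver", "Blue"],
    "Material": ["Plastic", "Metal", "Plastic", "Metal", "Metal"],
    "Price": [10.0, 25.0, 6.0, 30.0, 15.0],
})

# Split: groupby() records which rows belong to which "Color" group
by_color = toys.groupby("Color")
group_sizes = by_color.size()

# Apply: sum() runs on each group independently,
# and pandas combines the per-group results into one Series
total_price_by_color = by_color["Price"].sum()

print(group_sizes)
print(total_price_by_color)
```

Nothing is computed when groupby() is called; the work happens when a function such as size() or sum() is applied to the grouped object.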
Selecting a subset of columns
• When you use groupby(), you're not limited to just one category.
You can group your data based on multiple columns.
• This allows for more detailed analysis.
• After grouping by multiple columns, you can then perform
calculations (like average, sum, etc.) on the resulting groups.
Simple Example:
Imagine you have a table of information about toys, with columns such as "Toy Name," "Color,"
"Material," and "Price."
Let's say you want to find the average price of toys, not just by color, but by color and material.
1. Group by Multiple Columns: You would group the data by both "Color" and "Material." This
creates groups like:
• Red & Plastic
• Blue & Metal
• Silver & Metal
2. Calculate the Average Price: Then, you calculate the average "Price" for each of these combined
groups.
This gives you a more refined analysis than just looking at the average price by color or by material
alone.
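The two steps above can be sketched as follows. The toy rows are hypothetical and were chosen only so that the groups Red & Plastic, Blue & Metal, and Silver & Metal appear.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Plane"],
    "Color": ["Red", "Blue", "Red", "Silver", "Blue"],
    "Material": ["Plastic", "Metal", "Plastic", "Metal", "Metal"],
    "Price": [10.0, 25.0, 6.0, 30.0, 15.0],
})

# 1. Group by both columns; 2. select "Price" and average it within each group
avg_price = toys.groupby(["Color", "Material"])["Price"].mean()
print(avg_price)
```

Passing a list of column names to groupby() produces one group per combination that actually occurs in the data, and the result is indexed by both columns.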
Max and min
Instead of the automobile dataset used in the textbook, we'll use a toy dataset as an example.
Concept: After grouping data, we often want to find the highest (maximum) or lowest (minimum)
values within each group.
Objective: To find the highest and lowest values within specific groups of data.
Method:
Use groupby() to create groups (e.g., by "Color").
Apply the max() function to get the maximum value in each group.
Apply the min() function to get the minimum value in each group.
Example:
• Imagine a toy dataset with columns like "Toy Name," "Color," "Material," and "Price."
• If we group by "Color," then using max() on "Price" gives the price of the most expensive toy of each color.
• If we group by "Color," then using min() on "Price" gives the price of the least expensive toy of each color.
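A minimal sketch of the method above, using hypothetical toy rows. The idxmax() step is an addition not mentioned on the slides; it is one way to recover which toy has the highest price, since max() alone returns only the price.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Plane"],
    "Color": ["Red", "Blue", "Red", "Silver", "Blue"],
    "Material": ["Plastic", "Metal", "Plastic", "Metal", "Metal"],
    "Price": [10.0, 25.0, 6.0, 30.0, 15.0],
})

price_by_color = toys.groupby("Color")["Price"]

highest = price_by_color.max()  # highest price within each color
lowest = price_by_color.min()   # lowest price within each color

# idxmax() returns the row label of the highest price in each group,
# which can be used to look up the full row for that toy
most_expensive = toys.loc[price_by_color.idxmax(), ["Color", "Toy Name", "Price"]]

print(highest)
print(lowest)
print(most_expensive)
```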
Mean
Finding the Mean: The mean() method is used to calculate the average of numerical columns
within each group.
Mean of a Specific Group: You can also find the mean of each numerical column for a specific
group by using get_group() and then applying the mean() method.
Counting Records: The count() method counts the non-missing values in each group, which equals
the number of records when no values are missing.
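The three operations above can be sketched on the same kind of toy table. The rows are hypothetical; numeric_only=True is used because the table also contains text columns, which cannot be averaged.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Plane"],
    "Color": ["Red", "Blue", "Red", "Silver", "Blue"],
    "Material": ["Plastic", "Metal", "Plastic", "Metal", "Metal"],
    "Price": [10.0, 25.0, 6.0, 30.0, 15.0],
})

by_color = toys.groupby("Color")

# Mean of the numerical columns within each group
mean_by_color = by_color.mean(numeric_only=True)

# Mean for one specific group: get_group() returns that group's rows as a DataFrame
red_toys = by_color.get_group("Red")
red_mean = red_toys.mean(numeric_only=True)

# Number of non-missing "Price" values in each group
price_counts = by_color["Price"].count()

print(mean_by_color)
print(red_mean)
print(price_counts)
```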
Data aggregation
• Data aggregation involves applying a mathematical operation to a dataset or a subset
of it.
• In pandas, the aggregate() function (or its shorthand agg()) is used to perform these
aggregations across columns.
Common aggregation functions include:
sum(): Calculates the sum of values.
min(): Finds the minimum value.
max(): Finds the maximum value.
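As a small sketch, agg() can apply several of these functions in one call, either to a whole column or to each group. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toys = pd.DataFrame({
    "Toy Name": ["Car", "Robot", "Ball", "Train", "Plane"],
    "Color": ["Red", "Blue", "Red", "Silver", "Blue"],
    "Price": [10.0, 25.0, 6.0, 30.0, 15.0],
})

# Aggregating a whole column: one value per function
price_summary = toys["Price"].agg(["sum", "min", "max"])

# Aggregating per group: one row per color, one column per function
summary_by_color = toys.groupby("Color")["Price"].agg(["sum", "min", "max"])

print(price_summary)
print(summary_by_color)
```

Function names are passed as strings here; agg() also accepts callables, and aggregate() behaves the same way.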
Group-wise operations
1. Grouping:
The groupby operation is like sorting your data into categories. For example, you might want to
group the data by "Fruit" and "Origin." This would create groups like:
• Apple, Local
• Apple, Imported
• Banana, Local
• Banana, Imported
• Orange, Imported
In our fruit example, the grouping columns are "Fruit" and "Origin."
2. Aggregation:
Once you have these groups, you can perform calculations (aggregations) on the data within each
group.
Examples of aggregations:
min: Find the minimum value (e.g., the minimum price of apples from a specific origin).
max: Find the maximum value (e.g., the maximum quantity of bananas sold from a specific origin).
mean: Calculate the average value (e.g., the average price of oranges).
sum: Calculate the total of the values (e.g., the total quantity of apples sold from a specific
origin).
std: Calculate the standard deviation (e.g., how spread out the prices of bananas are).
In our fruit example, we can find the min/max/mean/sum of the price per unit or of the quantity sold.
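The grouping and aggregation steps can be sketched as below. The fruit rows are invented, and the column name "Quantity Sold" is an assumption; "Fruit," "Origin," and "Price per Unit" follow the names used in this module.

```python
import pandas as pd

# Hypothetical fruit data; "Quantity Sold" is an assumed column name
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported", "Imported"],
    "Price per Unit": [1.0, 1.5, 1.2, 0.5, 0.7, 0.9, 1.1],
    "Quantity Sold": [100, 50, 80, 200, 150, 90, 60],
})

# 1. Grouping: one group per (Fruit, Origin) combination present in the data
grouped = fruit_df.groupby(["Fruit", "Origin"])

# 2. Aggregation: several statistics of the price within each group
price_stats = grouped["Price per Unit"].agg(["min", "max", "mean", "std"])

# Total quantity sold within each group
quantity_total = grouped["Quantity Sold"].sum()

print(price_stats)
print(quantity_total)
```

Groups with a single row have no spread to measure, so std is reported as NaN for them.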
Renaming grouped aggregation columns
Renaming the aggregated columns makes the resulting DataFrame much more readable
and understandable.
Applying it to the Fruit Example:
Let's say we want to find the:
Maximum price of each fruit from each origin (and name it "Max Price").
Minimum price of each fruit from each origin (and name it "Min Price").
Average price of each fruit from each origin (and name it "Average Price").
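One way to produce these three named columns is to aggregate and then rename the result. This is a sketch on invented fruit rows, not necessarily the approach shown on the original slide.

```python
import pandas as pd

# Hypothetical fruit data, invented for illustration only
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported", "Imported"],
    "Price per Unit": [1.0, 1.5, 1.2, 0.5, 0.7, 0.9, 1.1],
})

price_summary = (
    fruit_df.groupby(["Fruit", "Origin"])["Price per Unit"]
    .agg(["max", "min", "mean"])  # columns are named after the functions
    .rename(columns={"max": "Max Price",
                     "min": "Min Price",
                     "mean": "Average Price"})
)
print(price_summary)
```

Without the rename step the columns would be labelled max, min, and mean, which says nothing about which quantity was aggregated.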
Group-wise transformations
Aggregation vs. Transformation:
Aggregation (like we did before with agg()) reduces the data within each group to a single
summary value (e.g., the average price).
Transformation, on the other hand, preserves the original size of the data. It applies a
function to each element within a group and returns a result with the same number of rows
as the original group.
Purpose:
- Transformations are used when you want to modify or calculate values based on group
information without changing the shape of your DataFrame.
- A common use case is to standardize data within groups or to calculate group-specific
statistics and add them as new columns.
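The difference between aggregation and transformation can be seen in a short sketch. The fruit rows and the two new column names are hypothetical.

```python
import pandas as pd

# Hypothetical fruit data, invented for illustration only
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported", "Imported"],
    "Price per Unit": [1.0, 1.5, 1.2, 0.5, 0.7, 0.9, 1.1],
})

prices_by_fruit = fruit_df.groupby("Fruit")["Price per Unit"]

# Aggregation: one value per group (3 fruits -> 3 rows)
aggregated = prices_by_fruit.mean()

# Transformation: one value per original row (7 rows -> 7 rows),
# so the result can be stored directly as a new column
fruit_df["Fruit Avg Price"] = prices_by_fruit.transform("mean")

# Group-specific statistic in use: how far each price is from its fruit's average
fruit_df["Diff from Fruit Avg"] = fruit_df["Price per Unit"] - fruit_df["Fruit Avg Price"]

print(aggregated)
print(fruit_df)
```

Because transform() keeps the original index, the new column lines up with the existing rows without any merge.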
Pivot tables and cross-tabulations
• Pivot tables are a powerful tool for summarizing and analyzing data.
• They allow you to reshape data into a table format, where rows and columns
represent different categories.
• They are useful for aggregating data based on multiple dimensions.
Fruit Example: Applying Pivot Tables
Let's use our fruit DataFrame and demonstrate some pivot tables:
1. Simple Pivot Table (Average Price by Fruit):
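import pandas as pd
import numpy as np

# fruit_df is the fruit DataFrame referred to above.
# With no aggfunc given, pivot_table() averages the values. Using index="Fruit" alone
# gives the average price by fruit; adding "Origin" splits that average by origin as well.
pivot_multiple_index = pd.pivot_table(fruit_df,
                                      index=["Fruit", "Origin"],
                                      values=["Price per Unit"])
print(pivot_multiple_index)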
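3. Pivot Table with Columns and Aggregation:
# Assumption: the opening of this call is not shown; values="Price per Unit" is assumed here
pivot_columns_agg = pd.pivot_table(fruit_df, values="Price per Unit",
                                   index="Fruit", columns="Origin",
                                   aggfunc="mean", fill_value=0)
print(pivot_columns_agg)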
print(pivot_multiple_agg)
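For reference, the pivot table variants named in this section can be run end to end with a small stand-in for the fruit DataFrame. This is a sketch under assumptions: the rows are invented, "Quantity Sold" is an assumed column name, and the choice of aggregations for pivot_multiple_agg is a guess at what that table might contain.

```python
import pandas as pd

# Hypothetical stand-in for the fruit DataFrame; "Quantity Sold" is an assumed column
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported", "Imported"],
    "Price per Unit": [1.0, 1.5, 1.2, 0.5, 0.7, 0.9, 1.1],
    "Quantity Sold": [100, 50, 80, 200, 150, 90, 60],
})

# Simple pivot table: average price by fruit (mean is the default aggfunc)
pivot_simple = pd.pivot_table(fruit_df, index="Fruit", values="Price per Unit")

# Multiple index: average price by fruit and origin
pivot_multiple_index = pd.pivot_table(fruit_df,
                                      index=["Fruit", "Origin"],
                                      values=["Price per Unit"])

# Columns and aggregation: fruits as rows, origins as columns,
# with 0 filled in for combinations that do not occur
pivot_columns_agg = pd.pivot_table(fruit_df, values="Price per Unit",
                                   index="Fruit", columns="Origin",
                                   aggfunc="mean", fill_value=0)

# Multiple aggregations (assumed choice): total and average quantity by fruit
pivot_multiple_agg = pd.pivot_table(fruit_df, index="Fruit",
                                    values="Quantity Sold",
                                    aggfunc=["sum", "mean"])

print(pivot_simple)
print(pivot_multiple_index)
print(pivot_columns_agg)
print(pivot_multiple_agg)
```

Note that fill_value=0 makes a missing combination look like a price of zero, so it should be read as "no data" rather than as a real average.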
Cross-tabulations
• Crosstabs are a way to summarize categorical data (data that falls into categories).
• They show the frequency distribution of data across two or more variables.
• They are useful for understanding the relationships between different categorical
variables.
Fruit Example:
1. Counting Fruits by Color and Origin:
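A count of this kind can be sketched with pd.crosstab(). The rows below, including the "Color" values, are invented for illustration.

```python
import pandas as pd

# Hypothetical fruit data, invented for illustration only
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana", "Orange"],
    "Color": ["Red", "Green", "Red", "Yellow", "Yellow", "Green", "Orange"],
    "Origin": ["Local", "Imported", "Local", "Local", "Imported", "Imported", "Imported"],
})

# Rows are colors, columns are origins, and each cell counts the matching records
color_by_origin = pd.crosstab(fruit_df["Color"], fruit_df["Origin"])

# margins=True adds an "All" row and column holding the totals
with_totals = pd.crosstab(fruit_df["Color"], fruit_df["Origin"], margins=True)

print(color_by_origin)
print(with_totals)
```

Combinations that never occur, such as Red and Imported in this data, appear as 0 in the table.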