EDA_Module_3-1

This document is a course module on Exploratory Data Analytics, focusing on the use of the groupby() function in Pandas for data analysis. It explains the Split-Apply-Combine approach, how to group datasets by multiple columns, and perform various aggregations such as sum, mean, max, and min. Additionally, it covers the creation of pivot tables and cross-tabulations for summarizing and analyzing data.

Uploaded by

dimpalrajput511


Course: Exploratory Data Analytics
Course Code: 23CSE422

Mrs. Vinutha K,
Assistant Professor
Department of CS & E

Module-3: Grouping and Correlation
Grouping Datasets - Understanding groupby()
What is groupby() in Pandas?
• The groupby() function groups rows based on the values in one or more columns.
• It follows the Split-Apply-Combine approach:
  • Split: Divide the data into groups.
    • For a table of fruit sales, groupby() can split the data by "Fruit." If the table contains only apples and bananas, this creates two groups: one for "Apple" and one for "Banana."
  • Apply: Perform a calculation (sum, mean, etc.) on each group.
    • Applying the sum() function to the "Quantity" column of each group gives the total quantity of apples sold and the total quantity of bananas sold.
  • Combine: Merge the results into a new table (a DataFrame or a Series).
    • The results (total apples and total bananas) are combined into a new table.
Code:

import pandas as pd
data = {'Fruit': ['Apple', 'Banana', 'Apple', 'Banana'],
        'Store': ['A', 'B', 'B', 'A'],
        'Quantity': [10, 15, 20, 5]}
df = pd.DataFrame(data)
# Group by 'Fruit' and calculate the sum of 'Quantity'
grouped_data = df.groupby('Fruit')['Quantity'].sum()
print(grouped_data)
Groupby Mechanics
• When we analyze data using pandas, we often need to divide it into groups based on certain criteria.
• "Groupby mechanics" refers to how we organize our dataset into these groups so we can perform operations or changes.
• The groupby() method involves two main steps:
  1. Splitting the data into groups based on some criteria.
  2. Applying a function to each group independently.
• groupby() is most useful on a dataset that has both numerical (e.g., Price) and categorical (e.g., Color) data.
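The two steps can be sketched as follows. This is a minimal illustration, not the slide's original example: the `toys` table and its values are made up for the purpose.

```python
import pandas as pd

# Hypothetical data: one categorical column (Color) and one numerical column (Price).
toys = pd.DataFrame({
    "Color": ["Red", "Blue", "Red", "Blue"],
    "Price": [10.0, 20.0, 30.0, 40.0],
})

# Step 1 (split): groupby() returns a GroupBy object; no calculation happens yet.
groups = toys.groupby("Color")

# Iterating over the GroupBy object yields (group key, sub-DataFrame) pairs.
group_sizes = {color: len(sub_df) for color, sub_df in groups}

# Step 2 (apply): run a function on each group independently.
mean_price = groups["Price"].mean()

print(group_sizes)
print(mean_price)
```

Keeping the GroupBy object in a variable makes it possible to apply several functions to the same split without regrouping.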
Selecting a subset of columns

• When you use groupby(), you're not limited to just one category. You can group your data based on multiple columns.
• This allows for more detailed analysis.
• After grouping by multiple columns, you can then perform calculations (like average, sum, etc.) on the resulting groups.
Simple Example:
Imagine you have a table of information about toys, with columns such as "Toy Name," "Color," "Material," and "Price."
Let's say you want to find the average price of toys, not just by color, but by color and material.
1. Group by Multiple Columns: You would group the data by both "Color" and "Material." This creates groups like:
• Red & Plastic
• Blue & Metal
• Silver & Metal
2. Calculate the Average Price: Then, you calculate the average "Price" for each of these combined groups.
This gives you a more refined analysis than just looking at the average price by color or by material alone.
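A minimal sketch of this two-column grouping is shown here. The toy names and prices are hypothetical, chosen only so that the three groups named above appear.

```python
import pandas as pd

# Hypothetical toy table; the values are illustrative only.
toys = pd.DataFrame({
    "Toy Name": ["Car", "Ball", "Robot", "Train", "Truck"],
    "Color": ["Red", "Red", "Blue", "Silver", "Blue"],
    "Material": ["Plastic", "Plastic", "Metal", "Metal", "Metal"],
    "Price": [10.0, 6.0, 25.0, 30.0, 15.0],
})

# Passing a list of column names groups by every distinct (Color, Material) pair.
avg_price = toys.groupby(["Color", "Material"])["Price"].mean()
print(avg_price)
```

The result is indexed by both grouping columns, so a single group can be looked up with a tuple such as `avg_price.loc[("Red", "Plastic")]`.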
Max and min
Instead of the automobile dataset used in the textbook, we'll use a toy dataset as an example.
Concept: After grouping data, we often want to find the highest (maximum) or lowest (minimum) values within each group.
Objective: To find the highest and lowest values within specific groups of data.
Method:
• Use groupby() to create groups (e.g., by "Color").
• Apply the max() function to get the maximum value in each group.
• Apply the min() function to get the minimum value in each group.
Example:
• Imagine a toy dataset with columns like "Toy Name," "Color," "Material," and "Price."
• If we group by "Color," then using max() on "Price" finds the price of the most expensive toy of each color.
• If we group by "Color," then using min() on "Price" finds the price of the least expensive toy of each color.
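The method above might look like this in code. The toy data is hypothetical. Note that max() and min() return prices only; the last step uses idxmax(), which is an addition to the slide's method, to recover which toy has the highest price in each group.

```python
import pandas as pd

# Hypothetical toy table; the values are illustrative only.
toys = pd.DataFrame({
    "Toy Name": ["Car", "Ball", "Robot", "Train", "Truck"],
    "Color": ["Red", "Red", "Blue", "Silver", "Blue"],
    "Material": ["Plastic", "Plastic", "Metal", "Metal", "Metal"],
    "Price": [10.0, 6.0, 25.0, 30.0, 15.0],
})

by_color = toys.groupby("Color")["Price"]

max_price = by_color.max()  # highest price within each color
min_price = by_color.min()  # lowest price within each color

# idxmax() gives the row label of the highest price per color,
# which lets us look up the name of that toy.
priciest = toys.loc[by_color.idxmax(), ["Color", "Toy Name"]].set_index("Color")["Toy Name"]

print(max_price)
print(min_price)
print(priciest)
```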
Mean

Finding the Mean: The mean() method is used to calculate the average of numerical columns within each group.
Mean of a Specific Group: You can also find the mean of each column for a specific group by using get_group() and then applying the mean() method.
Counting Records: The count() method can be used to count the number of records in each group.
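The three operations can be sketched as follows, again on a hypothetical toy table. The `numeric_only=True` argument is an assumption about the setup: it is passed because the table also holds text columns, which recent pandas versions do not average silently.

```python
import pandas as pd

# Hypothetical toy table; the values are illustrative only.
toys = pd.DataFrame({
    "Toy Name": ["Car", "Ball", "Robot", "Train", "Truck"],
    "Color": ["Red", "Red", "Blue", "Silver", "Blue"],
    "Material": ["Plastic", "Plastic", "Metal", "Metal", "Metal"],
    "Price": [10.0, 6.0, 25.0, 30.0, 15.0],
})

grouped = toys.groupby("Color")

# Mean of the numerical columns within each group.
mean_by_color = grouped.mean(numeric_only=True)

# Mean for one specific group: pull out the "Red" rows, then average them.
red_mean = grouped.get_group("Red").mean(numeric_only=True)

# Number of non-missing Price records in each group.
counts = grouped["Price"].count()

print(mean_by_color)
print(red_mean)
print(counts)
```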
Data aggregation
• Data aggregation involves applying a mathematical operation to a dataset or a subset of it.
• In pandas, the aggregate() method (or its alias agg()) is used to perform these aggregations across columns.
Common aggregation functions include:
• sum(): Calculates the sum of values.
• min(): Finds the minimum value.
• max(): Finds the maximum value.
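As a rough illustration, agg() accepts a list of function names and computes all of them in one call, both on a whole column and on each group. The toy data here is hypothetical.

```python
import pandas as pd

# Hypothetical toy table; the values are illustrative only.
toys = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Silver", "Blue"],
    "Price": [10.0, 6.0, 25.0, 30.0, 15.0],
})

# Aggregating a whole column: one value per function.
overall = toys["Price"].agg(["sum", "min", "max"])

# Aggregating per group: one row per color, one column per function.
per_color = toys.groupby("Color")["Price"].agg(["sum", "min", "max"])

print(overall)
print(per_color)
```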
Group-wise operations
1. Grouping:
The groupby operation sorts your data into categories. For example, you might want to group the data by "Fruit" and "Origin." This would create groups like:
• Apple, Local
• Apple, Imported
• Banana, Local
• Banana, Imported
• Orange, Imported
In our fruit example, the grouping columns are "Fruit" and "Origin."
2. Aggregation:
Once you have these groups, you can perform calculations (aggregations) on the data within each group.
Examples of aggregations:
• min: Find the minimum value (e.g., the minimum price of apples from a specific origin).
• max: Find the maximum value (e.g., the maximum quantity of bananas sold from a specific origin).
• mean: Calculate the average value (e.g., the average price of oranges).
• sum: Calculate the total (e.g., the total quantity of apples sold from a specific origin).
• std: Calculate the standard deviation (e.g., how spread out the prices of bananas are).
In our fruit example, we can find the min/max/mean/sum of the price per unit or of the quantity sold.
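The grouping and aggregation steps can be combined as in the sketch below. The fruit table is not reproduced in this text, so `fruit_df` and all of its values are hypothetical; only the column names "Fruit," "Origin," "Price per Unit," and "Quantity Sold" come from the module.

```python
import pandas as pd

# Hypothetical fruit table; the values are illustrative only.
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Orange"],
    "Origin": ["Local", "Local", "Imported", "Local", "Imported", "Imported"],
    "Price per Unit": [1.0, 2.0, 1.5, 0.5, 0.75, 1.25],
    "Quantity Sold": [100, 40, 60, 150, 80, 90],
})

# 1. Grouping: one group per (Fruit, Origin) pair.
grouped = fruit_df.groupby(["Fruit", "Origin"])

# 2. Aggregation: several statistics of the price within each group.
price_stats = grouped["Price per Unit"].agg(["min", "max", "mean", "sum", "std"])

# A different column can be aggregated the same way.
total_quantity = grouped["Quantity Sold"].sum()

print(price_stats)
print(total_quantity)
```

Groups that contain a single row, such as (Orange, Imported) here, get NaN for std, because the sample standard deviation needs at least two values.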
Renaming grouped aggregation columns

Renaming the aggregated columns makes the resulting DataFrame much more readable and understandable.
Applying it to the Fruit Example:
Let's say we want to find the:
• Maximum price of each fruit from each origin (and name it "Max Price").
• Minimum price of each fruit from each origin (and name it "Min Price").
• Average price of each fruit from each origin (and name it "Average Price").
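One way to get these three named columns is pandas' named aggregation, sketched below. This may not be the approach used on the original slide (renaming after agg() would also work), and `fruit_df` with its values is hypothetical.

```python
import pandas as pd

# Hypothetical fruit table; the values are illustrative only.
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Orange"],
    "Origin": ["Local", "Local", "Imported", "Local", "Imported", "Imported"],
    "Price per Unit": [1.0, 2.0, 1.5, 0.5, 0.75, 1.25],
})

# Named aggregation: output name -> (input column, aggregation function).
# The names contain spaces, so they are passed as an unpacked dictionary
# rather than as plain keyword arguments.
renamed = fruit_df.groupby(["Fruit", "Origin"]).agg(**{
    "Max Price": ("Price per Unit", "max"),
    "Min Price": ("Price per Unit", "min"),
    "Average Price": ("Price per Unit", "mean"),
})

print(renamed)
```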
Group-wise transformations
Aggregation vs. Transformation:
• Aggregation (like we did before with agg()) reduces the data within each group to a single summary value (e.g., the average price).
• Transformation, on the other hand, preserves the original size of the data. It applies a function to the values of each group and returns a result with the same number of rows as the original group.
Purpose:
• Transformations are used when you want to modify or calculate values based on group information without changing the shape of your DataFrame.
• A common use case is to standardize data within groups or to calculate group-specific statistics and add them as new columns.
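The difference is easiest to see side by side. In this sketch, which uses a hypothetical `fruit_df` and made-up new column names, agg-style mean() returns one row per fruit, while transform("mean") returns one value per original row, so it can be stored as a new column.

```python
import pandas as pd

# Hypothetical fruit table; the values are illustrative only.
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Orange"],
    "Price per Unit": [1.0, 2.0, 1.5, 0.5, 0.75, 1.25],
})

prices_by_fruit = fruit_df.groupby("Fruit")["Price per Unit"]

# Aggregation: one summary value per group (3 rows for 3 fruits).
aggregated = prices_by_fruit.mean()

# Transformation: the group mean is broadcast back to every row (6 rows).
fruit_df["Fruit Mean Price"] = prices_by_fruit.transform("mean")

# A group-specific statistic derived from it: distance from the group mean.
fruit_df["Diff From Fruit Mean"] = fruit_df["Price per Unit"] - fruit_df["Fruit Mean Price"]

print(aggregated)
print(fruit_df)
```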
Pivot tables and cross-tabulations

• Pivot tables are a powerful tool for summarizing and analyzing data.
• They allow you to reshape data into a table format, where rows and columns represent different categories.
• They are useful for aggregating data based on multiple dimensions.
Fruit Example: Applying Pivot Tables
Let's use our fruit DataFrame and demonstrate some pivot tables:
1. Simple Pivot Table (Average Price by Fruit):

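import pandas as pd

# Assuming fruit_df is your DataFrame.
# pivot_table() uses the mean as its default aggregation function,
# so this gives the average "Price per Unit" for each fruit.
pivot_simple = pd.pivot_table(fruit_df, index=["Fruit"],
                              values=["Price per Unit"])
print(pivot_simple)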
2. Pivot Table with Multiple Indexes (Average Price by Fruit and Origin):

# Average "Price per Unit" for each (Fruit, Origin) pair
pivot_multiple_index = pd.pivot_table(fruit_df,
                                      index=["Fruit", "Origin"],
                                      values=["Price per Unit"])
print(pivot_multiple_index)
3. Pivot Table with Columns and Aggregation:

# One row per fruit, one column per origin; missing combinations are filled with 0
pivot_columns_agg = pd.pivot_table(fruit_df, values="Price per Unit",
                                   index="Fruit", columns="Origin",
                                   aggfunc="mean", fill_value=0)

print(pivot_columns_agg)

4. Pivot Table with Multiple Aggregations:

# Mean price, plus minimum and maximum quantity sold, for each (Fruit, Origin) pair
pivot_multiple_agg = pd.pivot_table(fruit_df, values=["Price per Unit", "Quantity Sold"],
                                    index=["Fruit", "Origin"],
                                    aggfunc={"Price per Unit": "mean",
                                             "Quantity Sold": ["min", "max"]}, fill_value=0)

print(pivot_multiple_agg)
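The pivot-table examples above assume an existing fruit_df that is not reproduced in this text. To try them out, a small stand-in table can be built first. The values below are hypothetical; only the column names are taken from the examples.

```python
import pandas as pd

# Hypothetical stand-in for fruit_df; the values are illustrative only.
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Orange"],
    "Origin": ["Local", "Local", "Imported", "Local", "Imported", "Imported"],
    "Price per Unit": [1.0, 2.0, 1.5, 0.5, 0.75, 1.25],
    "Quantity Sold": [100, 40, 60, 150, 80, 90],
})

# Example 1: average price per fruit (mean is the default aggfunc).
pivot_simple = pd.pivot_table(fruit_df, index=["Fruit"],
                              values=["Price per Unit"])

# Example 3: fruits as rows, origins as columns, gaps filled with 0.
pivot_columns_agg = pd.pivot_table(fruit_df, values="Price per Unit",
                                   index="Fruit", columns="Origin",
                                   aggfunc="mean", fill_value=0)

print(pivot_simple)
print(pivot_columns_agg)
```

With this data there are no local oranges, so the Orange/Local cell shows the fill_value of 0 rather than NaN.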
Cross-tabulations

• Crosstabs are a way to summarize categorical data (data that falls into categories).
• They show the frequency distribution of data across two or more variables.
• They are useful for understanding the relationships between different categorical variables.
Fruit Example:
1. Counting Fruits by Color and Origin:
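The original code for this step is not preserved in this text. A minimal sketch with pd.crosstab(), assuming a hypothetical fruit table with "Color" and "Origin" columns, could look like this:

```python
import pandas as pd

# Hypothetical fruit table; the values are illustrative only.
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Orange"],
    "Color": ["Red", "Green", "Red", "Yellow", "Yellow", "Orange"],
    "Origin": ["Local", "Local", "Imported", "Local", "Imported", "Imported"],
})

# Rows are colors, columns are origins, and each cell counts matching rows.
color_by_origin = pd.crosstab(fruit_df["Color"], fruit_df["Origin"])
print(color_by_origin)
```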

2. Adding Margins (Totals):
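The original code for this step is not preserved in this text. As a sketch on a hypothetical fruit table, passing margins=True to pd.crosstab() adds a row and a column of totals, labeled "All" by default:

```python
import pandas as pd

# Hypothetical fruit table; the values are illustrative only.
fruit_df = pd.DataFrame({
    "Color": ["Red", "Green", "Red", "Yellow", "Yellow", "Orange"],
    "Origin": ["Local", "Local", "Imported", "Local", "Imported", "Imported"],
})

# margins=True appends an "All" row and an "All" column holding the totals.
with_totals = pd.crosstab(fruit_df["Color"], fruit_df["Origin"], margins=True)
print(with_totals)
```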


3. Using Multiple Index or Column Variables:
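The original code for this step is not preserved in this text. One possible version, on a hypothetical fruit table, passes a list of columns as the row variable so that the result has a two-level row index:

```python
import pandas as pd

# Hypothetical fruit table; the values are illustrative only.
fruit_df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Banana", "Banana", "Orange"],
    "Color": ["Red", "Green", "Red", "Yellow", "Yellow", "Orange"],
    "Origin": ["Local", "Local", "Imported", "Local", "Imported", "Imported"],
})

# A list of Series as the first argument produces one row per (Fruit, Color) pair.
multi_index_ct = pd.crosstab([fruit_df["Fruit"], fruit_df["Color"]],
                             fruit_df["Origin"])
print(multi_index_ct)
```

A list can be passed as the column argument in the same way to get multi-level columns.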


4. Renaming Index and Column Names:
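The original code for this step is not preserved in this text. A sketch using the rownames and colnames parameters of pd.crosstab() follows; the fruit table and the labels "Fruit Color" and "Fruit Origin" are hypothetical.

```python
import pandas as pd

# Hypothetical fruit table; the values are illustrative only.
fruit_df = pd.DataFrame({
    "Color": ["Red", "Green", "Red", "Yellow", "Yellow", "Orange"],
    "Origin": ["Local", "Local", "Imported", "Local", "Imported", "Imported"],
})

# rownames/colnames set the labels shown for the row and column axes.
renamed_ct = pd.crosstab(fruit_df["Color"], fruit_df["Origin"],
                         rownames=["Fruit Color"], colnames=["Fruit Origin"])
print(renamed_ct)
```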
