Pandas Module

Pandas is an open-source Python library primarily used for data manipulation, analysis, and cleaning, providing easy-to-use data structures and analysis tools. It allows for the creation of DataFrames and Series, supports various operations, and offers functionalities for handling missing data and reading CSV files. Key features include the ability to clean data, analyze correlations, and integrate seamlessly with other libraries.


PYTHON MODULE.

PANDAS
Pandas is an open-source Python library

It is mainly used for:


 data manipulation,
 analysis,
 data cleaning.

It provides easy-to-use
1. data structures
2. data analysis tools

Why do we need Pandas?


 Easy to learn and use
 Works well with other libraries like NumPy, Matplotlib, and Scikit-learn
 Excellent performance with large datasets
 Built-in functions for time series analysis
IMPORTING PANDAS
It's a common convention in Python to import the Pandas library and give it the alias pd.
An alias is essentially a secondary name or shortcut for an object, module, or function.
import pandas as pd
This shorthand makes it easier to refer to the library's functions and methods without needing to write out pandas every time.
import pandas as pd

# Creating a DataFrame using 'pd'
data = {
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 22],
    'City': ['Kampala', 'Entebbe', 'Mbarara']
}

df = pd.DataFrame(data)  # pd is used to call pandas functions

print(df)
PANDAS SERIES
What Is A Series?
A Pandas Series is essentially a one-dimensional labeled array capable of holding any data type (integers,
strings, floats, etc.).
Each element in the Series has an associated index (label), and you can access and manipulate these elements by
their index.
Key Features of a Pandas Series:
It's like a column in a DataFrame.
It has an index that labels each element.
It can hold various data types like integers, floats, or even strings.

Labels
If nothing else is specified, the values are labeled with their index number. First value has index 0, second value has
index 1 etc.
This label can be used to access a specified value.
With the index argument, you can name your own labels
Note: When a Series is created from a dictionary, the keys of the dictionary become the labels.
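The labeling rules above can be sketched as follows (the month labels and values are illustrative, not from the slides):

```python
import pandas as pd

# Default labels are the integer positions 0, 1, 2, ...
s = pd.Series([1.2, 0.8, 1.5])
print(s[0])  # access by default label

# Custom labels via the index argument
s2 = pd.Series([1.2, 0.8, 1.5], index=['jan', 'feb', 'mar'])
print(s2['feb'])  # access by custom label

# Creating a Series from a dictionary: the keys become the labels
s3 = pd.Series({'jan': 1.2, 'feb': 0.8})
print(s3['jan'])
```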
import pandas as pd

# Creating a Pandas Series for 'DataUsed_GB'
data_used_series = pd.Series(data_used_GB)

# Display the Series
print("Data Used Series:")
print(data_used_series)

Output:
Data Used Series:
0    1.2
1    0.8
2    1.5
3    2.1
4    0.9
5    1.0
6    1.8
7    2.3
8    1.6
dtype: float64

# Creating a Pandas Series with custom index
dates_series = pd.Series(dates, index=userIDs)

# Display the Series with custom index
print("\nDates Series with User IDs as index:")
print(dates_series)

Output:
Dates Series with User IDs as index:
UserID
101    2025-03-01
102    2025-03-01
103    2025-03-01
104    2025-03-02
105    2025-03-02
106    2025-03-02
107    2025-03-03
108    2025-03-03
109    2025-03-03
dtype: object
PANDAS DATAFRAME
A Pandas DataFrame is a two-dimensional labeled data structure, similar to a table in a
database or an Excel spreadsheet. It contains rows and columns, where each column can be of
a different data type (integers, floats, strings, etc.).

Key Features of a Pandas DataFrame:


•Rows and columns can be labeled with custom indices.
•Each column is a Pandas Series.
•Supports a wide range of operations like filtering, sorting, merging, reshaping, and more.
•Allows easy handling of missing data.

Locate Row
The DataFrame is like a table with rows and columns.
Pandas uses the loc (locate) attribute to return one or more specified rows.
import pandas as pd
# Data
userIDs = [101, 102, 103, 104, 105, 106, 107, 108, 109]
dates = ["2025-03-01", "2025-03-01", "2025-03-01", "2025-03-02", "2025-03-02", "2025-03-02", "2025-03-03", "2025-03-03", "2025-03-03"]
data_used_GB = [1.2, 0.8, 1.5, 2.1, 0.9, 1.0, 1.8, 2.3, 1.6]
call_duration_min = [45, 30, 60, 75, 40, 50, 65, 90, 55]
locations = ["New York", "San Francisco", "Chicago", "New York", "Los Angeles", "Chicago", "San Francisco", "Los Angeles", "New York"]
signal_strength_dBm = [-70, -65, -75, -68, -60, -72, -63, -62, -69]

# Creating DataFrame
data = {
    'UserID': userIDs,
    'Date': dates,
    'DataUsed_GB': data_used_GB,
    'CallDuration_min': call_duration_min,
    'Location': locations,
    'SignalStrength_dBm': signal_strength_dBm
}
df = pd.DataFrame(data)
# Displaying the DataFrame
print(df)
UserID Date DataUsed_GB CallDuration_min Location SignalStrength_dBm
0 101 2025-03-01 1.2 45 New York -70
1 102 2025-03-01 0.8 30 San Francisco -65
2 103 2025-03-01 1.5 60 Chicago -75
3 104 2025-03-02 2.1 75 New York -68
4 105 2025-03-02 0.9 40 Los Angeles -60
5 106 2025-03-02 1.0 50 Chicago -72
6 107 2025-03-03 1.8 65 San Francisco -63
7 108 2025-03-03 2.3 90 Los Angeles -62
8 109 2025-03-03 1.6 55 New York -69
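The loc attribute introduced under "Locate Row" is not demonstrated on the slides themselves; a minimal sketch on a cut-down version of the DataFrame above (only two columns kept for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'UserID': [101, 102, 103],
    'DataUsed_GB': [1.2, 0.8, 1.5],
})

# loc with a single label returns one row as a Series
print(df.loc[0])

# loc with a list of labels returns those rows as a DataFrame
print(df.loc[[0, 2]])

# loc with a label slice includes BOTH endpoints
print(df.loc[0:1])
```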
CSV Files/JSON Files
Pandas provides an easy way to read CSV (Comma Separated Values) files using the read_csv() function.

This allows you to load data from CSV files into a Pandas DataFrame for easy manipulation and analysis.

Contents of user_data.csv:

UserID,Date,DataUsed_GB,CallDuration_min,Location,SignalStrength_dBm
101,2025-03-01,1.2,45,New York,-70
102,2025-03-01,0.8,30,Santa Clara,-65
103,2025-03-01,1.5,60,Chicago,-75
104,2025-03-02,2.1,75,New York,-68
105,2025-03-02,0.9,40,Los Angeles,-60

import pandas as pd

# Read CSV file
df = pd.read_csv('user_data.csv')

# Display the DataFrame
print(df)

# Display the first 3 rows
print("\nFirst 3 rows:")
print(df.head(3))

# Display specific columns (e.g., 'UserID' and 'DataUsed_GB')
print("\nSelected columns:")
print(df[['UserID', 'DataUsed_GB']])
OUTPUT
UserID Date DataUsed_GB CallDuration_min Location SignalStrength_dBm
0 101 2025-03-01 1.2 45 New York -70
1 102 2025-03-01 0.8 30 Santa Clara -65
2 103 2025-03-01 1.5 60 Chicago -75
3 104 2025-03-02 2.1 75 New York -68
4 105 2025-03-02 0.9 40 Los Angeles -60

First 3 rows:
UserID Date DataUsed_GB CallDuration_min Location SignalStrength_dBm
0 101 2025-03-01 1.2 45 New York -70
1 102 2025-03-01 0.8 30 Santa Clara -65
2 103 2025-03-01 1.5 60 Chicago -75

Selected columns:
UserID DataUsed_GB
0 101 1.2
1 102 0.8
2 103 1.5
3 104 2.1
4 105 0.9
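The slide title also mentions JSON files, but only CSV is demonstrated. A minimal read_json sketch; the JSON string and its fields are illustrative, and StringIO stands in for a file on disk:

```python
import pandas as pd
from io import StringIO

# A small JSON document with the same column names as the CSV example
json_data = '[{"UserID": 101, "DataUsed_GB": 1.2}, {"UserID": 102, "DataUsed_GB": 0.8}]'

# read_json accepts a path or a file-like object
df = pd.read_json(StringIO(json_data))
print(df)
```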
MAX_ROWS
In Pandas, the default display of a DataFrame is limited to a certain number of rows when printed to the
console. This limit is controlled by the max_rows option. This setting specifies the maximum number of
rows that will be shown when printing a DataFrame.
To display a maximum number of rows: You can change the max_rows setting to control how many rows
you want to display when printing a DataFrame.

import pandas as pd
# Create a sample DataFrame with more rows
data = {'UserID': range(1, 101), 'Name': ['User'+str(i) for i in range(1, 101)]}
df = pd.DataFrame(data)

# Set max_rows to control the display of rows
pd.set_option('display.max_rows', 20)

# Print the DataFrame (only 20 rows will be shown)
print(df)
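The setting can also be inspected and restored, not just changed; a short sketch using get_option and reset_option (the value 20 mirrors the slide):

```python
import pandas as pd

# Inspect the current setting
print(pd.get_option('display.max_rows'))

# Change it, then restore the library default
pd.set_option('display.max_rows', 20)
print(pd.get_option('display.max_rows'))
pd.reset_option('display.max_rows')
```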
PANDAS - ANALYZING DATAFRAMES
1. Displaying the First and Last Few Rows:
•head(): Displays the first few rows of the DataFrame (default is 5 rows).
•tail(): Displays the last few rows of the DataFrame (default is 5 rows).

# Display the first 5 rows
print(df.head())

# Display the first 10 rows
print(df.head(10))

# Display the last 5 rows
print(df.tail())

# Display the last 10 rows
print(df.tail(10))

2. Viewing the Shape of the DataFrame:
To know the number of rows and columns in the DataFrame, use the shape attribute:

print(df.shape)  # Output: (rows, columns)

3. Viewing Basic Information:
•info(): Provides a concise summary of the DataFrame, including the number of non-null entries, column names, and data types.

print(df.info())

4. Accessing Specific Columns:
To view a specific column or subset of columns, you can access them directly by their names:

# Access a single column
print(df['column_name'])

# Access multiple columns
print(df[['column1', 'column2']])
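The snippets above use placeholder column names; here is a self-contained run of the same calls on a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'UserID': [101, 102, 103],
                   'DataUsed_GB': [1.2, 0.8, 1.5]})

print(df.head(2))   # first 2 rows
print(df.tail(1))   # last row
print(df.shape)     # (rows, columns)
df.info()           # prints the summary; returns None
print(df['UserID'])                    # single column -> Series
print(df[['UserID', 'DataUsed_GB']])  # list of columns -> DataFrame
```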
Cleaning Data
 Cleaning Data
 Cleaning Empty Cells
 Cleaning Wrong Format
 Cleaning Wrong Data
 Removing Duplicates
PANDAS - CLEANING DATA
Data cleaning is an essential step in the data analysis process, as raw data often comes with inconsistencies, missing values, duplicates, and errors.
1. Handling Missing Data (NaN values)
Missing data is often represented as NaN (Not a Number). You can clean or fill missing data using the following
methods:
a) Identifying Missing Values
To check for missing values in a DataFrame, you can use the isnull() method:
# Check for missing values in the DataFrame
print(df.isnull())

# Count missing values in each column
print(df.isnull().sum())
b) Dropping Missing Values
If you want to remove rows or columns with missing data, you can use dropna().
# Drop rows with any missing values
df_cleaned = df.dropna()

# Drop rows where all columns are NaN
df_cleaned = df.dropna(how='all')

# Drop columns with missing values
df_cleaned = df.dropna(axis=1)
c) Filling Missing Values
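This heading has no accompanying code on the slide; a minimal fillna() sketch (the Age values mirror the sample data used later in this section):

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, None, 30, 22, 35]})

# Fill missing values with a constant
print(df['Age'].fillna(0))

# Fill missing values with a statistic such as the column mean
print(df['Age'].fillna(df['Age'].mean()))
```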
2. Removing Duplicates
You can remove duplicate rows using drop_duplicates(). You can
also specify which columns to check for duplicates.
# Remove duplicate rows
df_cleaned = df.drop_duplicates()

# Remove duplicates based on specific columns
df_cleaned = df.drop_duplicates(subset=['column_name'])

# Keep the last occurrence of each duplicate (instead of the first)
df_cleaned = df.drop_duplicates(keep='last')
3. Renaming Columns
If the column names are not meaningful or need to be changed, you can rename them using rename().

# Rename columns
df.rename(columns={'old_name': 'new_name', 'old_column': 'new_column'}, inplace=True)
CLEANING EMPTY CELLS
Cleaning empty cells (or missing values) is a crucial step when preparing data for analysis. Empty cells
can be represented in different ways, such as NaN, None, or an empty string ""
1. Identifying Empty Cells
To check for empty or missing cells, you can use isnull() or isna(). These
functions return a DataFrame of the same shape as the original, but with True where
there are missing values and False where there are not.
import pandas as pd

# Sample DataFrame with missing values
data = {'UserID': [101, 102, 103, None, 105],
        'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
        'Age': [25, None, 30, 22, 35]}

df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())

# Count missing values in each column
print(df.isnull().sum())
2. Dropping Empty Cells
You can remove rows or columns with empty cells using dropna().

a) Dropping Rows with Empty Cells
By default, dropna() removes any row with at least one missing value:

# Drop rows with any missing values
df_cleaned = df.dropna()
print(df_cleaned)

b) Dropping Rows Based on Specific Columns
You can drop rows with missing values only in certain columns:

# Drop rows where 'Name' or 'Age' has missing values
df_cleaned = df.dropna(subset=['Name', 'Age'])
print(df_cleaned)

3. Filling Empty Cells
4. Replacing Empty Cells with Conditional Logic
5. Checking and Handling Empty Strings
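Item 5 above (handling empty strings) can be sketched as follows; the key point is that an empty string "" is not NaN, so it must be converted before isnull()-based cleaning sees it (the names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Alice', '', 'Bob', None]})

# isnull() detects None/NaN but NOT empty strings
print(df['Name'].isnull().sum())

# Convert empty strings to NaN first, then treat them like other missing values
df['Name'] = df['Name'].replace('', np.nan)
print(df['Name'].isnull().sum())
```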
RESEARCH ABOUT:
 Cleaning Wrong Format
 Cleaning Wrong Data
 Removing Duplicates
Summary of Key Pandas Functions for Cleaning Data:
•isnull() / isna(): Check for missing values.
•dropna(): Drop rows or columns with missing values.
•fillna(): Fill missing values with a constant, method, or statistic.
•drop_duplicates(): Remove duplicate rows.
•rename(): Rename columns.
•astype(): Convert data types.
•str.strip(): Remove leading/trailing spaces.
•str.lower() / str.upper(): Normalize case.
•replace(): Replace specific values.
•get_dummies(): Convert categorical variables to dummy variables (one-hot encoding).
•clip(): Cap values in a column.
•map(): Map categories to numerical values.
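A few of the summarized functions in action on a small illustrative DataFrame (the values are made up for the demo):

```python
import pandas as pd

df = pd.DataFrame({'City': ['  Kampala ', 'ENTEBBE', 'kampala'],
                   'Age': ['25', '30', '22']})

df['City'] = df['City'].str.strip().str.lower()  # str.strip() / str.lower()
df['Age'] = df['Age'].astype(int)                # astype()
df = df.replace({'kampala': 'Kampala'})          # replace()
print(df)
```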
PANDAS - DATA CORRELATIONS
In Pandas, data correlations are used to measure the relationship between two or
more variables.

Correlation helps determine how one variable might change in response to another.

METHODS TO COMPUTE CORRELATIONS

 Pearson correlation coefficient


 Spearman rank correlation
 Kendall tau correlation(Research About this)
1. Pearson Correlation (default)
The Pearson correlation coefficient measures the linear relationship between two variables. It
returns a value between -1 and 1:
•1 indicates a perfect positive linear relationship.
•-1 indicates a perfect negative linear relationship.
•0 indicates no linear relationship.
You can compute the Pearson correlation using corr() on a DataFrame:

import pandas as pd

# Sample DataFrame
data = {'X': [1, 2, 3, 4, 5],
        'Y': [5, 4, 3, 2, 1],
        'Z': [1, 1, 2, 2, 3]}

df = pd.DataFrame(data)

# Compute the Pearson correlation matrix
correlation_matrix = df.corr()

print(correlation_matrix)

Output:
          X         Y         Z
X  1.000000 -1.000000  0.970725
Y -1.000000  1.000000 -0.970725
Z  0.970725 -0.970725  1.000000

Here, you can see that X and Y have a perfect negative correlation (-1.0), while X and Z have a strong positive correlation (0.97).
2. Spearman Rank Correlation
The Spearman rank correlation measures the monotonic relationship between two
variables. Unlike Pearson, Spearman does not require a linear relationship and works well
for ordinal or non-linear data.
You can calculate the Spearman correlation by passing method='spearman' to corr()

# Compute the Spearman rank correlation
spearman_corr = df.corr(method='spearman')

print(spearman_corr)

Output:
          X         Y         Z
X  1.000000 -1.000000  1.000000
Y -1.000000  1.000000 -1.000000
Z  1.000000 -1.000000  1.000000

Here, the Spearman correlation between X and Y is -1, showing a perfect negative monotonic relationship.
RESEARCH ABOUT:
 Plotting
Note: "Use Knowledge from Matplotlib"