Pandas Module
PANDAS
Pandas is an open-source Python library.
It provides easy-to-use:
1. data structures
2. data analysis tools
PANDAS SERIES
What Is A Series?
A Pandas Series is essentially a one-dimensional labeled array capable of holding any data type (integers,
strings, floats, etc.).
Each element in the Series has an associated index (label), and you can access and manipulate these elements by
their index.
Key Features of a Pandas Series:
It's like a column in a DataFrame.
It has an index that labels each element.
It can hold various data types like integers, floats, or even strings.
Labels
If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second
value has index 1, and so on.
This label can be used to access a specified value.
With the index argument, you can name your own labels.
Note: When a Series is created from a dictionary, the keys of the dictionary become the labels.
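The three labeling cases above can be sketched as follows. This is a minimal illustration; the values and the labels "a", "b", "c" and the city names are made up for the example.

```python
import pandas as pd

# Default labels: 0, 1, 2, ...
usage = pd.Series([1.2, 0.8, 1.5])
first_value = usage[0]          # access by the default label 0

# Custom labels through the index argument (labels are illustrative)
usage_labeled = pd.Series([1.2, 0.8, 1.5], index=["a", "b", "c"])
value_b = usage_labeled["b"]    # access by the custom label "b"

# From a dictionary: the keys become the labels
usage_by_city = pd.Series({"Chicago": 1.5, "New York": 1.2})

print(first_value)
print(value_b)
print(usage_by_city)
```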
import pandas as pd

# data_used_GB, dates and userIDs are the lists defined in the DataFrame example

# Creating a Pandas Series for 'DataUsed_GB'
data_used_series = pd.Series(data_used_GB)

# Display the Series
print("Data Used Series:")
print(data_used_series)

# Creating a Pandas Series with custom index
dates_series = pd.Series(dates, index=userIDs)

# Display the Series with custom index
print("\nDates Series with User IDs as index:")
print(dates_series)

Data Used Series:
0    1.2
1    0.8
2    1.5
3    2.1
4    0.9
5    1.0
6    1.8
7    2.3
8    1.6
dtype: float64

Dates Series with User IDs as index:
UserID
101    2025-03-01
102    2025-03-01
103    2025-03-01
104    2025-03-02
105    2025-03-02
106    2025-03-02
107    2025-03-03
108    2025-03-03
109    2025-03-03
dtype: object
PANDAS DATAFRAME
A Pandas DataFrame is a two-dimensional labeled data structure, similar to a table in a
database or an Excel spreadsheet. It contains rows and columns, where each column can be of
a different data type (integers, floats, strings, etc.).
Locate Row
The DataFrame is like a table with rows and columns.
Pandas uses the loc (locate) attribute to return one or more specified rows.
import pandas as pd

# Data
userIDs = [101, 102, 103, 104, 105, 106, 107, 108, 109]
dates = ["2025-03-01", "2025-03-01", "2025-03-01",
         "2025-03-02", "2025-03-02", "2025-03-02",
         "2025-03-03", "2025-03-03", "2025-03-03"]
data_used_GB = [1.2, 0.8, 1.5, 2.1, 0.9, 1.0, 1.8, 2.3, 1.6]
call_duration_min = [45, 30, 60, 75, 40, 50, 65, 90, 55]
locations = ["New York", "San Francisco", "Chicago",
             "New York", "Los Angeles", "Chicago",
             "San Francisco", "Los Angeles", "New York"]
signal_strength_dBm = [-70, -65, -75, -68, -60, -72, -63, -62, -69]

# Creating DataFrame
data = {
    'UserID': userIDs,
    'Date': dates,
    'DataUsed_GB': data_used_GB,
    'CallDuration_min': call_duration_min,
    'Location': locations,
    'SignalStrength_dBm': signal_strength_dBm
}
df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)
   UserID        Date  DataUsed_GB  CallDuration_min       Location  SignalStrength_dBm
0     101  2025-03-01          1.2                45       New York                 -70
1     102  2025-03-01          0.8                30  San Francisco                 -65
2     103  2025-03-01          1.5                60        Chicago                 -75
3     104  2025-03-02          2.1                75       New York                 -68
4     105  2025-03-02          0.9                40    Los Angeles                 -60
5     106  2025-03-02          1.0                50        Chicago                 -72
6     107  2025-03-03          1.8                65  San Francisco                 -63
7     108  2025-03-03          2.3                90    Los Angeles                 -62
8     109  2025-03-03          1.6                55       New York                 -69
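As a minimal sketch of how loc selects rows, the example below rebuilds a three-row, three-column subset of the table above (the subset is chosen only to keep the example short). It shows selection by a single label, by a list of labels, and by a boolean condition.

```python
import pandas as pd

# Small subset of the telecom data, enough to demonstrate loc
df = pd.DataFrame({
    "UserID": [101, 102, 103],
    "DataUsed_GB": [1.2, 0.8, 1.5],
    "Location": ["New York", "San Francisco", "Chicago"],
})

row0 = df.loc[0]                          # one label: returns a Series
rows01 = df.loc[[0, 1]]                   # list of labels: returns a DataFrame
heavy = df.loc[df["DataUsed_GB"] > 1.0]   # boolean mask: rows matching a condition

print(row0)
print(rows01)
print(heavy)
```

Passing a list of labels (double brackets) always gives back a DataFrame, even for a single row, which is often easier to work with than a Series.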
CSV Files/JSON Files
Pandas provides an easy way to read CSV (Comma Separated Values) files using the read_csv() function.
This allows you to load data from CSV files into a Pandas DataFrame for easy manipulation and analysis.
import pandas as pd
First 3 rows:
   UserID        Date  DataUsed_GB  CallDuration_min     Location  SignalStrength_dBm
0     101  2025-03-01          1.2                45     New York                 -70
1     102  2025-03-01          0.8                30  Santa Clara                 -65
2     103  2025-03-01          1.5                60      Chicago                 -75

Selected columns:
   UserID  DataUsed_GB
0     101          1.2
1     102          0.8
2     103          1.5
3     104          2.1
4     105          0.9
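Output like the above could come from a script along these lines. This is a sketch under assumptions: the slide does not name the CSV file, so the data is read from an in-memory string instead, and rows 104 and 105 are filled in from the earlier DataFrame example. With a real file you would pass its path to read_csv() instead of the StringIO object.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (contents assumed for illustration)
csv_text = """UserID,Date,DataUsed_GB,CallDuration_min,Location,SignalStrength_dBm
101,2025-03-01,1.2,45,New York,-70
102,2025-03-01,0.8,30,Santa Clara,-65
103,2025-03-01,1.5,60,Chicago,-75
104,2025-03-02,2.1,75,New York,-68
105,2025-03-02,0.9,40,Los Angeles,-60
"""
df = pd.read_csv(io.StringIO(csv_text))

print("First 3 rows:")
print(df.head(3))

print("Selected columns:")
selected = df[["UserID", "DataUsed_GB"]]
print(selected)

# JSON data is loaded the same way with read_json()
json_text = df.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text))
print(df_json.shape)
```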
MAX_ROWS
When a DataFrame is printed to the console, Pandas shows only a limited number of rows. The limit is
controlled by the display.max_rows option (60 by default); a DataFrame with more rows than this is printed
in truncated form, with the middle rows replaced by "...".
To show more (or fewer) rows, change the max_rows setting before printing the DataFrame.
import pandas as pd
# Create a sample DataFrame with more rows
data = {'UserID': range(1, 101), 'Name': ['User'+str(i) for i in range(1, 101)]}
df = pd.DataFrame(data)
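The slide stops before the option is actually changed. One way to do it, sketched below, is pd.option_context(), which applies the setting only inside the with block; pd.set_option("display.max_rows", 100) would change it for the rest of the session. The limits 10 and 100 are arbitrary values picked for the demonstration.

```python
import pandas as pd

# Same 100-row sample DataFrame as above
data = {'UserID': range(1, 101), 'Name': ['User' + str(i) for i in range(1, 101)]}
df = pd.DataFrame(data)

# Current limit (60 unless it has been changed)
print(pd.get_option("display.max_rows"))

# With a low limit the printed DataFrame is truncated
with pd.option_context("display.max_rows", 10):
    short_view = repr(df)

# With the limit raised to 100, every row is printed
with pd.option_context("display.max_rows", 100):
    full_view = repr(df)

print(short_view)
print(len(full_view.splitlines()))
```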
3. Renaming Columns
If the column names are not meaningful or need to be changed, you can rename them using rename().
# Rename columns
df.rename(columns={'old_name': 'new_name', 'old_column': 'new_column'}, inplace=True)
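A small runnable version of the same idea, assuming a DataFrame whose original column names ("uid", "gb") are invented for the example. It also shows that rename() leaves the original untouched unless inplace=True is passed.

```python
import pandas as pd

# Hypothetical DataFrame with short, unclear column names
df = pd.DataFrame({"uid": [101, 102], "gb": [1.2, 0.8]})

# Without inplace=True, rename() returns a new DataFrame
renamed = df.rename(columns={"uid": "UserID", "gb": "DataUsed_GB"})

print(list(renamed.columns))
print(list(df.columns))   # original names are unchanged
```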
CLEANING EMPTY CELLS
Cleaning empty cells (missing values) is a crucial step when preparing data for analysis. Empty cells
can be represented in different ways, such as NaN, None, or an empty string "".
1. Identifying Empty Cells
To check for empty or missing cells, use isnull() or isna() (the two are aliases). They
return a DataFrame of the same shape as the original, with True where a value is
missing and False elsewhere. Note that NaN and None count as missing, but an empty
string "" does not, so empty strings must be converted first.
import pandas as pd
df = pd.DataFrame(data)
print(df_cleaned)
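The fragment above omits both the data and the cleaning step. A minimal sketch of the full flow might look like this, assuming a small made-up DataFrame and assuming that df_cleaned is produced with dropna() (the slide does not say which method was used).

```python
import numpy as np
import pandas as pd

# Made-up data containing NaN, None and an empty string
df = pd.DataFrame({
    "UserID": [101, 102, 103, 104],
    "DataUsed_GB": [1.2, np.nan, 1.5, None],
    "Location": ["New York", "", None, "Chicago"],
})

# Count missing values per column; "" is not counted yet
missing_per_column = df.isna().sum()
print(missing_per_column)

# Turn empty strings into NaN so they are treated as missing too
df_norm = df.copy()
df_norm["Location"] = df_norm["Location"].replace("", np.nan)

# Option 1: drop every row that has at least one missing value
df_cleaned = df_norm.dropna()
print(df_cleaned)

# Option 2: fill the gaps instead of dropping rows
df_filled = df_norm.fillna({
    "DataUsed_GB": df_norm["DataUsed_GB"].mean(),
    "Location": "Unknown",
})
print(df_filled)
```

Dropping rows is simplest but discards data; filling keeps every row at the cost of introducing estimated values.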
RESEARCH ABOUT:
Cleaning Wrong Format
Cleaning Wrong Data
Removing Duplicates
Summary of Key Pandas Functions for Cleaning Data:
• isnull() / isna(): Check for missing values.
• dropna(): Drop rows or columns with missing values.
• fillna(): Fill missing values with a constant, method, or statistic.
• drop_duplicates(): Remove duplicate rows.
• rename(): Rename columns.
• astype(): Convert data types.
• str.strip(): Remove leading/trailing spaces.
• str.lower() / str.upper(): Normalize case.
• replace(): Replace specific values.
• get_dummies(): Convert categorical variables to dummy variables (one-hot encoding).
• clip(): Cap values in a column.
• map(): Map categories to numerical values.
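Several of the functions in this summary can be chained on one small example. The sketch below uses invented data (including the "Plan" column, the "nyc" abbreviation and the 5.0 GB cap), so treat the specific values as assumptions rather than part of the course dataset.

```python
import pandas as pd

# Invented messy data: a duplicate row, IDs stored as text,
# inconsistent spacing and case, and one extreme value
df = pd.DataFrame({
    "UserID": ["101", "102", "102", "103"],
    "Location": [" New York", "chicago ", "chicago ", "CHICAGO"],
    "Plan": ["basic", "premium", "premium", "basic"],
    "DataUsed_GB": [1.2, 0.8, 0.8, 9.9],
})

df = df.drop_duplicates()                                      # remove the repeated row
df["UserID"] = df["UserID"].astype(int)                        # text -> integer
df["Location"] = df["Location"].str.strip().str.lower()        # trim spaces, normalize case
df["Location"] = df["Location"].replace({"new york": "nyc"})   # replace a specific value
df["DataUsed_GB"] = df["DataUsed_GB"].clip(upper=5.0)          # cap values at 5.0
df["PlanCode"] = df["Plan"].map({"basic": 0, "premium": 1})    # category -> number
dummies = pd.get_dummies(df["Location"])                       # one-hot encoding

print(df)
print(dummies)
```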
PANDAS - DATA CORRELATIONS
In Pandas, correlations measure the strength and direction of the relationship between
pairs of numeric variables, on a scale from -1 to 1.
Correlation helps determine how one variable tends to change as another changes.
import pandas as pd

# Sample DataFrame
data = {'X': [1, 2, 3, 4, 5],
        'Y': [5, 4, 3, 2, 1],
        'Z': [1, 1, 2, 2, 3]}

df = pd.DataFrame(data)

# corr() computes the Pearson correlation by default
correlation_matrix = df.corr()
print(correlation_matrix)

          X         Y         Z
X  1.000000 -1.000000  0.944911
Y -1.000000  1.000000 -0.944911
Z  0.944911 -0.944911  1.000000

Here, you can see that X and Y have a perfect negative correlation (-1.0), while X and Z have a
strong positive correlation (0.94).
2. Spearman Rank Correlation
The Spearman rank correlation measures the monotonic relationship between two
variables. Unlike Pearson, Spearman does not require a linear relationship and works well
for ordinal or non-linear data.
You can calculate the Spearman correlation by passing method='spearman' to corr().
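To illustrate the difference from Pearson, the sketch below uses a made-up column "Cubed" that grows with X in a monotonic but non-linear way. Spearman, which works on ranks, reports a perfect relationship, while Pearson reports a value below 1 because the relationship is not a straight line.

```python
import pandas as pd

# Monotonic but non-linear relationship: Cubed = X ** 3 (illustrative data)
df = pd.DataFrame({"X": [1, 2, 3, 4, 5]})
df["Cubed"] = df["X"] ** 3

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print("Pearson:")
print(pearson)
print("Spearman:")
print(spearman)
```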