Pandas
Lesson Plan
Java + DSA
Introduction to Pandas
Pandas is an open-source Python library for handling and analyzing data. It offers data structures and
functions that make working with structured, tabular data simple and efficient. Pandas is built on top of
the NumPy library and extends it with labeled axes (rows and columns) and more versatile data processing
tools. It is a core tool in data science and data analysis.
Pandas Series
A Pandas Series is a one-dimensional labeled array that can hold data of various types, such as integers,
floats, strings, or even custom objects.
It is akin to a column in a spreadsheet or a single variable in statistics. Each element in a Series is assigned
an index label, allowing for easy data retrieval and manipulation.
Creating Series:
To create a Pandas Series, call “pd.Series()” on another data structure, such as a Python list or a NumPy
array.
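A minimal sketch of both conversions follows. The values and the index labels "a", "b", "c" are illustrative assumptions, not data from the lesson.

```python
import numpy as np
import pandas as pd

# From a Python list: pandas assigns a default integer index 0, 1, 2
from_list = pd.Series([10, 20, 30])

# From a NumPy array, with explicit (illustrative) index labels
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["a", "b", "c"])

print(from_list)
print(from_array["b"])  # look up a value by its label
```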
Data Alignment
One of the powerful features of Pandas Series is data alignment.
When performing operations on multiple Series, Pandas aligns data based on index labels, making it easy to
work with incomplete or differently indexed data.
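The alignment behavior can be sketched as below, assuming two small Series whose labels only partly overlap. The data is made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition matches values by label, not by position.
# Labels found in only one Series ("a" and "d") produce NaN.
total = s1 + s2

# add() with fill_value treats a label missing from one side as 0
total_filled = s1.add(s2, fill_value=0)

print(total)
print(total_filled)
```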
What is a DataFrame?
A DataFrame is akin to a spreadsheet or SQL table, with rows and columns that can hold a variety of data
types.
Each column in a DataFrame is essentially a Pandas Series, and all columns share a common index, which makes
it easier to manage and manipulate data efficiently.
There are several ways to create a DataFrame. Two commonly used approaches are:
● Creating it from a dict
● Reading custom external data
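Both approaches can be sketched as follows. The column names and values are assumptions for illustration, and the CSV text is held in memory with io.StringIO so the example runs without an external file; in practice read_csv would usually receive a file path.

```python
import io
import pandas as pd

# 1. From a dict: keys become column names, values become the columns
df_from_dict = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "age": [28, 35, 42],
})

# 2. From external data: read_csv accepts a path or a file-like object
csv_text = "name,age\nAsha,28\nBen,35\nChen,42\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_dict)
print(df_from_csv)
```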
Accessing Data:
A DataFrame supports many operations. Some common ways to access and manipulate its data are:
● Accessing a specific column
● Accessing a specific row
● Slicing rows and columns
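The three access patterns listed above might look like this on a small, made-up DataFrame (all names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dina"],
    "age": [28, 35, 42, 31],
    "city": ["Pune", "Delhi", "Agra", "Goa"],
})

ages = df["age"]            # a specific column, returned as a Series
second_row = df.iloc[1]     # a specific row, selected by integer position
subset = df.iloc[0:2, 0:2]  # first two rows and first two columns

print(ages)
print(second_row)
print(subset)
```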
PWSkills
Data Manipulations
Pandas provides numerous functions to manipulate data within DataFrames, such as filtering, sorting,
merging, and aggregating data.
Some common operations include:
● Filtering
● Sorting
● Merging DataFrames
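A short sketch of the three operations listed above, using two hypothetical tables (employees and departments) invented for this example:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["Asha", "Ben", "Chen", "Dina"],
    "dept_id": [10, 20, 10, 30],
})
departments = pd.DataFrame({"dept_id": [10, 20], "dept": ["Sales", "Tech"]})

# Filtering: keep rows where the condition is True
in_dept_10 = employees[employees["dept_id"] == 10]

# Sorting: order rows by a column, descending
by_name_desc = employees.sort_values("name", ascending=False)

# Merging: an inner join keeps only dept_id values present in both tables
merged = employees.merge(departments, on="dept_id", how="inner")

print(merged)
```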
Pandas primarily offers two fundamental data structures: Series and DataFrame.
Series: A Series is a one-dimensional array-like object that can hold various data types, including integers,
floats, and strings. It is essentially a labeled array and has an associated index. You can think of a Series as a
column in a spreadsheet.
DataFrame: A DataFrame is a two-dimensional table, similar to a spreadsheet or a SQL table. It consists of
rows and columns, and each column is a Series. DataFrames are the primary data structure for most Pandas
operations.
Summary Statistics:
Handling Missing Data: Pandas also offers robust tools for handling missing data, which is a common issue in
real-world datasets. You can use methods like “isna()”, “fillna()”, and “dropna()” to manage missing values
effectively.
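The three methods named above can be tried on a small DataFrame with deliberately missing entries. The data and the fill values chosen here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [21.0, np.nan, 23.5],
    "city": ["Pune", "Delhi", None],
})

missing_per_column = df.isna().sum()  # count missing values per column

# fillna: replace missing values, here with a per-column choice
filled = df.fillna({"temp": df["temp"].mean(), "city": "unknown"})

# dropna: keep only rows without any missing value
complete_rows = df.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```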
imported correctly.
Reading Data from Excel Files
Pandas can read data from Excel files, both in the older .xls format and the newer .xlsx format.
You can use the read_excel() function to extract data from specific sheets and to restrict which rows and
columns are read.
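As a hedged sketch, the helper below (read_sheet_range is a hypothetical name, not part of Pandas) wraps read_excel() with the parameters that select a sheet and a cell range. The file name and sheet name in the commented call are assumptions. Reading .xlsx files typically requires an engine package such as openpyxl to be installed.

```python
import pandas as pd

def read_sheet_range(path, sheet_name=0, usecols=None, skiprows=None, nrows=None):
    """Read part of one sheet of an Excel workbook into a DataFrame."""
    return pd.read_excel(
        path,
        sheet_name=sheet_name,  # sheet position (0-based) or sheet name
        usecols=usecols,        # e.g. "A:C" to read only columns A to C
        skiprows=skiprows,      # number of rows to skip at the top
        nrows=nrows,            # number of data rows to read
    )

# Hypothetical call, assuming a workbook "sales.xlsx" with a sheet "Q1":
# df = read_sheet_range("sales.xlsx", sheet_name="Q1", usecols="A:C", skiprows=2, nrows=10)
```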
Re-Indexing in Pandas
Reindexing is a fundamental operation in the Python Pandas library, allowing data scientists and analysts to
reshape, realign, and modify data structures like Series and DataFrames.
This process helps ensure data consistency and compatibility, which is crucial in various data
manipulation and analysis tasks.
What is Re-indexing?
Reindexing refers to the process of creating a new object with the data conformed to a new index.
In Pandas, you can reindex Series and DataFrames to rearrange data according to a different set of labels.
Labels in the new index that are absent from the original data are filled with missing values.
Re-indexing Syntax: To reindex a Pandas Series or DataFrame, you can use the reindex() method:
new_data = data.reindex(new_index)
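A runnable version of this syntax might look as follows. The Series contents and the new labels are illustrative assumptions.

```python
import pandas as pd

data = pd.Series([10, 20, 30], index=["a", "b", "c"])
new_index = ["c", "a", "d"]

# Rows are reordered to match new_index; "d" has no source value, so it becomes NaN
new_data = data.reindex(new_index)

# fill_value supplies a default for labels that were not in the original index
new_data_filled = data.reindex(new_index, fill_value=0)

print(new_data)
print(new_data_filled)
```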
Pandas Iteration
Vectorized Operations
Pandas is optimized for vectorized operations, meaning it performs operations on entire arrays rather than
on individual elements. This is usually much faster than iterating row by row in Python.
Minimize explicit iteration when working with Pandas to take full advantage of its speed and efficiency.
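To make the contrast concrete, the sketch below computes the same column once with a vectorized expression and once with an explicit loop. The price and qty columns are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 250.0, 80.0], "qty": [2, 1, 5]})

# Vectorized: one expression operates on whole columns at once
df["total"] = df["price"] * df["qty"]

# Explicit iteration gives the same numbers but carries Python-level
# overhead for every row, so it scales poorly on large tables
loop_totals = [row["price"] * row["qty"] for _, row in df.iterrows()]

print(df)
```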
Conditional iteration:
You can use boolean indexing to iterate through data based on specific conditions. This allows you to select
and work with subsets of your data.
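One way to do this is to build a boolean mask first and loop only over the rows it selects. The names and the pass mark of 50 are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ben", "Chen"], "score": [82, 45, 91]})

# The comparison yields a boolean Series; indexing with it keeps the True rows
passed = df[df["score"] >= 50]

# Iterate only over the selected subset
passed_names = []
for row in passed.itertuples(index=False):
    passed_names.append(row.name)

print(passed_names)
```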
Pandas Sorting:
Sorting data is a fundamental operation in data analysis and manipulation. Python, with its powerful data
manipulation library, Pandas, offers various methods to sort data efficiently.
Understanding Data Sorting
Data sorting involves arranging data in a specific order, making it
easier to identify patterns, make comparisons, and gain insights. In Pandas, data can be sorted either by index
or by values.
1. Sorting by Index
a. Pandas DataFrames and Series can be sorted based on their index labels.
b. This is particularly useful when the index represents meaningful labels or categories.
c. To sort by index, you can use the ‘sort_index()’ method. You can specify the axis (0 for rows, 1 for columns).
2. Sorting by Values
a. Sorting by values is common when you want to arrange the data based on the content of one or more
columns.
b. The ‘sort_values()’ method is used for this purpose. You can specify the column(s) to sort by, the axis, and
the sort order.
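Both kinds of sorting can be sketched on one small DataFrame. The labels and figures below are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Pune", "Agra", "Delhi"], "sales": [250, 400, 250]},
    index=["c", "a", "b"],
)

# Sorting by index: rows ordered by their labels
by_index = df.sort_index()
by_index_desc = df.sort_index(ascending=False)

# axis=1 sorts the column labels instead of the row labels
by_columns = df.sort_index(axis=1)

# Sorting by values: sales descending, ties broken by city ascending
by_values = df.sort_values(by=["sales", "city"], ascending=[False, True])

print(by_index)
print(by_values)
```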
Working With Text Data Options & Customization
● Pandas provides various string methods through the ‘str’ accessor for text data manipulation.
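A few commonly used ‘str’ methods are sketched below on made-up names; this is only a sample of what the accessor offers.

```python
import pandas as pd

names = pd.Series(["  alice smith", "BOB JONES ", "Carol White"])

cleaned = names.str.strip().str.title()      # trim spaces, then Title Case
has_smith = cleaned.str.contains("Smith")    # boolean mask per element
first_names = cleaned.str.split(" ").str[0]  # first token of each string
lengths = cleaned.str.len()                  # number of characters

print(cleaned)
print(first_names)
```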
Removing Duplicates: Duplicate rows can be removed based on the values in one or more text columns.
Text Normalization: Text normalization involves converting text to a standard form, such as converting all text
to lowercase or stemming (reducing words to their root form).
Text Data Manipulation and Customization: Text data manipulation involves operations such as
extracting specific information, transforming text, and customizing data according to requirements.
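The three ideas above can be combined in one short sketch. The email addresses are invented, and the regular expression is an assumption chosen for this example. Stemming is not shown, since it typically relies on a separate NLP library rather than on Pandas itself.

```python
import pandas as pd

df = pd.DataFrame({"email": ["Asha@Example.com", "asha@example.com ", "ben@test.org"]})

# Normalization: trim whitespace and convert to lowercase
df["email"] = df["email"].str.strip().str.lower()

# Removing duplicates: the first two rows are now identical
deduped = df.drop_duplicates(subset="email")

# Extraction: capture everything after "@" as a new column
deduped = deduped.assign(
    domain=deduped["email"].str.extract(r"@(.+)$", expand=False)
)

print(deduped)
```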
Label-based Indexing (using .loc[]): Using .loc[], you can access rows and columns based on their labels.
Position-based Indexing (using .iloc[]): With .iloc[], you can access rows and columns based on their
integer positions.
Boolean Indexing: Boolean indexing allows filtering data based on a certain condition.
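The three indexing styles can be compared side by side on a small DataFrame with string row labels. All labels and values here are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [28, 35, 42], "city": ["Pune", "Delhi", "Agra"]},
    index=["asha", "ben", "chen"],
)

# Label-based: row label and column label
by_label = df.loc["ben", "city"]
label_slice = df.loc["asha":"ben", ["age"]]  # label slices include both endpoints

# Position-based: integer row and column positions
by_position = df.iloc[0, 1]
position_slice = df.iloc[0:2]                # the end position is excluded

# Boolean: keep rows where the condition holds
over_30 = df[df["age"] > 30]

print(label_slice)
print(over_30)
```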
Basic Indexing and Selection in Pandas
In Pandas, basic indexing and selection involve accessing specific rows, columns, or elements from a
DataFrame using various methods.
Conditional Selection in Pandas: Conditional selection involves filtering data based on specific
conditions.
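A few common forms of conditional selection are sketched below. The rows are a small made-up sample that borrows iris-style column names; they are not the actual dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "setosa"],
    "petal_length": [1.4, 4.7, 6.0, 1.3],
})

# Single condition
long_petals = df[df["petal_length"] > 4.0]

# Multiple conditions: combine with & (and) or | (or), each in parentheses
long_virginica = df[(df["petal_length"] > 4.0) & (df["species"] == "virginica")]

# Membership test against a list of allowed values
selected = df[df["species"].isin(["setosa", "virginica"])]

# query() expresses the same filter as a string
via_query = df.query("petal_length > 4.0")

print(long_virginica)
```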
Output:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
Code:
print(iris_data.info())
Output:
<class 'pandas.core.frame.DataFrame'>
None
Code:
print(iris_data.describe())
Pandas offers a range of functions to compute descriptive statistics on DataFrame columns, providing insights
into the central tendency, dispersion, and shape of the dataset.
Central Tendency: Central tendency refers to the tendency of data to cluster around a central or
typical value within a dataset. It helps to identify a single representative value that best summarizes the
entire dataset. Common measures are the mean, median, and mode.
Dispersion (Variability): Dispersion, or variability, measures how far the data points spread out
from the central value (mean, median, or mode).
Common measures are the variance, standard deviation, range, and interquartile range (IQR).
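The measures named above map onto Series methods as sketched below. The five values are made up so the results are easy to check by hand; note that var() and std() use the sample formula (ddof=1) by default.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 5, 10])

# Central tendency
mean = values.mean()
median = values.median()
mode = values.mode().tolist()  # mode() can return several values

# Dispersion
variance = values.var()        # sample variance (divides by n - 1)
std = values.std()             # sample standard deviation
value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)

print(mean, median, mode)
print(variance, std, value_range, iqr)
```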
Output:
sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64
sepal_length    5.80
sepal_width     3.00
petal_length    4.35
petal_width     1.30
dtype: float64
sepal_length    0.828066
sepal_width     0.435866
petal_length    1.765298
petal_width     0.762238
dtype: float64
Aggregation Functions
Aggregation functions in pandas help summarize data based on certain conditions or groupings, allowing
computation of statistics like mean, sum, count, etc., on grouped data.
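Grouped aggregation can be sketched with groupby(). The rows below are an invented sample with iris-style column names, not the real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 6.0, 5.0, 5.5],
})

# One statistic per group
mean_by_species = df.groupby("species")["petal_length"].mean()

# Several statistics at once; each becomes a column of the result
summary = df.groupby("species")["petal_length"].agg(["mean", "sum", "count"])

print(mean_by_species)
print(summary)
```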
Outlier handling techniques involve detecting outliers and then removing or transforming them.
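As one possible approach, the sketch below flags outliers with the 1.5 × IQR rule, then shows removal and clipping as two ways of handling them. The rule and the sample values are assumptions chosen for illustration; the lesson does not prescribe a specific method.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])

# Detect: anything beyond 1.5 * IQR from the quartiles counts as an outlier here
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Handle, option 1: remove the flagged values
without_outliers = values[~is_outlier]

# Handle, option 2: transform by clipping to the allowed range
clipped = values.clip(lower=lower, upper=upper)

print(is_outlier)
print(clipped)
```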
Applying Functions to DataFrames
Pandas allows applying functions across the rows or columns of a DataFrame.
Functions can be applied using the ‘.apply()’, ‘.map()’, or ‘.applymap()’ methods for row-wise, column-wise, or
element-wise operations. Note that ‘DataFrame.applymap()’ is deprecated since Pandas 2.1 in favor of
‘DataFrame.map()’.
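The sketch below shows apply() along each axis and Series.map() for element-wise work. The column names and the label mapping are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# apply() with the default axis=0 passes each column to the function
col_ranges = df.apply(lambda col: col.max() - col.min())

# apply() with axis=1 passes each row to the function
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.map() works element by element, with a dict or a function
labels = df["a"].map({1: "low", 2: "mid", 3: "high"})
squared = df["a"].map(lambda x: x ** 2)

print(col_ranges)
print(row_sums)
```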
Rolling Window
Rolling windows are a fundamental aspect of window functions.
They involve calculating statistics or applying functions over a window of fixed size as it moves through the
data.
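A minimal rolling-window sketch, assuming a window of three observations and made-up sales figures:

```python
import pandas as pd

sales = pd.Series([10, 20, 30, 40, 50])

# Each result is the mean of the current value and the two before it.
# The first two positions lack a full window, so they are NaN.
rolling_mean = sales.rolling(window=3).mean()

print(rolling_mean)
```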
Window Statistics and Aggregations:
Window statistics and aggregations involve computing summary statistics within a window, which may include
mean, sum, min, max, standard deviation, and custom aggregations.
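Several of these statistics can be computed in one pass by handing a list of names to agg() on the rolling object. The window size of two and the data are assumptions for illustration.

```python
import pandas as pd

sales = pd.Series([10, 20, 30, 40, 50])

# Each named statistic becomes a column of the resulting DataFrame
stats = sales.rolling(window=2).agg(["mean", "sum", "min", "max", "std"])

print(stats)
```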
Window grouping involves applying operations within specific groups of data rather than the entire dataset.
This is achieved using Pandas' groupby() function along with window operations.
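One way to combine the two is groupby() followed by transform(), so that the window restarts inside every group and the result keeps the original row order. The store labels and sales values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "sales": [10, 20, 30, 100, 200, 300],
})

# The rolling mean is computed separately per store, so values from
# store A never leak into the windows of store B
df["rolling_mean"] = (
    df.groupby("store")["sales"]
      .transform(lambda s: s.rolling(window=2).mean())
)

print(df)
```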
Custom Window Functions:
Custom window functions involve defining and applying user-defined functions to windows in Pandas.
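As a sketch, a user-defined function can be passed to apply() on a rolling object. The helper value_range is a hypothetical name, and the prices are invented.

```python
import pandas as pd

def value_range(window):
    """Spread between the largest and smallest value in one window."""
    return window.max() - window.min()

prices = pd.Series([5, 9, 4, 8, 12])

# apply() calls value_range once for every complete window of three values
rolling_range = prices.rolling(window=3).apply(value_range)

print(rolling_range)
```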
Handling missing values within windows involves strategies like filling or interpolating missing data to ensure
the integrity of window-based calculations.
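The sketch below contrasts three options on a Series with one gap: the default behavior, relaxing the window requirement with min_periods, and interpolating before the window is applied. The data and the choice of linear interpolation are assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])

# Default: a window needs 2 valid observations, so the gap spreads NaN
strict = s.rolling(window=2).mean()

# min_periods=1 accepts windows that contain at least one valid value
lenient = s.rolling(window=2, min_periods=1).mean()

# Fill the gap by linear interpolation first, then roll
interpolated = s.interpolate().rolling(window=2).mean()

print(strict)
print(lenient)
print(interpolated)
```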
Pandas primarily uses two main data types for handling date and time information:
‘Timestamp’ Data Type
Represents a single point in time and is the fundamental type Pandas uses to work with dates and times.
Each Timestamp object supports nanosecond precision and is based on the datetime64 data type in
NumPy.
DatetimeIndex and PeriodIndex
DatetimeIndex is used to index Pandas data structures like Series and DataFrame by timestamps.
PeriodIndex represents a sequence of time periods, such as specific days, months, or quarters.
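The types described above can be created as follows. The dates and readings are illustrative assumptions.

```python
import pandas as pd

# A single point in time
ts = pd.Timestamp("2024-01-15 09:30:00")

# A DatetimeIndex used as the index of a Series
dt_index = pd.DatetimeIndex(["2024-01-01", "2024-01-02", "2024-01-03"])
readings = pd.Series([1.0, 2.0, 3.0], index=dt_index)

# A PeriodIndex of three consecutive months
months = pd.period_range(start="2024-01", periods=3, freq="M")

print(ts.year, ts.hour)
print(readings.loc[pd.Timestamp("2024-01-02")])
print(months)
```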
Date Range Creation
Creating date ranges is a common task when working with time series data.
Pandas provides the ‘pd.date_range()’ function for generating sequences of dates.
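A few typical calls are sketched below. The start dates and frequencies are assumptions chosen for illustration.

```python
import pandas as pd

# Every day between two dates, both endpoints included
daily = pd.date_range(start="2024-01-01", end="2024-01-07", freq="D")

# Four weekly dates that fall on Mondays
weekly = pd.date_range(start="2024-01-01", periods=4, freq="W-MON")

# Three business days starting on a Friday; the weekend is skipped
business = pd.date_range(start="2024-01-05", periods=3, freq="B")

print(daily)
print(weekly)
print(business)
```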
Working with Date and Time Components
Pandas provides the ‘dt’ accessor, allowing easy extraction of different components (like year, month, day,
etc.) from date and time objects.
This accessor gives access to various attributes and methods for datetime values in a Pandas Series or
DataFrame column.
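A short sketch of the ‘dt’ accessor on a Series of made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-15", "2024-06-30", "2024-12-25"]))

years = dates.dt.year                      # numeric components
months = dates.dt.month
days = dates.dt.day
weekday_names = dates.dt.day_name()        # method call, returns strings
formatted = dates.dt.strftime("%d/%m/%Y")  # custom text format

print(months)
print(weekday_names)
print(formatted)
```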
Pandas Timedelta:
Pandas' Timedelta represents a duration or difference in time. It's useful for performing arithmetic operations on
dates or time-related data, such as calculating time differences or adding/subtracting time intervals.
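A minimal sketch of creating a Timedelta and using it in date arithmetic, with illustrative values:

```python
import pandas as pd

delta = pd.Timedelta(days=2, hours=6)
start = pd.Timestamp("2024-03-01 08:00")

later = start + delta                    # add an interval to a timestamp
earlier = start - pd.Timedelta("90min")  # a Timedelta can also be parsed from text

print(later)
print(earlier)
print(delta.total_seconds())
```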
import pandas as pd
Timedelta can also be used to calculate the difference between two dates or time-related data.
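For example, subtracting one datetime column from another yields a column of Timedelta values. The orders table below is hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "ordered": pd.to_datetime(["2024-01-01", "2024-01-10"]),
    "delivered": pd.to_datetime(["2024-01-04", "2024-01-12"]),
})

# Subtracting datetime columns gives a timedelta column
orders["duration"] = orders["delivered"] - orders["ordered"]

# The dt accessor exposes components of each Timedelta, such as whole days
orders["days"] = orders["duration"].dt.days

print(orders)
```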
Timedelta is often used in time series operations, such as shifting or creating date offsets.
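The sketch below shows shifting a time series and applying offsets to its index. The series values and the chosen intervals are assumptions for illustration.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=3, freq="D")
s = pd.Series([10, 20, 30], index=idx)

# Shift the timestamps forward by one week; the values stay attached
shifted_index = s.shift(freq=pd.Timedelta(days=7))

# Shift the values down by one row; the index stays the same
shifted_values = s.shift(1)

# Offsets applied directly to the index
next_day = idx + pd.Timedelta(days=1)
month_later = idx + pd.DateOffset(months=1)  # calendar-aware offset

print(shifted_index)
print(shifted_values)
print(month_later)
```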