
Pandas (1)

The document is a lesson plan for teaching the Pandas library in Python, covering its data structures, including Series and DataFrames, and various operations such as data manipulation, reading and writing data, and handling missing values. It also discusses advanced topics like re-indexing, iteration, and text data manipulation. The plan emphasizes practical applications and statistical analysis using Pandas.

Uploaded by

Tushar Agrawal

Lesson Plan

Pandas
Introduction to Pandas

Pandas is an open-source Python library for handling and analyzing data. It offers data structures and
functions that make working with structured, tabular data simple and efficient. Pandas is built on top of the
NumPy library and extends it with labeled axes (rows and columns) and more versatile data processing tools.
It is a core tool in data science and data analysis.

Pandas Series
A Pandas Series is a one-dimensional labeled array that can hold data of various types, such as integers,
floats, strings, or even custom objects.
It is akin to a column in a spreadsheet or a single variable in statistics. Each element in a Series has
an index label, allowing for easy data retrieval and manipulation.

Creating Series:

To create a Pandas Series, call “pd.Series()” on another data structure, such as a Python list or a NumPy
array.
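
A minimal sketch of both conversions. The values and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a Python list: pandas assigns a default integer index (0, 1, 2).
from_list = pd.Series([10, 20, 30])

# From a NumPy array, with custom index labels supplied explicitly.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["a", "b", "c"])

print(from_list)
print(from_array)
```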

Accessing Data or Information in Series


Once a Series is available, its values can be read with “.values” and its index labels with “.index”.
Pandas supports indexing and slicing on a Series, using either integer-based positions or label-based indexing.
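
The sketch below illustrates these access patterns on a small, made-up Series. It uses the “.loc” and “.iloc” accessors to keep label-based and position-based access explicit.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

print(s.values)  # the underlying array of values
print(s.index)   # the index labels

by_label = s.loc["b"]          # label-based lookup
by_position = s.iloc[1]        # integer-position lookup
label_slice = s.loc["b":"c"]   # label slices include both endpoints
position_slice = s.iloc[1:3]   # position slices exclude the end position
```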

Data Alignment
One of the powerful features of a Pandas Series is data alignment.
When performing operations on multiple Series, Pandas aligns data based on index labels, making it easy to
work with incomplete or differently indexed data.

What is a DataFrame?
A DataFrame is akin to a spreadsheet or SQL table, with rows and columns that can hold a variety of data
types.
Each column in a DataFrame is essentially a Pandas Series, and all columns share a common index, which makes
the data easier to manage and manipulate.
There are several ways to create a DataFrame. Two commonly used approaches are:
Creating it from a dictionary
Reading data from an external source
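
A minimal sketch of the dictionary approach, with made-up names and values:

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
data = {
    "name": ["Asha", "Ben", "Chen"],
    "age": [28, 35, 42],
    "city": ["Pune", "Leeds", "Taipei"],
}
df = pd.DataFrame(data)

print(df)
print(type(df["age"]))  # a single column is a Series
```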

Accessing Data:

A DataFrame supports many operations for accessing and manipulating data. Some common ones are:
Accessing a specific column
Accessing a specific row
Slicing rows and columns
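
The three operations can be sketched as follows. The DataFrame and its row labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "age": [28, 35, 42]},
    index=["r1", "r2", "r3"],
)

ages = df["age"]                       # one column, returned as a Series
second_row = df.loc["r2"]              # one row by label, returned as a Series
first_two = df.iloc[0:2]               # slice of rows by position
subset = df.loc["r1":"r2", ["name"]]   # slice of rows plus selected columns
```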

Java + DSA
PWSkills
Data Manipulations
Pandas provides numerous functions to manipulate data within DataFrames, such as filtering, sorting,
merging, and aggregating data.
Some common operations include:
Filtering
Sorting
Merging DataFrames
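
A short sketch of these three operations, using two hypothetical tables that share an emp_id column:

```python
import pandas as pd

employees = pd.DataFrame(
    {"emp_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"], "age": [28, 35, 42]}
)
salaries = pd.DataFrame({"emp_id": [1, 2, 3], "salary": [50000, 62000, 58000]})

over_30 = employees[employees["age"] > 30]                  # filtering
by_age_desc = employees.sort_values("age", ascending=False)  # sorting
merged = employees.merge(salaries, on="emp_id")              # merging on a key
```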

Data Structures in Pandas

Pandas primarily offers two fundamental data structures: Series and DataFrame.
Series: A Series is a one-dimensional array-like object that can hold various data types, including integers,
floats, and strings. It is essentially a labeled array and has an associated index. You can think of a Series as a
column in a spreadsheet.
DataFrame: A DataFrame is a two-dimensional table, similar to a spreadsheet or a SQL table. It consists of
rows and columns, and each column is a Series. DataFrames are the primary data structure for most Pandas
operations.

Basic DataFrame Operations


Accessing Data: You can access specific columns and rows using column labels and row indices,
respectively.
Filtering Data: Filtering allows you to extract specific rows based on conditions.
Adding and Modifying Data: You can add new columns and modify existing ones by assigning to them.
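
Adding and modifying columns can be sketched like this. The column names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ben"], "age": [28, 35]})

df["age_next_year"] = df["age"] + 1    # add a column derived from another
df["name"] = df["name"].str.upper()    # overwrite an existing column
df.loc[df["age"] > 30, "age"] = 36     # modify only the rows matching a condition
```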

Summary Statistics:

Pandas provides functions to calculate summary statistics, including:


Mean
Median
Mode

Handling Missing Data: Pandas also offers robust tools for handling missing data, which is a common issue in
real-world datasets. You can use methods like “isna()”, “fillna()”, and “dropna()” to manage missing values
effectively.

Reading and Writing Data


Pandas can read data from various file formats, such as CSV, Excel, and SQL databases, using functions like
“read_csv()”, “read_excel()”, and “read_sql()”.
It can also write DataFrames to these formats using functions like “to_csv()” and “to_excel()”.

Reading Data from Various File Formats

Reading Data from CSV Files


CSV (Comma-Separated Values) files are one of the most common data formats.
Pandas makes it easy to read data from CSV files using the read_csv() function.
You can specify various options such as delimiter, encoding, and header rows to ensure the data is
imported correctly.
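
A minimal sketch of read_csv() with explicit options. An in-memory string stands in for a file so the example is self-contained; the semicolon-delimited content is made up.

```python
import io

import pandas as pd

# Stand-in for a file on disk.
csv_text = "id;name;score\n1;Asha;90\n2;Ben;85\n"

df = pd.read_csv(
    io.StringIO(csv_text),  # a path such as "data.csv" can be passed instead
    sep=";",                # delimiter used in the file
    header=0,               # row number that holds the column names
)
# For a real file, encoding="utf-8" (or another codec) can be passed as well.
print(df)
```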

Reading Data from Excel Files
Pandas can read data from Excel files, both in the older .xls format and the newer .xlsx format.
You can use the read_excel() function to extract data from specific sheets and set custom data ranges.

Reading Data from JSON Files


JSON (JavaScript Object Notation) is a widely used format for data exchange.
With Pandas, you can read JSON files using the read_json() function.
Nested JSON structures can also be converted into DataFrames, typically by flattening them with json_normalize().
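
The sketch below reads a flat JSON array with read_json() and flattens nested records with pd.json_normalize(). The records are made up, and an in-memory string stands in for a file.

```python
import io

import pandas as pd

json_text = '[{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ben"}]'
flat = pd.read_json(io.StringIO(json_text))

# Nested records: json_normalize expands inner objects into dotted column names.
records = [
    {"id": 1, "address": {"city": "Pune", "zip": "411001"}},
    {"id": 2, "address": {"city": "Leeds", "zip": "LS1"}},
]
nested = pd.json_normalize(records)

print(flat)
print(nested.columns.tolist())
```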

Reading Data from SQL Databases


Pandas provides built-in support for reading data from SQL databases.
You can use the read_sql() function to retrieve data from a wide range of SQL databases, including
PostgreSQL, MySQL, SQLite, and more.
This enables you to work with structured data stored in databases directly.
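
A self-contained sketch using an in-memory SQLite database from the standard library. The table and its rows are hypothetical; other databases are usually reached through a SQLAlchemy connection instead.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database so the example needs no external server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
conn.commit()

# Run a query and receive the result set as a DataFrame.
users = pd.read_sql("SELECT id, name FROM users ORDER BY id", conn)
conn.close()

print(users)
```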

Reading Data from Parquet Files


Parquet is a columnar storage format often used in big data environments.
Pandas can read data from Parquet files using the read_parquet() function, which is particularly useful
when dealing with large datasets efficiently.

Reading Data from HTML Tables


Sometimes, you may need to scrape data from web pages. Pandas allows you to read HTML tables from web
pages with the read_html() function.
This feature is especially handy for web scraping and data extraction.

Reading Data from HDF5 Files


HDF5 is a versatile data format used in scientific computing and data storage.
Pandas supports reading data from HDF5 files through the read_hdf() function, making it accessible for
scientific data analysis.

Re-Indexing in Pandas
Reindexing is a fundamental operation in the Python Pandas library, allowing data scientists and analysts to
reshape, realign, and modify data structures like Series and DataFrames.
This process helps ensure data consistency and compatibility, which is crucial in various data
manipulation and analysis tasks.

What is Re-indexing?
Reindexing refers to the process of creating a new object with the data conformed to a new index.
In Pandas, you can reindex both Series and DataFrames to rearrange data according to a different set of
labels. Labels that were not in the original index are filled with missing values (NaN).

Re-indexing Syntax: To reindex a Pandas Series or DataFrame, you can use the reindex() method. The syntax for
reindexing is as follows:

new_data = data.reindex(new_index)

Re-indexing can be used in the following ways:


Pandas Series Re-indexing:
Reindexing a Series creates a new Series object with the data conformed to a new index.
This lets you rearrange data, or introduce missing values for new labels, while working exclusively
with a Series object.

Pandas DataFrame Re-indexing:
Reindexing can also be applied to Pandas DataFrames. You can reindex rows (axis=0) and columns
(axis=1) independently.

Handling Missing Values using Re-Indexing:
Reindexing also provides options for handling missing data.
You can use the method parameter to specify how missing values should be filled.
Common methods include 'ffill' for forward filling, 'bfill' for backward filling, and 'nearest'.
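
A short sketch of these cases on small, made-up objects:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

# Labels missing from the original index become NaN.
plain = s.reindex(range(5))

# method="ffill" carries the last valid value forward into the new labels.
forward = s.reindex(range(5), method="ffill")

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
rows = df.reindex(["y", "x", "z"])           # reorder rows, add an empty row "z"
cols = df.reindex(columns=["b", "a", "c"])   # reorder columns, add an empty "c"
```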

Pandas Iteration

Iterating through DataFrames


DataFrames are one of the primary data structures in Pandas.
You can iterate through the rows using methods like “iterrows()” and “itertuples()”.
“iterrows()” returns an iterator that yields index and row data as Pandas Series. However, it is not the most
efficient method, especially for large DataFrames.
“itertuples()” is faster than “iterrows()” and returns an iterator of named tuples, which can be more
memory-efficient.
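
Both iteration methods, sketched on a hypothetical two-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ben"], "age": [28, 35]})

# iterrows() yields (index, row) pairs, where each row is a Series.
names_from_iterrows = [row["name"] for _, row in df.iterrows()]

# itertuples() yields namedtuples; fields are accessed as attributes.
ages_from_itertuples = [row.age for row in df.itertuples(index=False)]
```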

Iterating through Series


Series objects can be iterated using a simple ‘for’ loop, treating them like lists or arrays.
You can use the ‘.items()’ method to iterate through (index, value) pairs in a Series, which is useful for
dictionary-like operations. Older tutorials use ‘.iteritems()’, which was removed in pandas 2.0.

Vectorized Operations
Pandas is optimized for vectorized operations, meaning it performs operations on entire arrays rather than
individual elements. This is much faster than traditional iteration.
Minimize explicit iteration when working with Pandas to take full advantage of its speed and efficiency.
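
To make the contrast concrete, the sketch below computes the same result once with a vectorized expression and once with an explicit loop. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Vectorized: one expression operates on whole columns at once.
df["total"] = df["price"] * df["qty"]

# Equivalent explicit loop, shown only for comparison; it is slower on large data.
loop_total = [row.price * row.qty for row in df.itertuples()]
```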

Using ‘.apply()’ and ‘.applymap()’


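The ‘.apply()’ method allows you to apply a function along an axis of a DataFrame, or to each value of a
Series. It is particularly useful for custom operations.
The ‘.applymap()’ method is similar but works element-wise on DataFrames. In pandas 2.1 and later it is
available as ‘DataFrame.map()’, and ‘.applymap()’ is deprecated.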
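
A sketch of column-wise, row-wise, and element-wise application. Because the element-wise method name depends on the pandas version, the code picks whichever one is available; that lookup is a convenience for this example, not a required pattern.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_range = df.apply(lambda col: col.max() - col.min())       # once per column (axis=0)
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)   # once per row

# Element-wise: DataFrame.map exists from pandas 2.1; older releases use applymap.
elementwise = getattr(df, "map", None) or df.applymap
squared = elementwise(lambda x: x ** 2)
```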

Conditional iteration:

You can use boolean indexing to iterate through data based on specific conditions. This allows you to select
and work with subsets of your data.

Pandas Sorting:

Sorting data is a fundamental operation in data analysis and manipulation. Python, with its powerful data
manipulation library, Pandas, offers various methods to sort data efficiently.

Understanding Data Sorting: Data sorting involves arranging data in a specific order, making it
easier to identify patterns, make comparisons, and gain insights. In Pandas, data can be sorted either by index
or by values.

The main approaches are:

1. Sorting by Index

a. Pandas DataFrames and Series can be sorted based on their index labels.

b. This is particularly useful when the index represents meaningful labels or categories, and you want to
reorder the data accordingly.

c. To sort by index, you can use the ‘sort_index()’ method. You can specify the axis (0 for rows, 1 for columns)
and the sorting order (ascending or descending).

2. Sorting by Values

a. Sorting by values is common when you want to arrange the data based on the content of one or more
columns.

b. The ‘sort_values()’ method is used for this purpose. You can specify the column(s) to sort by, the axis, and
the sorting order.

3. Multi-level Index Sorting

a. Pandas also supports sorting with multi-level indexes. You can specify the levels to sort and the sorting
order to suit your analysis requirements.

b. The ‘sort_index()’ method can be used to achieve this.
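
The three approaches can be sketched as follows, with made-up data and level names:

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Pune", "Leeds", "Taipei"], "temp": [31, 12, 27]},
    index=["c", "a", "b"],
)

by_index = df.sort_index()                           # rows ordered by index label
by_value = df.sort_values("temp", ascending=False)   # rows ordered by a column

# Multi-level index: sort on a chosen level, or on all levels in order.
multi = pd.DataFrame(
    {"sales": [5, 3, 8, 1]},
    index=pd.MultiIndex.from_tuples(
        [("B", 2), ("A", 2), ("B", 1), ("A", 1)], names=["group", "week"]
    ),
)
by_week = multi.sort_index(level="week")
by_all_levels = multi.sort_index()
```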

Working With Text Data Options & Customization

Common Operations for Text Data in Pandas

● Pandas provides various string methods for text data manipulation through the ‘str’ accessor.

● You can access these methods on a Series using the ‘.str’ attribute, for example ‘s.str.lower()’.
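
A few common ‘.str’ operations, sketched on a hypothetical Series of names that includes a missing value:

```python
import pandas as pd

names = pd.Series(["  Alice Smith", "bob JONES ", None])

cleaned = names.str.strip().str.title()        # trim whitespace, normalize case
has_smith = cleaned.str.contains("Smith")      # missing values stay missing
lengths = cleaned.str.len()                    # number of characters per entry
first_names = cleaned.str.split(" ").str[0]    # first token of each entry
```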


Data Cleaning and Preprocessing Text Data:


Data cleaning and preprocessing are crucial steps in text data analysis. This involves handling missing values,
removing duplicates, and normalizing text.
Handling Missing Values: Pandas provides methods like ‘isnull()’ and ‘dropna()’ to handle
missing values in text data.

Removing Duplicates: Use ‘drop_duplicates()’ to remove duplicate rows based on text data.

Text Normalization: Text normalization involves converting text to a standard form, such as converting all text
to lowercase or stemming (reducing words to their root form).
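
The three cleaning steps can be sketched together. The comments column is made up, and stemming is left out because it needs an external NLP library rather than Pandas itself.

```python
import pandas as pd

df = pd.DataFrame({"comment": ["Great Product", "great product", None, "  Too slow "]})

missing_count = df["comment"].isnull().sum()
no_missing = df.dropna(subset=["comment"]).copy()

# Normalize first (trim, lowercase) so near-duplicates compare equal.
no_missing["comment"] = no_missing["comment"].str.strip().str.lower()
deduplicated = no_missing.drop_duplicates(subset=["comment"])
```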

Text Data Manipulation and Customization: Text data manipulation involves various operations such as
extracting specific information, transforming text, and customizing data according to requirements.


Advanced Text Data Handling in Pandas


Advanced text data handling involves tokenization, vectorization, and dealing with multi-indexing to manage
and process textual information more efficiently.


Understanding Indexing and Selection


Indexing with square brackets [ ]: Pandas allows selecting specific elements, rows, or columns using square
brackets.
Selection using .loc[] and .iloc[]:
loc[] is used for label-based indexing, accessing rows and columns by their labels.
iloc[] is used for position-based indexing, accessing rows and columns by their integer position.
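
The difference between labels and positions is easiest to see when the index labels are integers that do not start at 0. A sketch with a hypothetical index of 10, 20, 30:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "age": [28, 35, 42]},
    index=[10, 20, 30],
)

ages = df["age"]                 # square brackets select a column
by_label = df.loc[20, "name"]    # row labeled 20
by_position = df.iloc[0, 0]      # first row, first column
label_slice = df.loc[10:20]      # label slices include the end label
position_slice = df.iloc[0:1]    # position slices exclude the end position
```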


Indexing Methods in Pandas

Label-based Indexing (using .loc[]): Using .loc[], you can access rows and columns based on their labels.

Position-based Indexing (using .iloc[]): With .iloc[], you can access rows and columns based on their
integer positions.

Boolean Indexing: Boolean indexing allows filtering data based on a certain condition.

Basic Indexing and Selection in Pandas

In Pandas, basic indexing and selection involve accessing specific rows, columns, or elements from a
DataFrame using various methods.

Conditional Selection in Pandas: Conditional selection involves filtering data based on specific
conditions.
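
A sketch of conditional selection with single and combined conditions, on made-up data. Each condition is wrapped in parentheses because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dee"],
        "age": [28, 35, 42, 19],
        "city": ["Pune", "Leeds", "Pune", "Oslo"],
    }
)

adults_in_pune = df[(df["age"] >= 21) & (df["city"] == "Pune")]   # AND
young_or_leeds = df[(df["age"] < 21) | (df["city"] == "Leeds")]   # OR
selected_cities = df[df["city"].isin(["Oslo", "Leeds"])]          # membership test
names_over_30 = df.loc[df["age"] > 30, "name"]                    # condition plus column
```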


Statistics with pandas


Loading Data:
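
The loading step itself is not shown in this lesson. As a stand-in, the sketch below rebuilds the first five rows printed in the output as a small DataFrame. The name iris_sample is hypothetical; the later snippets use iris_data, which holds the full 150-row dataset.

```python
import pandas as pd

# First five rows copied from the output in this lesson. The full dataset has
# 150 rows and would normally be loaded from a file or a dataset library.
iris_sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
        "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
        "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
        "species": ["setosa"] * 5,
    }
)

# Display the first rows of the DataFrame
print(iris_sample.head())
```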

Output:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

# Basic information about the DataFrame
print(iris_data.info())

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None

# Summary statistics of numerical columns
print(iris_data.describe())

Output:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Descriptive Statistics with pandas:

Pandas offers a range of functions to compute descriptive statistics on DataFrame columns, providing insights
into the central tendency, dispersion, and shape of the dataset.
Central Tendency: Central tendency refers to the tendency of data to cluster around a central or typical
value within a dataset. It helps to identify a single representative value that best summarizes the
entire dataset. Common measures are the mean, median, and mode.
Dispersion (Variability): Dispersion, or variability, measures how far the data points spread out
from the central value (mean, median, or mode). It indicates how scattered the data is.
Common measures are the variance, standard deviation, range, and interquartile range (IQR).
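
These measures can be sketched on a small sample. The two columns below reuse the five rows printed earlier, so the numbers differ from the full 150-row results shown in this lesson; the name measurements is hypothetical.

```python
import pandas as pd

measurements = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    }
)

# Central tendency
means = measurements.mean()
medians = measurements.median()
modes = measurements.mode()       # a DataFrame, since there may be several modes

# Dispersion
variances = measurements.var()    # sample variance (ddof=1)
std_devs = measurements.std()     # sample standard deviation
ranges = measurements.max() - measurements.min()
iqr = measurements.quantile(0.75) - measurements.quantile(0.25)
```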

Output (mean, median, and standard deviation of each numeric column, in that order):

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

sepal_length    5.80
sepal_width     3.00
petal_length    4.35
petal_width     1.30
dtype: float64

sepal_length    0.828066
sepal_width     0.435866
petal_length    1.765298
petal_width     0.762238
dtype: float64

Output:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Aggregation Functions

Aggregation functions in pandas help summarize data based on certain conditions or groupings, allowing
computation of statistics like mean, sum, count, etc., on grouped data.
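
A sketch of grouped and ungrouped aggregation. The four rows are made-up illustrative values, not taken from the dataset used in this lesson.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica", "virginica"],
        "sepal_length": [5.1, 4.9, 6.3, 5.8],
        "petal_length": [1.4, 1.4, 6.0, 5.1],
    }
)

# Group rows by species, then compute one statistic per group.
mean_by_species = df.groupby("species")["sepal_length"].mean()

# agg() accepts a dict mapping each column to the functions to apply.
summary = df.groupby("species").agg(
    {"sepal_length": ["mean", "max"], "petal_length": ["sum", "count"]}
)

# Without grouping, the result has one row per function and NaN wherever
# a function was not requested for a column.
overall = df.agg({"sepal_length": ["mean", "max"], "petal_length": ["sum"]})
```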

Output:
        sepal length (cm)  sepal width (cm)  petal length (cm)
mean             5.843333               NaN                NaN
median           5.800000               NaN                NaN
std              0.828066               NaN                NaN
min                   NaN               2.0                NaN
max                   NaN               4.4                NaN
sum                   NaN               NaN              563.7
count                 NaN               NaN              150.0

Data Cleaning and Handling Missing Values


Data cleaning involves preparing and tidying up the data for analysis.
Handling missing values is a crucial step in this process.
Missing values in datasets can affect analysis and modeling. Methods such as dropping missing values,
filling them with a specific value, or imputing values based on statistical measures are common strategies.
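
The three strategies can be sketched on a small made-up DataFrame. Mean imputation is used here as one example of a statistical measure; it is an assumption, not necessarily the method behind the output in this lesson.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"length": [5.1, np.nan, 4.7, np.nan], "width": [3.5, 3.0, 3.2, 3.1]}
)

before = df.isna().sum()          # missing values per column

dropped = df.dropna()             # strategy 1: drop incomplete rows
constant = df.fillna(0)           # strategy 2: fill with a fixed value
imputed = df.fillna(df.mean())    # strategy 3: impute each column's mean

after = imputed.isna().sum()
print("Missing values before handling:", before, sep="\n")
print("Missing values after handling:", after, sep="\n")
```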

Output:

Missing values before handling:
sepal length (cm)    30
sepal width (cm)      0
petal length (cm)     0
petal width (cm)      0
dtype: int64

Missing values after handling:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64

Correlation and Covariance


Correlation measures the strength and direction of the linear relationship between two variables, while
covariance measures their joint variability, that is, how the two variables change together.
Correlation values range from -1 to 1. Covariance is not bounded, and its size depends on the units of the
variables.
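
A minimal sketch with made-up columns, where y rises with x and z falls as x rises:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

corr_matrix = df.corr()            # pairwise Pearson correlation by default
cov_matrix = df.cov()              # pairwise sample covariance
xy_corr = df["x"].corr(df["y"])    # correlation between two specific columns

print(corr_matrix)
print(cov_matrix)
```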

Outlier Detection and Treatment


Outliers are data points significantly different from other observations in a dataset.
Outliers can affect statistical measures and model performance.
Techniques involve detecting and handling outliers, like removing or transforming them.
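
One common detection technique is the 1.5 × IQR rule, sketched below on made-up values. The rule and its threshold are a choice made for this example, not something this lesson prescribes.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])

# Fences at 1.5 * IQR below the first quartile and above the third quartile.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
removed = values[~is_outlier]                     # handling option 1: drop outliers
capped = values.clip(lower=lower, upper=upper)    # handling option 2: cap at the fences
```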

Applying Functions to DataFrames
Pandas allows applying functions across rows or columns of DataFrames using various methods.
‘.apply()’ works along rows or columns, ‘.map()’ works element-wise on a Series, and ‘.applymap()’ works
element-wise on a DataFrame (named ‘DataFrame.map()’ from pandas 2.1).

Introduction to Window Functions


Window functions in Pandas allow for performing calculations on a specific subset of data within a defined
window or range.
These functions operate on a sliding or rolling window over a Series or DataFrame, enabling various
analytical operations.

Rolling Window
Rolling windows are a fundamental aspect of window functions.
They involve calculating statistics or applying functions to a window of a specified size as it moves
through the data.
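
A minimal rolling-window sketch on a made-up Series with a window of three values:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Mean over each window of 3 consecutive values.
# The first two results are NaN because the window is not yet full.
rolling_mean = s.rolling(window=3).mean()

# Several statistics over the same window at once.
rolling_stats = s.rolling(window=3).agg(["sum", "min", "max"])

print(rolling_mean)
print(rolling_stats)
```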

Window Statistics and Aggregations:

Window statistics and aggregations involve computing summary statistics within a window, which may include
mean, sum, min, max, standard deviation, and custom aggregations.

Window Grouping and Operations:

Window grouping involves applying operations within specific groups of data rather than the entire dataset.
This is achieved using Pandas' groupby() function along with window operations.
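
One way to combine the two is sketched below: a rolling mean computed separately within each group. The store and sales columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "store": ["A", "A", "A", "B", "B", "B"],
        "sales": [10, 20, 30, 5, 15, 25],
    }
)

# The window restarts for every store, so values never mix across groups.
df["rolling_sales"] = df.groupby("store")["sales"].transform(
    lambda group: group.rolling(window=2).mean()
)
print(df)
```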

Custom Window Functions:

Custom window functions involve defining and applying user-defined functions to windows in Pandas.
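
A sketch of a user-defined window function passed to rolling().apply(). The function value_range is a hypothetical helper written for this example.

```python
import pandas as pd


def value_range(window):
    """Spread between the largest and smallest value in the window."""
    return window.max() - window.min()


s = pd.Series([3, 8, 5, 10, 4])

# The function is called once per full window of 3 values.
spread = s.rolling(window=3).apply(value_range)
print(spread)
```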

Handling Missing Values in Windows:

Handling missing values within windows involves strategies like filling or interpolating missing data to ensure
the integrity of window-based calculations.
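
Two such strategies, sketched on a made-up Series with one gap: relaxing the window requirement with min_periods, and interpolating before the window is applied.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])

# Default: a window gives NaN unless all 3 values are present.
strict = s.rolling(window=3).mean()

# min_periods=2: a window needs only 2 valid values to produce a result.
tolerant = s.rolling(window=3, min_periods=2).mean()

# Fill the gap by linear interpolation, then apply the window.
interpolated = s.interpolate().rolling(window=3).mean()
```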

Introduction to Date Functionality in Pandas


Date and Time Data Types in Pandas

Pandas primarily uses the following types for handling date and time information:
‘Timestamp’ Data Type
Represents a single point in time and is the fundamental type Pandas uses to work with dates and times.
Each Timestamp object supports nanosecond precision and is based on the datetime64 data type in
NumPy.
DatetimeIndex and PeriodIndex
DatetimeIndex is used to index Pandas data structures like Series and DataFrame by timestamps.
PeriodIndex represents spans of time, such as a specific day, month, or quarter.
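
These types can be sketched as follows, using arbitrary example dates:

```python
import pandas as pd

# A single point in time.
ts = pd.Timestamp("2024-03-15 14:30:00")

# A DatetimeIndex used as the index of a Series.
idx = pd.DatetimeIndex(["2024-01-01", "2024-01-02", "2024-01-03"])
s = pd.Series([10, 20, 30], index=idx)
jan_2 = s.loc[pd.Timestamp("2024-01-02")]

# A Period covers a span of time rather than a single instant.
period = pd.Period("2024-03", freq="M")                      # all of March 2024
periods = pd.period_range("2024-01", periods=3, freq="M")    # a PeriodIndex
```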

Date Range Creation
Creating date ranges is a common task when working with time series data.
Pandas provides the ‘pd.date_range()’ function for generating sequences of dates.
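
A few typical calls, sketched with arbitrary start dates:

```python
import pandas as pd

# Every day between two dates, both endpoints included.
daily = pd.date_range(start="2024-01-01", end="2024-01-07", freq="D")

# Four weekly dates; "W" anchors each date on a Sunday.
weekly = pd.date_range(start="2024-01-01", periods=4, freq="W")

# Three business days starting on a Friday, so the weekend is skipped.
business = pd.date_range(start="2024-01-05", periods=3, freq="B")
```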

Date Parsing and Formatting


Pandas facilitates parsing strings into Timestamp objects with ‘pd.to_datetime()’, which accepts
‘strptime’-style format codes, and formatting Timestamp objects into strings with ‘strftime()’.
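
A short sketch of parsing and formatting, for a single value and for a whole column. The date strings are made up.

```python
import pandas as pd

# Parse one string using an explicit day/month/year format.
parsed = pd.to_datetime("15/03/2024", format="%d/%m/%Y")

# Parse a whole column of ISO-formatted strings.
column = pd.to_datetime(pd.Series(["2024-01-05", "2024-02-10"]))

# Format back into strings.
as_text = parsed.strftime("%Y-%m-%d")
column_text = column.dt.strftime("%d-%m-%Y")
```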

Working with Date and Time Components
Pandas provides the ‘dt’ accessor, allowing easy extraction of different components (like year, month, day,
etc.) from date and time objects.
This accessor exposes various attributes and methods for datetime values in a Pandas Series or a DataFrame
column.
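
A sketch of the accessor on a made-up Series of dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-15", "2024-06-30", "2024-12-25"]))

years = dates.dt.year
months = dates.dt.month
weekday_names = dates.dt.day_name()     # name of the weekday for each date
is_month_end = dates.dt.is_month_end    # True where the date is the last day of a month
```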


Resampling and Shifting Time Series Data


Resampling in Pandas refers to changing the frequency of time series data. This is achieved using the
‘resample()’ method.
It enables the aggregation of data based on a specific time frequency (e.g., daily data aggregated into
monthly data). Shifting, done with the ‘shift()’ method, moves values forward or backward along the index.
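
A sketch of both operations on four made-up daily values that straddle a month boundary. The "MS" alias labels each monthly bin by its first day.

```python
import pandas as pd

idx = pd.date_range("2024-01-30", periods=4, freq="D")   # Jan 30 through Feb 2
daily = pd.Series([1, 2, 3, 4], index=idx)

# Aggregate daily values into one total per month.
monthly_total = daily.resample("MS").sum()

# Move every value one row later; the first row has no predecessor and becomes NaN.
previous_day = daily.shift(1)
```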


Time Zone Handling


Pandas enables easy handling of time zones for datetime objects using the ‘tz_localize()’ and ‘tz_convert()’
methods.
This functionality is crucial when working with datasets from different time zones or
requiring conversion between time zones.
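
A sketch of the two methods: tz_localize() attaches a time zone to naive timestamps, and tz_convert() translates them to another zone. The timestamps and zone names are arbitrary examples.

```python
import pandas as pd

naive = pd.Series(pd.to_datetime(["2024-01-01 12:00", "2024-07-01 12:00"]))

# Declare that the naive timestamps are in UTC.
utc = naive.dt.tz_localize("UTC")

# Convert to New York time; the offset differs between winter and summer.
new_york = utc.dt.tz_convert("America/New_York")
print(new_york)
```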


Handling Missing Dates and Time Series Gaps:


When working with time series data, missing dates or gaps can occur.
Pandas provides methods like ‘reindex()’ or ‘asfreq()’ to handle missing dates by either adding missing
dates and filling them with default values or by specifying a frequency to fill in the gaps.
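
Both approaches, sketched on a made-up Series that skips two days:

```python
import pandas as pd

idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
s = pd.Series([10, 20, 50], index=idx)

# asfreq: conform to a daily frequency; Jan 3 and Jan 4 appear as NaN.
with_gaps = s.asfreq("D")

# asfreq with a fill method: gaps take the last known value.
forward_filled = s.asfreq("D", method="ffill")

# reindex: build the full date range explicitly and choose a default value.
full_range = pd.date_range(s.index.min(), s.index.max(), freq="D")
zero_filled = s.reindex(full_range, fill_value=0)
```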


Pandas Timedelta:

Pandas' Timedelta represents a duration or difference in time. It's useful for performing arithmetic operations on
dates or time-related data, such as calculating time differences or adding/subtracting time intervals.
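
A minimal sketch of creating a Timedelta, adding it to a timestamp, and obtaining one as the difference between two dates. The dates are arbitrary.

```python
import pandas as pd

# A duration of 2 days and 3 hours.
delta = pd.Timedelta(days=2, hours=3)

# Adding a duration to a timestamp gives a later timestamp.
start = pd.Timestamp("2024-01-01 09:00")
later = start + delta

# Subtracting two timestamps gives a Timedelta.
elapsed = pd.Timestamp("2024-01-10") - pd.Timestamp("2024-01-01")
print(later, elapsed)
```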

import pandas as pd


Handling Time Differences:

Timedelta can also be used to calculate the difference between two dates or time-related data.


Application in Time Series Operations:

Timedelta is often used in time series operations, such as shifting or creating date offsets.
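
A sketch of these operations on a made-up Series of dates. pd.DateOffset is used next to Timedelta because calendar units such as months have no fixed length.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-31", "2024-02-15"]))

# Shift every date by a fixed duration.
plus_week = dates + pd.Timedelta(weeks=1)

# Shift by a calendar offset; Jan 31 plus one month lands on the last day of February.
plus_month = dates + pd.DateOffset(months=1)

# Differences between consecutive dates are Timedelta values.
gaps = dates.diff()
```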
