Data Science Data Manipulation With Pandas

Uploaded by

MAryam Khan

DATA SCIENCE

Data Manipulation with Pandas

Disclaimer Statement
In preparing these slides, materials have been taken from different online sources, including
books, websites, research papers, and presentations. However, the author does not
intend to take credit for these in her/his own name. This lecture (audio, video,
slides, etc.) is prepared and delivered only for educational purposes and is not intended to
infringe upon copyrighted material. Sources have been acknowledged where applicable. The
views expressed are the presenter’s alone and do not necessarily represent those of the actual author(s) or the
institution.
Hierarchical Indexing

Hierarchical indexing, also known as multi-indexing, incorporates multiple index levels within a single index.

In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional
DataFrame objects.
A Multiply Indexed Series
Let’s start by considering how we might represent two-dimensional
data within a one-dimensional Series. For concreteness, we will
consider a series of data where each point has a character key and a
numerical key.
The bad way
Suppose you would like to track data about states from two different
years. With Python tuples as index keys, you can straightforwardly index or slice the Series based on this
multiple index. But if you need to select all values from 2010, you’ll need to do some
messy munging to make it happen:
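The code from this slide did not survive, so here is a minimal sketch of the tuple-keyed approach. The state names and the population values (in millions) are illustrative placeholders, not figures from the slides.

```python
import pandas as pd

# Hypothetical data: (state, year) tuples as keys, placeholder values in millions.
keys = [('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
        ('Texas', 2000), ('Texas', 2010)]
values = [33.9, 37.3, 19.0, 19.4, 20.9, 25.1]
pop = pd.Series(values, index=keys)

# Selecting every 2010 entry means inspecting each key tuple by hand.
mask = [key[1] == 2010 for key in pop.index]
pop_2010 = pop[mask]
print(pop_2010)
```

The list comprehension works, but it is neither as clean nor as efficient as index-based slicing.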
The better way: Pandas MultiIndex
We can create a multi-index from the tuples as follows:
The work of creating the MultiIndex is done in the background.
Similarly, if you pass a dictionary with appropriate tuples as keys,
Pandas will automatically recognize this and use a MultiIndex by
default.
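A sketch of both construction routes, reusing the same placeholder state data as an assumption. `pd.MultiIndex.from_tuples` is the explicit route; the dictionary route relies on pandas recognizing tuple keys.

```python
import pandas as pd

# Placeholder (state, year) keys and values in millions.
keys = [('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
        ('Texas', 2000), ('Texas', 2010)]
values = [33.9, 37.3, 19.0, 19.4, 20.9, 25.1]

# Explicit route: build the MultiIndex from the tuples.
index = pd.MultiIndex.from_tuples(keys)
pop = pd.Series(values, index=index)

# Implicit route: a dict with tuple keys also produces a MultiIndex.
pop_from_dict = pd.Series(dict(zip(keys, values)))

# All 2010 values, selected through the second index level.
pop_2010 = pop.loc[:, 2010]
print(pop_2010)
```

Compared with the tuple-keyed version, the 2010 selection is now a single indexing expression.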
MultiIndex as extra dimension
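A multiply indexed Series can hold data that is conceptually two-dimensional, so each extra index level acts as an extra dimension. If we want to record another measurement for each index entry,
with a MultiIndex this is as easy as adding another column to the
DataFrame.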
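A possible illustration, assuming the same placeholder population data plus a hypothetical `under18` measurement. `unstack()` shows the extra dimension directly, and the DataFrame constructor adds the second column.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

# The second index level becomes the columns of an ordinary 2-D DataFrame.
pop_2d = pop.unstack()

# Adding a second measurement is just another column (placeholder values).
under18 = [9.3, 9.3, 4.7, 4.3, 5.9, 6.9]
pop_full = pd.DataFrame({'total': pop, 'under18': under18})

# Column arithmetic keeps the hierarchical index.
fraction_under18 = pop_full['under18'] / pop_full['total']
print(pop_2d)
print(fraction_under18)
```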
Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed Series
or DataFrame is to pass a list of two or more index arrays to
the constructor.
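For instance, a small sketch with made-up labels and random values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two parallel index arrays become the two levels of a MultiIndex.
df = pd.DataFrame(rng.random((4, 2)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
print(df)
```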
Explicit MultiIndex constructors
For more flexibility in how the index is constructed, you can instead
use the class method constructors available on pd.MultiIndex, such as from_arrays, from_tuples, and from_product.
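As a sketch, the three constructors below build the same index from differently shaped inputs (the labels are arbitrary):

```python
import pandas as pd

# From parallel arrays, one per level.
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# From a list of tuples, one per row.
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# From the Cartesian product of the level values.
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

print(from_product)
```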
MultiIndex level names
Sometimes it is convenient to name the levels of the MultiIndex. You can accomplish this by
passing the names argument to any of the above MultiIndex constructors, or by setting the
names attribute of the index after the fact.
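Both options might look like this; the level names are chosen for illustration only.

```python
import pandas as pd

# Option 1: pass names to the constructor.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])

# Option 2: set the names attribute after the fact.
other = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
other.names = ['letter', 'number']

print(index.names, other.names)
```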
MultiIndex for columns

In a DataFrame, the rows and columns are completely symmetric, and just as the rows can
have multiple levels of indices, the columns can have multiple levels as well. Consider the
following, which is a mock-up of some medical data.
With this in place, we can index the top-level column by the
person’s name and get a full DataFrame containing just that
person’s information.
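The mock-up itself is missing from the extracted slides. A sketch of what it might look like follows; the name `health_data`, the subjects other than Guido, and the random measurements are all assumptions.

```python
import numpy as np
import pandas as pd

# Hierarchical rows (year, visit) and hierarchical columns (subject, type).
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Random placeholder measurements, not real medical data.
rng = np.random.default_rng(0)
data = np.round(rng.normal(size=(4, 6)), 1)
health_data = pd.DataFrame(data, index=index, columns=columns)

# Indexing the top column level returns that person's sub-table.
guido = health_data['Guido']
print(guido)
```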
Indexing and Slicing a MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions.

We’ll first look at indexing multiply indexed Series, and then multiply indexed DataFrames.
Multiply indexed Series

Consider the multiply indexed Series of state populations we saw earlier. We can access single elements by indexing with multiple
terms.
Other types of indexing work as well, for example selection based on Boolean masks.
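A sketch of these selections on the placeholder population Series (values in millions are illustrative, and the threshold of 22 is arbitrary):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

single = pop.loc['California', 2000]       # one element, via both levels
partial = pop.loc['California']            # all years for one state
sliced = pop.loc['California':'New York']  # partial slice (index must be sorted)
by_year = pop.loc[:, 2000]                 # select on the lower level
large = pop[pop > 22]                      # Boolean mask

print(single)
print(large)
```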
Multiply indexed DataFrames
A multiply indexed DataFrame behaves in a similar manner.
Consider our toy medical DataFrame from before.
Because columns are primary in a DataFrame, the syntax used for multiply indexed Series applies to the columns. For example,
we can recover Guido’s heart rate data with a simple operation on the columns.
As with the single-index case, we can use the loc and iloc indexers.
Slices cannot be written directly inside an index tuple, however.
You could get around this by building the desired slice explicitly
using Python’s built-in slice() function, but a better way in this
context is to use an IndexSlice object, which Pandas provides for
precisely this situation.
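These operations can be sketched on a rebuilt version of the toy medical table. The name `health_data`, the extra subjects, and the random values are assumptions.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(np.round(rng.normal(size=(4, 6)), 1),
                           index=index, columns=columns)

guido_hr = health_data['Guido', 'HR']        # column lookup through both levels
corner = health_data.iloc[:2, :2]            # positional indexing
bob_hr = health_data.loc[:, ('Bob', 'HR')]   # label indexing with a tuple

# IndexSlice allows slices inside each axis: first visits, heart rate only.
idx = pd.IndexSlice
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```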
Rearranging Multi-Indices

One of the keys to working with multiply indexed data is knowing how to effectively transform the data.

There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various
computations.
Sorted and unsorted indices
We’ll start by creating some simple multiply indexed data where the
indices are not lexicographically sorted.
Partial slices and similar operations require the levels of the MultiIndex to be sorted; on an unsorted index they raise an error.
Pandas provides convenience routines to perform this
type of sorting, chiefly the sort_index() method. Older material also mentions a
sortlevel() method of the DataFrame, which has since been removed from pandas. We’ll use sort_index()
here.
With the index sorted in this way, partial slicing will work as
expected.
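A small sketch of the failure and the fix, using arbitrary labels. Recent pandas versions report the failure as an `UnsortedIndexError`, which is a subclass of `KeyError`.

```python
import numpy as np
import pandas as pd

# The first level is deliberately out of order: a, c, b.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.arange(6.0), index=index)

# A partial slice on the unsorted index fails.
try:
    data.loc['a':'b']
    slice_failed = False
except KeyError as err:
    slice_failed = True
    print('slice failed:', err)

# After sorting, the same slice works.
data = data.sort_index()
part = data.loc['a':'b']
print(part)
```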
Index setting and resetting
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method.
Calling this on the population Series will result in a DataFrame with a state and year
column holding the information that was formerly in the index. For clarity, we can optionally
specify the name of the data for the column representation.
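A sketch on the placeholder population Series; the column name `population` is an illustrative choice. `set_index` performs the reverse step.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

# Index levels become ordinary columns; name labels the data column.
pop_flat = pop.reset_index(name='population')

# The reverse: rebuild the MultiIndex from the columns.
restored = pop_flat.set_index(['state', 'year'])
print(pop_flat)
```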
Data Aggregations on Multi-Indices
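We’ve previously seen that Pandas has built-in data aggregation
methods, such as mean(), sum(), and max().

For hierarchically indexed data, the aggregate can be computed per index level. Older pandas releases accepted a level parameter on these methods to control which subset of the data the aggregate is computed on; recent releases have removed that parameter, and grouping on the level with groupby(level=...) before aggregating gives the same result.
Perhaps we’d like to average out the measurements in the two visits
each year. We can do this by naming the index level we’d like to
explore, in this case the year.
The same kind of mean can be taken among levels on the columns as well; older releases did this
through the axis keyword, and transposing before grouping is an alternative.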
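A sketch of both aggregations using `groupby(level=...)`, which works across pandas versions. The table is the assumed `health_data` mock-up with random values.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(np.round(rng.normal(size=(4, 6)), 1),
                           index=index, columns=columns)

# Average the two visits within each year (row level 'year').
data_mean = health_data.groupby(level='year').mean()

# Average across subjects for each measurement type (column level 'type'):
# transpose, group on the level, then transpose back.
type_mean = data_mean.T.groupby(level='type').mean().T
print(type_mean)
```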
Combining Datasets: Concat and Append

Some of the most interesting studies of data come from combining different data sources.

These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated
database-style joins and merges that correctly handle any overlaps
between the datasets.
For convenience, we’ll define a function that creates a DataFrame of a particular
form that will be useful below.
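The function definition did not survive extraction. A sketch in the spirit of the cited handbook could look like this; the name `make_df` and its signature are assumptions.

```python
import pandas as pd

def make_df(cols, ind):
    """Build a DataFrame whose cells combine the column and row labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)

# Example: a 3x3 frame with cells such as 'A0', 'B1', 'C2'.
example = make_df('ABC', range(3))
print(example)
```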
Recall: Concatenation of NumPy Arrays

Concatenation of Series and DataFrame objects is very similar to concatenation of NumPy arrays, which can be done via the np.concatenate function.
With it, you can combine the contents of two or more arrays into a single array.
The first argument is a list or tuple of arrays to concatenate.
Additionally, it takes an axis keyword that allows you to specify the
axis along which the result will be concatenated.
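A short sketch with arbitrary values:

```python
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]

# One-dimensional inputs are joined end to end.
flat = np.concatenate([x, y, z])

# For two-dimensional inputs, axis=1 joins side by side.
grid = np.array([[1, 2], [3, 4]])
wide = np.concatenate([grid, grid], axis=1)

print(flat)
print(wide)
```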
Simple Concatenation with pd.concat

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a
number of options that we’ll discuss momentarily.
pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as
np.concatenate() can be used for simple concatenations of arrays:
It also works to concatenate higher-dimensional objects, such as
DataFrames.
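A sketch with small made-up objects, covering a Series, row-wise DataFrame concatenation, and column-wise concatenation via `axis=1`:

```python
import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
ser_all = pd.concat([ser1, ser2])

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])
rows = pd.concat([df1, df2])            # stacked vertically (default axis=0)

df3 = pd.DataFrame({'C': ['C1', 'C2'], 'D': ['D1', 'D2']}, index=[1, 2])
cols = pd.concat([df1, df3], axis=1)    # joined side by side

print(rows)
print(cols)
```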
Duplicate indices

One important difference between np.concatenate and pd.concat is that Pandas concatenation
preserves indices, even if the result will have duplicate indices! Consider this simple example.
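The example is missing from the extracted slides; a minimal sketch with two made-up Series that share the same labels:

```python
import pandas as pd

x = pd.Series(['A0', 'A1'], index=[0, 1])
y = pd.Series(['A2', 'A3'], index=[0, 1])

# Both inputs keep their original labels, so 0 and 1 each appear twice.
combined = pd.concat([x, y])
print(combined)
print('unique index?', combined.index.is_unique)
```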
Ignoring the index
Sometimes the index itself does not matter, and you would prefer it to simply
be ignored. You can specify this option using the ignore_index flag. With this
set to True, the concatenation will create a new integer index for the resulting
Series:
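A sketch with the same kind of made-up inputs:

```python
import pandas as pd

x = pd.Series(['A0', 'A1'], index=[0, 1])
y = pd.Series(['A2', 'A3'], index=[0, 1])

# The original labels are discarded and replaced by 0..n-1.
combined = pd.concat([x, y], ignore_index=True)
print(combined)
```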
Adding MultiIndex keys

Another alternative is to use the keys option to specify a label for the
data sources; the result will be a hierarchically indexed series
containing the data.
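A sketch of the keys option; the labels 'x' and 'y' are arbitrary source names.

```python
import pandas as pd

x = pd.Series(['A0', 'A1'], index=[0, 1])
y = pd.Series(['A2', 'A3'], index=[0, 1])

# Each source gets its own outer index label.
combined = pd.concat([x, y], keys=['x', 'y'])
print(combined)

# The outer level selects everything that came from one source.
from_y = combined.loc['y']
```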
Combining Datasets: Merge and Join
One essential feature offered by Pandas is its high-performance, in-memory join and merge
operations.

If you have ever worked with databases, you should be familiar with this type of data
interaction. The main interface for this is the pd.merge function.
Categories of Joins

The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one,
and many-to-many joins.

All three types of joins are accessed via an identical call to the pd.merge() interface; the type of
join performed depends on the form of the input data.
One-to-one joins

The simplest type of merge expression is the one-to-one join.

Consider the following two DataFrames, which contain information on several employees in a
company.
To combine this information into a single DataFrame, we can use
the pd.merge() function.
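The employee tables are missing from the extracted slides. This sketch uses hypothetical names, groups, and hire dates.

```python
import pandas as pd

# Hypothetical employee data.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# pd.merge finds the shared 'employee' column and joins on it,
# even though the rows are in a different order in each input.
df3 = pd.merge(df1, df2)
print(df3)
```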
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains
duplicate entries. For the many-to-one case, the resulting DataFrame will
preserve those duplicate entries as appropriate.
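A sketch with hypothetical data, in which the `group` key repeats on the left and is unique on the right:

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# 'Engineering' appears twice on the left, so its supervisor is repeated.
merged = pd.merge(employees, supervisors)
print(merged)
```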
Many-to-many joins

Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the
key column in both the left and right inputs contains duplicates, then the result is a
many-to-many merge.
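A sketch with hypothetical data in which the `group` key repeats on both sides. Each left row is paired with every matching right row.

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
skills = pd.DataFrame({'group': ['Accounting', 'Accounting',
                                 'Engineering', 'Engineering', 'HR', 'HR'],
                       'skills': ['math', 'spreadsheets', 'coding', 'linux',
                                  'spreadsheets', 'organization']})

# 1 Accounting employee x 2 skills + 2 Engineering x 2 + 1 HR x 2 = 8 rows.
merged = pd.merge(employees, skills)
print(merged)
```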
Specification of the Merge Key
We’ve already seen the default behavior of pd.merge(): it looks for
one or more matching column names between the two inputs, and
uses these as the key.

However, often the column names will not match so nicely, and
pd.merge() provides a variety of options for handling this.
The on keyword

Most simply, you can explicitly specify the name of the key column using the on keyword,
which takes a column name or a list of column names:
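For example, with hypothetical employee tables:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# The key column is named explicitly; it must exist in both inputs.
merged = pd.merge(df1, df2, on='employee')
print(merged)
```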
The left_on and right_on keywords
At times you may wish to merge two datasets with different column names; for example, we
may have a dataset in which the employee name is labeled as “name” rather than “employee”.
In this case, we can use the left_on and right_on keywords to specify the two column names.
The result has a redundant column that we can drop if desired, for
example, by using the drop() method of DataFrames.
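A sketch with hypothetical salary data keyed by `name` instead of `employee`:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
salaries = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'salary': [70000, 80000, 120000, 90000]})

# Each side names its own key column.
merged = pd.merge(df1, salaries, left_on='employee', right_on='name')

# 'name' duplicates 'employee', so it can be dropped.
tidy = merged.drop('name', axis=1)
print(tidy)
```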
The left_index and right_index keywords

Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this.
You can use the index as the key for merging by specifying the
left_index and/or right_index flags in pd.merge().
For convenience, DataFrames implement the join() method, which performs a merge that defaults to joining on
indices.
If you’d like to mix indices and columns, you can combine
left_index with right_on or left_on with right_index to get the
desired behavior:
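The three variants can be sketched as follows; all table contents are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
salaries = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'salary': [70000, 80000, 120000, 90000]})

# Move the employee names into the index.
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')

# Merge on both indices.
by_index = pd.merge(df1a, df2a, left_index=True, right_index=True)

# join() does the same with less typing.
joined = df1a.join(df2a)

# Mix: index on the left, a column on the right.
mixed = pd.merge(df1a, salaries, left_index=True, right_on='name')
print(mixed)
```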
Specifying Set Arithmetic for Joins
One important consideration in performing a join is the type of set arithmetic used. This comes up when a value appears in one key
column but not the other.
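The example tables are missing from the extracted slides. A sketch follows with hypothetical data in which only Mary appears in both inputs.

```python
import pandas as pd

# Hypothetical data: 'Mary' is the only name present in both tables.
foods = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                      'food': ['fish', 'beans', 'bread']})
drinks = pd.DataFrame({'name': ['Mary', 'Joseph'],
                       'drink': ['wine', 'beer']})

merged = pd.merge(foods, drinks)
print(merged)
```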
Here we have merged two datasets that have only a single “name”
entry in common: Mary. By default, the result contains the
intersection of the two sets of inputs; this is what is known as an
inner join. We can specify this explicitly using the how keyword,
which defaults to 'inner'.
Other options for the how keyword are 'outer', 'left', and 'right'. An
outer join returns a join over the union of the keys from both inputs and
fills in all missing values with NAs.
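The four options can be compared side by side; the two tables are hypothetical.

```python
import pandas as pd

foods = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                      'food': ['fish', 'beans', 'bread']})
drinks = pd.DataFrame({'name': ['Mary', 'Joseph'],
                       'drink': ['wine', 'beer']})

inner = pd.merge(foods, drinks, how='inner')   # keys present in both inputs
outer = pd.merge(foods, drinks, how='outer')   # keys present in either input
left = pd.merge(foods, drinks, how='left')     # every key from the left input
right = pd.merge(foods, drinks, how='right')   # every key from the right input

print(outer)
```

In the outer result, Peter and Paul have no drink and Joseph has no food, so those cells are filled with NaN.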
Overlapping Column Names: The suffixes Keyword
Finally, you may end up in a case where your two input DataFrames have conflicting column
names. Consider this example:
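The example is missing from the extracted slides; here is a sketch with two hypothetical ranking tables that both contain a `rank` column. By default pd.merge() appends `_x` and `_y` to the conflicting names, and the suffixes keyword overrides that.

```python
import pandas as pd

ranks_a = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'rank': [1, 2, 3, 4]})
ranks_b = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'rank': [3, 1, 4, 2]})

# Default suffixes keep the two 'rank' columns apart.
default = pd.merge(ranks_a, ranks_b, on='name')

# Custom suffixes (chosen here for illustration).
custom = pd.merge(ranks_a, ranks_b, on='name', suffixes=('_L', '_R'))
print(custom)
```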
Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(),
median(), min(), and max(), in which a single number gives insight
into the nature of a potentially large dataset.
Simple Aggregation in Pandas
As with a one-dimensional NumPy array, for a Pandas Series the
aggregates return a single value.
For a DataFrame, by default the aggregates return results within
each column.
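A sketch with small made-up values:

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0])
total = ser.sum()      # a single value
average = ser.mean()   # a single value

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
col_means = df.mean()                  # one result per column
row_means = df.mean(axis='columns')    # one result per row instead

print(col_means)
print(row_means)
```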
Listing of Pandas aggregation methods
A visual representation of a group by operation
As a concrete example, let’s take a look at using Pandas for the
computation.
To produce a result, we can apply an aggregate to this DataFrameGroupBy
object, which will perform the appropriate
apply/combine steps to produce the desired result.
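A minimal split-apply-combine sketch; the column names `key` and `data` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split: groupby returns a lazy DataFrameGroupBy object; nothing is computed yet.
grouped = df.groupby('key')

# Apply and combine: the aggregate runs per group and the results are reassembled.
totals = grouped.sum()
print(totals)
```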
Aggregate, filter, transform, apply
The preceding discussion focused on aggregation for the combine operation, but there are more
options available. In particular, GroupBy objects have aggregate(), filter(), transform(), and
apply() methods that efficiently implement a variety of useful operations before combining the
grouped data.
Aggregation

We’re now familiar with GroupBy aggregations with sum(), median(), and the like, but the
aggregate() method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once. Here is
a quick example combining all these.
Another useful pattern is to pass a dictionary mapping column
names to operations to be applied on that column.
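Both patterns can be sketched on a small made-up table. The helper `value_range` is a hypothetical custom aggregate, included to show a function alongside string names.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def value_range(x):
    """Hypothetical custom aggregate: spread between largest and smallest value."""
    return x.max() - x.min()

# A list mixing string names and a function: every aggregate for every column.
summary = df.groupby('key').aggregate(['min', 'median', value_range])

# A dictionary: a specific aggregate per column.
per_column = df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})

print(summary)
print(per_column)
```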
Filtering

A filtering operation allows you to drop data based on the group properties. For example, we
might want to keep all groups in which the standard deviation is larger than some critical value.
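A sketch on made-up data; the threshold of 4 is an arbitrary critical value.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def spread_is_large(group):
    """Return True for groups whose data2 standard deviation exceeds 4."""
    return group['data2'].std() > 4

# Rows of groups that fail the test are dropped; group A (std about 1.4) is removed.
kept = df.groupby('key').filter(spread_is_large)
print(kept)
```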
Transformation
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine. For such a transformation, the output is the
same shape as the input.
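A common case is centering each group by subtracting its mean, sketched here on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Each value has its own group's mean subtracted; the row count is unchanged.
centered = df.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())
print(centered)
```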
The apply() method

The apply() method lets you apply an arbitrary function to the group results. The function
should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a
scalar; the combine operation will be tailored to the type of output returned.
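A sketch of both return types on made-up data. The function names are hypothetical, and the data columns are selected explicitly so the functions receive only those columns.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def norm_by_data2(group):
    """Return a DataFrame: data1 divided by the group's total of data2."""
    group = group.copy()
    group['data1'] = group['data1'] / group['data2'].sum()
    return group

def data_ratio(group):
    """Return a scalar: total data1 over total data2 for the group."""
    return group['data1'].sum() / group['data2'].sum()

groups = df.groupby('key')[['data1', 'data2']]
normalized = groups.apply(norm_by_data2)   # DataFrame output, one row per input row
ratios = groups.apply(data_ratio)          # scalar output, one value per group

print(normalized)
print(ratios)
```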
A list, array, series, or index providing the grouping keys

The key can be any series or list with a length matching that of the DataFrame. For example:
Of course, this means there’s another, more verbose way of
accomplishing the df.groupby('key') from before.
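A sketch on made-up data; the list `labels` is an arbitrary grouping with one entry per row.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# One label per row: rows 0, 2, 5 form group 0; rows 1, 3 group 1; row 4 group 2.
labels = [0, 1, 0, 1, 2, 0]
by_list = df[['data1', 'data2']].groupby(labels).sum()

# The verbose equivalent of df.groupby('key'): pass the column itself.
verbose = df.groupby(df['key'])[['data1', 'data2']].sum()
concise = df.groupby('key')[['data1', 'data2']].sum()

print(by_list)
```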
A dictionary or series mapping index to group
Another method is to provide a dictionary that maps index values to the group keys.
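A sketch on made-up data; the `mapping` dictionary is a hypothetical classification of the index labels.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# The mapping is applied to the index, so the keys go there first.
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

by_mapping = df2.groupby(mapping).sum()
print(by_mapping)
```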
Any Python function
Similar to mapping, you can pass any Python function that takes
the index value as input and returns the group.
A list of valid keys
Further, any of the preceding key choices can be combined to group
on a multi-index.
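The function-based key and the combined key can be sketched together on made-up data. `str.lower` stands in for any function of the index value, and `mapping` is a hypothetical dictionary.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# A function of the index value: group by the lower-cased label.
by_function = df2.groupby(str.lower).mean()

# A list of valid keys: the result is indexed by both groupings at once.
combined = df2.groupby([str.lower, mapping]).mean()

print(by_function)
print(combined)
```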
Reference
VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data.
O'Reilly Media, Inc.
