Data Science Data Manipulation With Pandas

Uploaded by

MAryam Khan
Copyright
© All Rights Reserved
DATA SCIENCE

Data Manipulation with Pandas


Disclaimer Statement
In preparation of these slides, materials have been taken from different online sources in the
shape of books, websites, research papers and presentations etc. However, the author does not
have any intention to take any benefit of these in her/his own name. This lecture (audio, video,
slides etc) is prepared and delivered only for educational purposes and is not intended to
infringe upon the copyrighted material. Sources have been acknowledged where applicable. The
views expressed are presenter’s alone and do not necessarily represent actual author(s) or the
institution.
Hierarchical Indexing

Hierarchical indexing, also known as multi-indexing, incorporates
multiple index levels within a single index.

In this way, higher-dimensional data can be compactly represented
within the familiar one-dimensional Series and two-dimensional
DataFrame objects.
A Multiply Indexed Series
Let’s start by considering how we might represent two-dimensional
data within a one-dimensional Series. For concreteness, we will
consider a Series of data where each point has a character and a
numerical key.
The bad way
Suppose you would like to track data about states from two different
years. A tempting first approach is to simply use Python tuples as keys.
With this scheme, you can straightforwardly index or slice the Series
based on this multiple index.
The convenience ends there, however: if you need to select all values
from 2010, you’ll need to do some messy munging to make it happen.
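A minimal sketch of the tuple-keyed approach. The population figures and the names index, pop and is_2010 are assumed sample values for illustration; the slides do not give them.

```python
import pandas as pd

# Tuple keys (state, year); the population figures are sample values
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)

# Selecting every 2010 entry means scanning the tuple keys by hand
is_2010 = [key[1] == 2010 for key in pop.index]
pop_2010 = pop[is_2010]
```

The hand-written scan over the keys is the "munging" the slide refers to; it works, but it is neither clean nor efficient for large data.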
The better way: Pandas MultiIndex
We can create a multi-index from the tuples with pd.MultiIndex.from_tuples().
The work of creating the MultiIndex is done in the background.
Similarly, if you pass a dictionary with appropriate tuples as keys,
Pandas will automatically recognize this and use a MultiIndex by
default.
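The same data with a real MultiIndex might look like the sketch below. The figures are again assumed sample values, and the variable names are illustrative.

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]   # sample values

index = pd.MultiIndex.from_tuples(tuples)
pop = pd.Series(populations, index=index)

# All entries whose second level equals 2010, indexed by state
pop_2010 = pop.loc[:, 2010]

# A dictionary keyed by tuples yields a MultiIndex automatically
pop_from_dict = pd.Series(dict(zip(tuples, populations)))
```

Compared with the tuple-keyed Series, the 2010 selection is now a single indexing expression.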
MultiIndex as extra dimension
With a MultiIndex, recording an additional quantity for each index
entry is as easy as adding another column to the DataFrame.
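A short sketch of treating the second index level as an extra dimension. The under18 column and all figures are assumed sample values, not data from the slides.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)   # sample values

# The second index level can be pivoted out into columns
pop_df = pop.unstack()

# Extra quantities per (state, year) entry are simply extra columns
pop_multi = pd.DataFrame({
    'total': pop,
    'under18': [9267089, 9284094,
                4687374, 4318033,
                5906301, 6879014],   # sample values
})
fraction_under18 = pop_multi['under18'] / pop_multi['total']
```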
Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed Series
or DataFrame is to simply pass a list of two or more index arrays to
the constructor.
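For instance, a sketch with random data; the column names data1 and data2 are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two parallel index arrays become the two levels of a MultiIndex
df = pd.DataFrame(rng.random((4, 2)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
```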
Explicit MultiIndex constructors
For more flexibility in how the index is constructed, you can instead
use the class method constructors available on pd.MultiIndex, such as
from_arrays(), from_tuples(), and from_product().
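The three constructors below are one way to build the same two-level index; the labels are arbitrary.

```python
import pandas as pd

# From parallel arrays, one per level
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# From a list of tuples, one per index entry
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# From the Cartesian product of the level values
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
```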
MultiIndex level names
Sometimes it is convenient to name the levels of the MultiIndex. You can accomplish this by
passing the names argument to any of the above MultiIndex constructors, or by setting the
names attribute of the index after the fact.
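Both ways of naming levels can be sketched as follows; the level names state and year are assumed for illustration.

```python
import pandas as pd

# Names supplied at construction time
index = pd.MultiIndex.from_product(
    [['California', 'Texas'], [2000, 2010]], names=['state', 'year'])

# Names assigned afterwards through the names attribute
other = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]])
other.names = ['state', 'year']
```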
MultiIndex for columns

In a DataFrame, the rows and columns are completely symmetric, and just as the rows can
have multiple levels of indices, the columns can have multiple levels as well. Consider the
following, which is a mock-up of some medical data.
With this in place, we can index the top-level column by the
person’s name and get a full DataFrame containing just that
person’s information.
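A possible mock-up is sketched below. The subject names other than Guido, the level names, and the random values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hierarchical rows (year, visit) and columns (subject, type)
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Mock measurements: random values, not real data
rng = np.random.default_rng(0)
data = np.round(rng.normal(size=(4, 6)), 1)
data[:, ::2] *= 10   # widen the spread of the HR columns
data += 37           # centre every column around 37

health_data = pd.DataFrame(data, index=index, columns=columns)

# Indexing the top column level returns that person's sub-table
guido = health_data['Guido']
```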
Indexing and Slicing a MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and
it helps if you think about the indices as added dimensions.

We’ll first look at indexing multiply indexed Series, and then
multiply indexed DataFrames.
Multiply indexed Series

Consider the multiply indexed Series of state populations we saw
earlier. We can access single elements by indexing with multiple
terms.
Other types of indexing and selection work as well, for example
selection based on Boolean masks.
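A sketch of these selections on the population Series; the figures and level names are assumed sample values.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)   # sample values

single = pop.loc['California', 2000]   # one element, addressed by both levels
california = pop.loc['California']     # partial indexing: a Series keyed by year
year_2000 = pop.loc[:, 2000]           # empty slice on the first level
large = pop[pop > 22000000]            # Boolean mask
```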
Multiply indexed Data Frames
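A multiply indexed DataFrame behaves in a similar manner.
Consider our toy medical DataFrame from before.
Because columns are primary in a DataFrame, the syntax used for multiply
indexed Series applies to the columns: for example,
we can recover Guido’s heart rate data with a simple operation.
As with the single-index case, we can also use the loc and iloc indexers.
Slices inside these index tuples are awkward, because writing a slice
within a tuple is a syntax error.
You could get around this by building the desired slice explicitly
using Python’s built-in slice() function, but a better way in this
context is to use an IndexSlice object, which Pandas provides for
precisely this situation.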
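These selections might be written as below, on a mock medical DataFrame with random values. The names Bob and Sue and the level names are assumptions.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = np.random.default_rng(0).normal(37, 1, size=(4, 6)).round(1)  # mock data
health_data = pd.DataFrame(values, index=index, columns=columns)

guido_hr = health_data['Guido', 'HR']        # column addressed by (subject, type)
top_left = health_data.iloc[:2, :2]          # positional indexing
bob_hr = health_data.loc[:, ('Bob', 'HR')]   # label-based, tuple for the column

# IndexSlice allows slice syntax inside each axis of a MultiIndex
idx = pd.IndexSlice
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]
```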
Rearranging Multi-Indices

One of the keys to working with multiply indexed data is knowing
how to effectively transform the data.

There are a number of operations that will preserve all the
information in the dataset, but rearrange it for the purposes of various
computations.
Sorted and unsorted indices
We’ll start by creating some simple multiply indexed data where the
indices are not lexicographically sorted.
Many MultiIndex slicing operations, partial slices among them, fail if
the index is not sorted. Pandas provides convenience routines for this
type of sorting, chiefly the sort_index() method of the DataFrame
(older releases also offered sortlevel(), which is no longer available
in current versions). We’ll use sort_index() here.
With the index sorted in this way, partial slicing will work as
expected.
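A small sketch of the problem and the fix; the labels and level names are arbitrary. On an unsorted index the partial slice is expected to raise UnsortedIndexError, which is a subclass of KeyError.

```python
import numpy as np
import pandas as pd

# First-level labels deliberately out of order: a, c, b
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6.0), index=index)

try:
    data.loc['a':'b']        # partial slice on the unsorted index
    slice_failed = False
except KeyError:             # UnsortedIndexError derives from KeyError
    slice_failed = True

data = data.sort_index()     # lexicographic sort of the index levels
sliced = data.loc['a':'b']   # now selects the 'a' and 'b' entries
```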
Index setting and resetting
One way to rearrange hierarchical data is to turn the index labels into
columns; this can be accomplished with the reset_index() method.
Calling this on the population Series will result in a DataFrame with state and year
columns holding the information that was formerly in the index. For clarity, we can optionally
specify the name of the data for the column representation.
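A sketch of the round trip, assuming the levels are named state and year and the data column is called population (both names are illustrative). The reverse step uses set_index().

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'Texas'], [2000, 2010]], names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 20851820, 25145561],
                index=index)   # sample values

# Index levels become ordinary columns; name= labels the data column
pop_flat = pop.reset_index(name='population')

# set_index() rebuilds the MultiIndex from those columns
pop_again = pop_flat.set_index(['state', 'year'])
```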
Data Aggregations on Multi-Indices
We’ve previously seen that Pandas has built-in data aggregation
methods, such as mean(), sum(), and max().

For hierarchically indexed data, the aggregate can be restricted to a
subset of the data by index level. Older Pandas releases accepted a
level parameter on these methods; it was removed in Pandas 2.0, where
the equivalent is to group with groupby(level=...) and then aggregate.
Perhaps we’d like to average out the measurements in the two visits
each year. We can do this by naming the index level we’d like to
explore, in this case the year.
By further making use of the axis keyword (or, in recent Pandas
versions, by grouping the transposed DataFrame), we can take the mean
among levels on the columns as well.
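One way to write these aggregations with groupby(level=...), which works in current Pandas. The mock data, the names Bob and Sue, and the level names are assumptions.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = np.random.default_rng(0).normal(37, 1, size=(4, 6)).round(1)  # mock data
health_data = pd.DataFrame(values, index=index, columns=columns)

# Average the two visits within each year
by_year = health_data.groupby(level='year').mean()

# Average over subjects as well: group the transposed frame by column level
by_year_type = by_year.T.groupby(level='type').mean().T
```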
Combining Datasets: Concat and Append

Some of the most interesting studies of data come from combining
different data sources.

These operations can involve anything from very straightforward
concatenation of two different datasets, to more complicated
database-style joins and merges that correctly handle any overlaps
between the datasets.
For convenience, we’ll define a function that creates a DataFrame of a
particular form that will be useful below.
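The slides do not show the helper itself. A plausible version, under the assumed name make_df, fills each cell with its column label followed by its index label.

```python
import pandas as pd

def make_df(cols, ind):
    """Build a DataFrame whose cells are '<column label><index label>'."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)

example = make_df('ABC', range(3))   # cells 'A0' ... 'C2'
```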
Recall: Concatenation of NumPy Arrays

Concatenation of Series and DataFrame objects is very similar to concatenation
of NumPy arrays, which can be done via the np.concatenate function.
With it, you can combine the contents of two or more arrays into a single array.
The first argument is a list or tuple of arrays to concatenate.
Additionally, it takes an axis keyword that allows you to specify the
axis along which the result will be concatenated.
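A brief illustration with arbitrary values:

```python
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
joined = np.concatenate([x, y, z])            # one flat array of nine values

grid = [[1, 2],
        [3, 4]]
wide = np.concatenate([grid, grid], axis=1)   # join along the second axis
```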
Simple Concatenation with pd.concat

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a
number of options that we’ll discuss momentarily.
pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as
np.concatenate() can be used for simple concatenations of arrays.
It also works to concatenate higher-dimensional objects, such as
DataFrames.
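A minimal sketch with made-up labels and values:

```python
import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
both_series = pd.concat([ser1, ser2])    # six entries, indices 1..6

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])
both_frames = pd.concat([df1, df2])      # rows are stacked by default
```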
Duplicate indices

One important difference between np.concatenate and pd.concat is that Pandas concatenation
preserves indices, even if the result will have duplicate indices! Consider this simple example.
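A small sketch of the effect, with made-up values:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})   # index 0, 1
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})   # index 0, 1 again

combined = pd.concat([x, y])   # labels are kept, so 0 and 1 each appear twice
```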
Ignoring the index
Sometimes the index itself does not matter, and you would prefer it to simply
be ignored. You can specify this option using the ignore_index flag. With this
set to True, the concatenation will create a new integer index for the resulting
Series:
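For example, with two small Series holding made-up values:

```python
import pandas as pd

x = pd.Series(['A', 'B'])   # index 0, 1
y = pd.Series(['C', 'D'])   # index 0, 1

# The original labels are discarded and replaced by 0..n-1
fresh = pd.concat([x, y], ignore_index=True)
```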
Adding MultiIndex keys

Another alternative is to use the keys option to specify a label for the
data sources; the result will be a hierarchically indexed Series
containing the data.
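A sketch using the arbitrary source labels 'x' and 'y':

```python
import pandas as pd

x = pd.Series(['A', 'B'])
y = pd.Series(['C', 'D'])

# Each source gets its own outer index label
keyed = pd.concat([x, y], keys=['x', 'y'])

from_y = keyed.loc['y']   # recover the entries that came from y
```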
Combining Datasets: Merge and Join
One essential feature offered by Pandas is its high-performance, in-memory join and merge
operations.

If you have ever worked with databases, you should be familiar with this type of data
interaction. The main interface for this is the pd.merge function.
Categories of Joins

The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one,
and many-to-many joins.

All three types of joins are accessed via an identical call to the pd.merge() interface; the type of
join performed depends on the form of the input data.
One-to-one joins

The simplest type of merge expression is the one-to-one join.

Consider two DataFrames that contain information on several employees in a
company.
To combine this information into a single DataFrame, we can use
the pd.merge() function.
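A sketch with hypothetical employee records; every name, group and year below is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# The shared 'employee' column is detected and used as the key
df3 = pd.merge(df1, df2)
```

The row order of the two inputs differs, and the merge still pairs each employee correctly.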
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains
duplicate entries. For the many-to-one case, the resulting DataFrame will
preserve those duplicate entries as appropriate.
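A sketch with hypothetical data, in which 'Engineering' appears twice in the left key column:

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                      'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# The single 'Engineering' supervisor row is repeated for each engineer
merged = pd.merge(staff, supervisors)
```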
Many-to-many joins

Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the
key column in both the left and right array contains duplicates, then the result is a
many-to-many merge.
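A sketch with hypothetical data, in which the group key is duplicated on both sides:

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                      'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
skills = pd.DataFrame({'group': ['Accounting', 'Accounting',
                                 'Engineering', 'Engineering', 'HR', 'HR'],
                       'skills': ['math', 'spreadsheets', 'coding', 'linux',
                                  'spreadsheets', 'organization']})

# Every employee is paired with every skill listed for their group
merged = pd.merge(staff, skills)
```

Each of the four employees belongs to a group with two skills, so the result has eight rows.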
Specification of the Merge Key
We’ve already seen the default behavior of pd.merge(): it looks for
one or more matching column names between the two inputs, and
uses this as the key.

However, often the column names will not match so nicely, and
pd.merge() provides a variety of options for handling this.
The on keyword

Most simply, you can explicitly specify the name of the key column using the on keyword,
which takes a column name or a list of column names:
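For example, with hypothetical employee data:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# 'employee' must exist in both frames for on= to work
merged = pd.merge(df1, df2, on='employee')
```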
The left_on and right_on keywords
At times you may wish to merge two datasets with different column names; for example, we
may have a dataset in which the employee name is labeled as “name” rather than “employee”.
In this case, we can use the left_on and right_on keywords to specify the two column names.
The result has a redundant column that we can drop if desired, for
example by using the drop() method of DataFrames.
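A sketch with hypothetical data; the salary column and its values are made up.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# Key columns have different names in the two frames
merged = pd.merge(df1, df3, left_on='employee', right_on='name')

# 'name' duplicates 'employee', so it can be dropped
tidy = merged.drop('name', axis=1)
```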
The left_index and right_index keywords

Sometimes, rather than merging on a column, you would instead like
to merge on an index.
You can use the index as the key for merging by specifying the
left_index and/or right_index flags in pd.merge().
For convenience, DataFrames implement the join() method, which performs a merge that
defaults to joining on indices.
If you’d like to mix indices and columns, you can combine
left_index with right_on or left_on with right_index to get the
desired behavior.
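The three variants can be sketched as follows, using hypothetical employee data indexed by name:

```python
import pandas as pd

df1a = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']}
                    ).set_index('employee')
df2a = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                     'hire_date': [2004, 2008, 2012, 2014]}
                    ).set_index('employee')
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

by_index = pd.merge(df1a, df2a, left_index=True, right_index=True)
joined = df1a.join(df2a)                                        # joins on the index
mixed = pd.merge(df1a, df3, left_index=True, right_on='name')   # index vs. column
```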
Specifying Set Arithmetic for Joins
One important consideration in performing a join is the type of set arithmetic used.
This comes up when a value appears in one key column but not the other.
Suppose we merge two datasets that have only a single “name”
entry in common: Mary. By default, the result contains the
intersection of the two sets of inputs; this is what is known as an
inner join. We can specify this explicitly using the how keyword,
which defaults to 'inner'.
Other options for the how keyword are 'outer', 'left', and 'right'. An
outer join returns a join over the union of the input keys and
fills in all missing values with NAs. The left and right joins return
joins over the left entries and right entries, respectively.
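A sketch of the four options. Apart from the shared entry Mary, the names and the food and drink columns are made up for illustration.

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

inner = pd.merge(df6, df7, how='inner')   # only Mary
outer = pd.merge(df6, df7, how='outer')   # all four names, gaps filled with NaN
left = pd.merge(df6, df7, how='left')     # every row of df6
right = pd.merge(df6, df7, how='right')   # every row of df7
```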
Overlapping Column Names: The suffixes Keyword
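Finally, you may end up in a case where your two input DataFrames have conflicting column
names. Because the output would then contain two columns with the same name, pd.merge()
automatically appends the suffixes _x and _y to keep them unique; the suffixes keyword lets
you choose different ones.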
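A sketch with a hypothetical rank column present in both inputs:

```python
import pandas as pd

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})

default = pd.merge(df8, df9, on='name')                         # rank_x, rank_y
custom = pd.merge(df8, df9, on='name', suffixes=('_L', '_R'))   # rank_L, rank_R
```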
Aggregation and Grouping

An essential piece of analysis of large data is efficient
summarization: computing aggregations like sum(), mean(),
median(), min(), and max(), in which a single number gives insight
into the nature of a potentially large dataset.
Simple Aggregation in Pandas
As with a one-dimensional NumPy array, for a Pandas Series the
aggregates return a single value.
For a DataFrame, by default the aggregates return results within
each column.
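A brief sketch with random data; the column names A and B are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

ser = pd.Series(rng.random(5))
total = ser.sum()                     # a single number

df = pd.DataFrame({'A': rng.random(5), 'B': rng.random(5)})
col_means = df.mean()                 # one value per column
row_means = df.mean(axis='columns')   # one value per row instead
```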
Listing of Pandas aggregation methods
A visual representation of a group by operation
As a concrete example, let’s take a look at using Pandas for the
computation.
To produce a result, we can apply an aggregate to this
DataFrameGroupBy object, which will perform the appropriate
apply/combine steps to produce the desired result.
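A minimal sketch of split, apply and combine on a toy DataFrame; the column names key and data are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

grouped = df.groupby('key')   # a DataFrameGroupBy; nothing is computed yet
result = grouped.sum()        # the aggregate triggers apply and combine
```

The groupby() call only records how to split the data; the sums are computed when the aggregate is applied.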
Aggregate, filter, transform, apply
The preceding discussion focused on aggregation for the combine operation, but there are more
options available. In particular, GroupBy objects have aggregate(), filter(), transform(), and
apply() methods that efficiently implement a variety of useful operations before combining the
grouped data.
Aggregation

We’re now familiar with GroupBy aggregations with sum(), median(), and the like, but the
aggregate() method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
Another useful pattern is to pass a dictionary mapping column
names to operations to be applied on that column.
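A sketch of both patterns on a toy DataFrame. The data values and the helper value_range are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def value_range(s):
    """Spread of a group's values (hypothetical helper)."""
    return s.max() - s.min()

# A list mixing strings and a function: every aggregate for every column
multi = df.groupby('key').aggregate(['min', 'median', value_range])

# A dictionary: a specific aggregate per column
per_column = df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
```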
Filtering

A filtering operation allows you to drop data based on the group properties. For example, we
might want to keep all groups in which the standard deviation is larger than some critical value.
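A sketch on a toy DataFrame; the threshold of 4 and the data values are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def spread_is_large(group):
    """Return True for groups whose data2 standard deviation exceeds 4."""
    return group['data2'].std() > 4

# Rows of groups failing the test are dropped; here group A is removed
kept = df.groupby('key').filter(spread_is_large)
```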
Transformation
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine. For such a transformation, the output is the
same shape as the input.
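A common illustration is centering the data by subtracting the group-wise mean, sketched here on toy data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Each value has its own group's mean subtracted; the shape is unchanged
centered = df.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())
```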
The apply() method

The apply() method lets you apply an arbitrary function to the group results. The function
should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a
scalar; the combine operation will be tailored to the type of output returned.
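A sketch in which each group's data1 column is normalized by the sum of its data2 column. The function name and the data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def norm_by_data2(group):
    """Divide data1 by the group's data2 total; receives one group's DataFrame."""
    return group.assign(data1=group['data1'] / group['data2'].sum())

# The value columns are selected explicitly so the function sees only them
result = df.groupby('key')[['data1', 'data2']].apply(norm_by_data2)
```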
A list, array, series, or index providing the grouping keys

The key can be any series or list with a length matching that of the DataFrame.
Of course, this means there’s another, more verbose way of
accomplishing the df.groupby('key') from before: passing the column
itself, as in df.groupby(df['key']).
A dictionary or series mapping index to group
Another method is to provide a dictionary that maps index values to the group keys.
Any Python function
Similar to mapping, you can pass any Python function that will input
the index value and output the group.
A list of valid keys
Further, any of the preceding key choices can be combined to group
on a multi-index.
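The key types described above can be sketched together on a toy DataFrame. The list of labels and the vowel/consonant mapping are arbitrary examples.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# A list with one group label per row
labels = [0, 1, 0, 1, 2, 0]
by_list = df[['data1', 'data2']].groupby(labels).sum()

# A dictionary mapping index values to groups
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
by_dict = df2.groupby(mapping).sum()

# Any function of the index value
by_func = df2.groupby(str.lower).mean()

# A list of valid keys produces a multi-index
by_both = df2.groupby([str.lower, mapping]).mean()
```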
Reference
VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data.
O'Reilly Media.
