Data Science Data Manipulation With Pandas
In a DataFrame, the rows and columns are completely symmetric: just as the rows can
have multiple levels of indices, the columns can have multiple levels as well. Consider a
mock-up of some medical data in which the top-level column label is a person’s name.
With this in place, we can index the top-level column by the
person’s name and get a full DataFrame containing just that
person’s information.
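A minimal sketch of such a mock-up follows. The level names (year, visit, subject, type), the people's names, and the random readings are all assumptions made up for illustration; only the idea of selecting by the top-level column label comes from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: (year, visit) on the rows, (subject, type) on the columns.
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=["year", "visit"])
columns = pd.MultiIndex.from_product([["Bob", "Guido", "Sue"], ["HR", "Temp"]],
                                     names=["subject", "type"])

# Made-up readings; the values carry no medical meaning.
rng = np.random.default_rng(0)
data = np.round(rng.normal(size=(4, 6)), 1)
data[:, ::2] *= 10   # widen the spread of the "HR" columns
data += 37           # shift everything toward plausible magnitudes

health_data = pd.DataFrame(data, index=index, columns=columns)

# Indexing the top-level column label returns that person's sub-frame.
guido = health_data["Guido"]
print(guido)
```

The result keeps the two-level row index and has one column per measurement type.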
Indexing and Slicing a MultiIndex
Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a
number of options that we’ll discuss momentarily.
pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as
np.concatenate() can be used for simple concatenations of arrays.
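The parallel between the two functions can be sketched as follows; the values and index labels are arbitrary placeholders.

```python
import numpy as np
import pandas as pd

# NumPy: concatenate two plain sequences into one array.
arr = np.concatenate([[1, 2, 3], [4, 5, 6]])

# Pandas: the same call shape, but each Series carries its own index.
ser1 = pd.Series(["A", "B", "C"], index=[1, 2, 3])
ser2 = pd.Series(["D", "E", "F"], index=[4, 5, 6])
combined = pd.concat([ser1, ser2])
print(combined)
```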
It also works to concatenate higher-dimensional objects, such as
DataFrames.
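As an illustrative sketch with small made-up frames: by default pd.concat() stacks DataFrames row-wise (axis=0), and passing axis=1 places them side by side instead.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]}, index=[3, 4])
df3 = pd.DataFrame({"C": ["C1", "C2"], "D": ["D1", "D2"]}, index=[1, 2])

# Row-wise concatenation (the default, axis=0).
stacked = pd.concat([df1, df2])

# Column-wise concatenation, aligned on the shared index.
side_by_side = pd.concat([df1, df3], axis=1)

print(stacked)
print(side_by_side)
```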
Duplicate indices
One important difference between np.concatenate() and pd.concat() is that Pandas concatenation
preserves indices, even if the result will have duplicate indices.
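A small sketch of this behavior, using two made-up frames that both carry the default integer index:

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# Both inputs are indexed 0, 1; concatenation keeps those labels as they are.
result = pd.concat([x, y])
print(result)
print(result.index.is_unique)  # the repeated labels make the index non-unique
```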
Ignoring the index
Sometimes the index itself does not matter, and you would prefer it to simply
be ignored. You can specify this option using the ignore_index flag. With this
set to True, the concatenation will create a new integer index for the resulting
Series.
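For instance, with two made-up Series that share the same default index:

```python
import pandas as pd

s1 = pd.Series(["A", "B"])
s2 = pd.Series(["C", "D"])

# ignore_index=True discards the input labels and numbers the result 0..n-1.
result = pd.concat([s1, s2], ignore_index=True)
print(result)
```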
Adding MultiIndex keys
Another alternative is to use the keys option to specify a label for the
data sources; the result will be a hierarchically indexed series
containing the data.
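A short sketch of the keys option; the source labels "x" and "y" are hypothetical names chosen for this example.

```python
import pandas as pd

s1 = pd.Series(["A", "B"])
s2 = pd.Series(["C", "D"])

# Each key labels one input; the result gets a two-level (source, original) index.
result = pd.concat([s1, s2], keys=["x", "y"])
print(result)

# The outer level can then be used to recover one source.
print(result.loc["y"])
```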
Combining Datasets: Merge and Join
One essential feature offered by Pandas is its high-performance, in-memory join and merge
operations.
If you have ever worked with databases, you should be familiar with this type of data
interaction. The main interface for this is the pd.merge() function.
Categories of Joins
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one,
and many-to-many joins.
All three types of joins are accessed via an identical call to the pd.merge() interface; the type of
join performed depends on the form of the input data.
One-to-one joins
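A minimal sketch of a one-to-one join, assuming two made-up employee tables in which every employee appears exactly once in each:

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

# pd.merge() finds the shared "employee" column and matches rows one to one,
# even though the two inputs list the employees in a different order.
df3 = pd.merge(df1, df2)
print(df3)
```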
Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the
key column in both the left and right inputs contains duplicates, then the result is a
many-to-many merge.
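To make this concrete, here is a sketch with made-up data in which the "group" key repeats on both sides; every matching pair of rows appears in the output.

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
skills = pd.DataFrame({"group": ["Accounting", "Accounting",
                                 "Engineering", "Engineering", "HR", "HR"],
                       "skills": ["math", "spreadsheets", "coding", "linux",
                                  "spreadsheets", "organization"]})

# "Engineering" appears twice on the left and twice on the right,
# so it contributes 2 x 2 = 4 rows to the result.
merged = pd.merge(df1, skills)
print(merged)
```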
Specification of the Merge Key
We’ve already seen the default behavior of pd.merge(): it looks for
one or more matching column names between the two inputs, and
uses them as the key.
However, often the column names will not match so nicely, and
pd.merge() provides a variety of options for handling this.
The on keyword
Most simply, you can explicitly specify the name of the key column using the on keyword,
which takes a column name or a list of column names:
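A brief sketch with made-up employee tables; both forms of the on argument give the same result here.

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

# Name the key column explicitly; it must exist in both inputs.
merged = pd.merge(df1, df2, on="employee")

# A list of column names is accepted as well.
merged_list = pd.merge(df1, df2, on=["employee"])
print(merged)
```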
The left_on and right_on keywords
At times you may wish to merge two datasets with different column names; for example, we
may have a dataset in which the employee name is labeled as “name” rather than “employee”.
In this case, we can use the left_on and right_on keywords to specify the two column names.
The result has a redundant column that we can drop if desired, for
example by using the drop() method of DataFrames.
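The following sketch walks through both steps. The salary figures are invented for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
# Here the employee name lives in a column called "name" instead.
df3 = pd.DataFrame({"name": ["Bob", "Jake", "Lisa", "Sue"],
                    "salary": [70000, 80000, 120000, 90000]})

# Match "employee" on the left against "name" on the right.
merged = pd.merge(df1, df3, left_on="employee", right_on="name")

# Both key columns survive the merge, so drop the redundant one.
cleaned = merged.drop("name", axis=1)
print(cleaned)
```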
The left_index and right_index keywords
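Rather than merging on a column, pd.merge() can merge on the index when left_index and/or right_index is set to True. A minimal sketch, assuming made-up employee tables that have been re-indexed by employee name:

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "hire_date": [2008, 2012, 2004, 2014]})

# Move the key column into the index of each input.
df1a = df1.set_index("employee")
df2a = df2.set_index("employee")

# Use the index of both inputs as the merge key.
merged = pd.merge(df1a, df2a, left_index=True, right_index=True)
print(merged)
```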
Aggregation
We’re now familiar with GroupBy aggregations with sum(), median(), and the like, but the
aggregate() method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
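A quick sketch combining these forms, using a small made-up frame. The helper value_range is a hypothetical function written for this example; it is not part of Pandas.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

def value_range(s):
    """Hypothetical custom aggregate: spread between largest and smallest value."""
    return s.max() - s.min()

# A list mixing string names and a function; each entry becomes a sub-column.
result = df.groupby("key").aggregate(["min", "median", value_range])
print(result)
```

The result has two-level columns, pairing each data column with each aggregate.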
Another useful pattern is to pass a dictionary mapping column
names to operations to be applied on that column.
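For example, with the same kind of made-up frame, each column can receive its own aggregate:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

# Minimum of data1 and maximum of data2, computed per group.
result = df.groupby("key").aggregate({"data1": "min", "data2": "max"})
print(result)
```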
Filtering
A filtering operation allows you to drop data based on the group properties. For example, we
might want to keep all groups in which the standard deviation is larger than some critical value.
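A sketch of such a filter on a small made-up frame; the threshold of 4 and the choice of the data2 column are arbitrary assumptions.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

def filter_func(group):
    # Return True to keep the whole group, False to drop it.
    return group["data2"].std() > 4

# Group A has a small spread in data2 and is dropped; B and C are kept.
filtered = df.groupby("key").filter(filter_func)
print(filtered)
```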
Transformation
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine. For such a transformation, the output is the
same shape as the input.
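One common use, shown here as a sketch on made-up data, is centering each column by subtracting its group-wise mean:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

# The lambda receives one column of one group at a time and returns values
# of the same length, so the output lines up row for row with the input.
centered = df.groupby("key")[["data1", "data2"]].transform(lambda x: x - x.mean())
print(centered)
```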
The apply() method
The apply() method lets you apply an arbitrary function to the group results. The function
should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a
scalar; the combine operation will be tailored to the type of output returned.
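The sketch below shows both kinds of return value on made-up data. The function norm_by_data2 is a hypothetical example, and selecting the data columns explicitly and passing group_keys=False are choices made here so that the result keeps the original row labels.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

def norm_by_data2(group):
    # Returns a DataFrame: data1 divided by the group's total of data2.
    return group.assign(data1=group["data1"] / group["data2"].sum())

# DataFrame output: the pieces are combined back into one frame.
normalized = (df.groupby("key", group_keys=False)[["data1", "data2"]]
                .apply(norm_by_data2))

# Scalar output: the results are combined into a Series indexed by group.
spread = df.groupby("key")["data2"].apply(lambda s: s.max() - s.min())

print(normalized)
print(spread)
```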
A list, array, series, or index providing the grouping keys
The key can be any Series or list with a length matching that of the DataFrame.
Of course, this means there’s another, more verbose way of
accomplishing the df.groupby('key') from before: passing the column itself, as in df.groupby(df['key']).
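Both points can be sketched on a small made-up frame. The list of labels is arbitrary, and the numeric columns are selected explicitly so that only they are summed.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

# One label per row: rows 0, 2 and 5 form group 0, rows 1 and 3 group 1, row 4 group 2.
labels = [0, 1, 0, 1, 2, 0]
by_list = df.groupby(labels)[["data1", "data2"]].sum()

# Passing the column itself is the verbose equivalent of groupby("key").
by_series = df.groupby(df["key"])[["data1", "data2"]].sum()
by_name = df.groupby("key")[["data1", "data2"]].sum()

print(by_list)
print(by_series)
```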
A dictionary or series mapping index to group
Another method is to provide a dictionary that maps index values to the group keys.
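A sketch of this, assuming a made-up frame indexed by its key column and a hypothetical mapping from letters to the labels "vowel" and "consonant":

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

# The dictionary is looked up against the index, so put the key there first.
df2 = df.set_index("key")
mapping = {"A": "vowel", "B": "consonant", "C": "consonant"}

result = df2.groupby(mapping).sum()
print(result)
```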
Any Python function
Similar to mapping, you can pass any Python function that takes
the index value as input and returns the group.
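As a sketch on the same kind of made-up frame, the built-in str.lower can serve as the grouping function, since each index value is a string:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})
df2 = df.set_index("key")

# str.lower is called on each index value; its return value names the group.
result = df2.groupby(str.lower).mean()
print(result)
```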
A list of valid keys
Further, any of the preceding key choices can be combined to group
on a multi-index.
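For illustration, the function and the dictionary from the two preceding key types can be passed together in a list; the frame and the mapping are again made up.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})
df2 = df.set_index("key")
mapping = {"A": "vowel", "B": "consonant", "C": "consonant"}

# Each entry in the list contributes one level of the resulting group index.
result = df2.groupby([str.lower, mapping]).mean()
print(result)
```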
Reference
VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data.
O'Reilly Media, Inc.