Merge, Join, and Concatenate
Concatenating objects
The concat() function (in the main pandas namespace) does all of the heavy lifting of
performing concatenation operations along an axis while performing optional set logic (union or
intersection) of the indexes (if any) on the other axes. Note that I say “if any” because there is
only a single possible axis of concatenation for Series.
Before diving into all of the details of concat and what it can do, here is a simple example:
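A minimal sketch of the setup this example assumes (the frame contents mirror the A0-D11 naming used in the outputs below):

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

# glue the three frames together along the rows (axis=0, the default)
frames = [df1, df2, df3]
result = pd.concat(frames)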
Without a little bit of context, many of concat's arguments don't make much sense. Let's revisit the above example. Suppose we wanted to associate specific keys with each of the pieces of the chopped-up DataFrame. We can do this using the keys argument:
In [6]: result = pd.concat(frames, keys=['x', 'y', 'z'])
As you can see (if you've read the rest of the documentation), the resulting object's index is hierarchical. This means that we can now select out each chunk by key:
In [7]: result.loc['y']
Out[7]:
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
It’s not a stretch to see how this can be very useful. More detail on this functionality below.
Note
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that
constantly reusing this function can create a significant performance hit. If you need to use the
operation over several datasets, use a list comprehension.
frames = [process_your_file(f) for f in files]
result = pd.concat(frames)
When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one being concatenated). This can be done in the following two ways:

- Take the union of them all, join='outer'. This is the default option as it results in zero information loss.
- Take the intersection, join='inner', keeping only labels shared by all frames.
Warning
The default behavior with join='outer' is to sort the other axis (columns in this case). In a future
version of pandas, the default will be to not sort. We specified sort=False to opt in to the new
behavior now.
Out[12]:
A B C D B D F
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
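The calls behind this output were lost; a sketch consistent with it, assuming a df4 that overlaps df1 on rows 2 and 3:

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

# union of the row labels (join='outer', the default)
result = pd.concat([df1, df4], axis=1, sort=False)

# intersection of the row labels, keeping only rows 2 and 3
# (this produces the table shown above)
result = pd.concat([df1, df4], axis=1, join='inner')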
Concatenating using append
For DataFrame objects which don’t have a meaningful index, you may wish to append them and
ignore the fact that they may have overlapping indexes. To do this, use
the ignore_index argument:
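A sketch using the df1 and df4 frames from above:

# append stacks the rows and renumbers the result from 0
result = df1.append(df4, ignore_index=True)

# the same thing expressed with concat
result = pd.concat([df1, df4], ignore_index=True)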
Note
Since we’re concatenating a Series to a DataFrame, we could have achieved the same result
with DataFrame.assign(). To concatenate an arbitrary number of pandas objects
(DataFrame or Series), use concat.
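For instance, with an assumed Series s1, either spelling adds the same new column:

s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')

# concatenate a Series to a DataFrame along the columns
result = pd.concat([df1, s1], axis=1)

# equivalent for this single-Series case
result = df1.assign(X=s1)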
A fairly common use of the keys argument is to override the column names when creating a new DataFrame based on existing Series. Notice how, by default, the resulting DataFrame inherits each parent Series' name, when one exists.
Out[26]:
foo 0 1
0 0 0 0
1 1 1 1
2 2 2 4
3 3 3 5
Out[27]:
0 0 0 0
1 1 1 1
2 2 2 4
3 3 3 5
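A sketch consistent with the two outputs above; the series contents are read off the output, while the key names are assumptions:

s3 = pd.Series([0, 1, 2, 3], name='foo')
s4 = pd.Series([0, 1, 2, 3])
s5 = pd.Series([0, 1, 4, 5])

# default: each column inherits its parent Series' name, if any
pd.concat([s3, s4, s5], axis=1)

# keys override the inherited/positional column names
pd.concat([s3, s4, s5], axis=1, keys=['red', 'blue', 'yellow'])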
The resulting MultiIndex has levels constructed from the passed keys and the index of the concatenated DataFrames:

In [32]: result.index.levels
If you wish to specify other levels (as will occasionally be the case), you can do so using
the levels argument:
In [33]: result = pd.concat(frames, keys=['x', 'y', 'z'],
   ....:                    levels=[['z', 'y', 'x', 'w']],
   ....:                    names=['group_key'])
In [34]: result.index.levels
This is fairly esoteric, but it is actually necessary for implementing things like GroupBy where
the order of a categorical variable is meaningful.
While not especially efficient (since a new object must be created), you can append a single row
to a DataFrame by passing a Series or dict to append, which returns a new DataFrame as above.
You should use ignore_index with this method to instruct DataFrame to discard its index. If you
wish to preserve the index, you should construct an appropriately-indexed DataFrame and
append or concatenate those objects.
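A sketch of this pattern, with an assumed Series s2 whose labels line up with df1's columns:

# a row whose index labels match df1's column names
s2 = pd.Series(['X0', 'X1', 'X2', 'X3'], index=['A', 'B', 'C', 'D'])

# appending with ignore_index=True discards df1's original index
result = df1.append(s2, ignore_index=True)

# a dict works the same way
result = df1.append({'A': 'X0', 'B': 'X1', 'C': 'X2', 'D': 'X3'},
                    ignore_index=True)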
Database-style DataFrame or named Series joining/merging
pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL. Users who are familiar with SQL but new to pandas might be interested in a comparison with SQL.
pandas provides a single function, merge(), as the entry point for all standard database join
operations between DataFrame or named Series objects:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
The return type will be the same as left. If left is a DataFrame or named Series and right is a
subclass of DataFrame, the return type will still be DataFrame.
Experienced users of relational databases like SQL will be familiar with the terminology used to describe join operations between two SQL-table like structures (DataFrame objects). There are several cases to consider which are very important to understand:

- one-to-one joins: for example when joining two DataFrame objects on their indexes (which must contain unique values).
- many-to-one joins: for example when joining an index (unique) to one or more columns in a different DataFrame.
- many-to-many joins: joining columns on columns.
Note
When joining columns on columns (potentially a many-to-many join), any indexes on the
passed DataFrame objects will be discarded.
It is worth spending some time understanding the result of the many-to-many join case. In SQL
/ standard relational algebra, if a key combination appears more than once in both tables, the
resulting table will have the Cartesian product of the associated data. Here is a very basic
example with one unique key combination:
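A sketch of such a setup, with assumed frame contents:

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# each key appears exactly once on each side, so each row pairs up once
result = pd.merge(left, right, on='key')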
Here is a more complicated example with multiple join keys. Only the keys appearing
in left and right are present (the intersection), since how='inner' by default.
In [42]: left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
   ....:                      'key2': ['K0', 'K1', 'K0', 'K1'],
   ....:                      'A': ['A0', 'A1', 'A2', 'A3'],
   ....:                      'B': ['B0', 'B1', 'B2', 'B3']})
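The matching right frame and the merge call, reconstructed along the same pattern (values assumed):

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# how='inner' is the default: only (key1, key2) pairs present
# in both frames survive
result = pd.merge(left, right, on=['key1', 'key2'])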
The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:

Merge method   SQL Join Name      Description
left           LEFT OUTER JOIN    Use keys from left frame only
right          RIGHT OUTER JOIN   Use keys from right frame only
outer          FULL OUTER JOIN    Use union of keys from both frames
inner          INNER JOIN         Use intersection of keys from both frames
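With the left and right frames from the multiple-join-key example above, the four variants look like:

result = pd.merge(left, right, how='left', on=['key1', 'key2'])
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
result = pd.merge(left, right, how='inner', on=['key1', 'key2'])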
Warning
Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
Users can use the validate argument to automatically check whether there are unexpected
duplicates in their merge keys. Key uniqueness is checked before merge operations and so
should protect against memory overflows. Checking key uniqueness is also a good way to
ensure user data structures are as expected.
In the following example, there are duplicate values of B in the right DataFrame. As this is not a
one-to-one merge – as specified in the validate argument – an exception will be raised.
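A sketch that reproduces this failure; the frame contents are inferred from the Out[54] table further below:

left = pd.DataFrame({'A': [1, 2], 'B': [1, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# B is duplicated on the right, so one_to_one validation fails
result = pd.merge(left, right, on='B', how='outer', validate='one_to_one')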
...
MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
If the user is aware of the duplicates in the right DataFrame but wants to ensure there are no
duplicates in the left DataFrame, one can use the validate='one_to_many' argument instead,
which will not raise an exception.
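With the same assumed frames, the call producing the output below would be:

# left is unique on B, so one_to_many validation passes
pd.merge(left, right, on='B', how='outer', validate='one_to_many')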
Out[54]:
A_x B A_y
0 1 1 NaN
1 2 2 4.0
2 2 2 5.0
3 2 2 6.0
merge() accepts the argument indicator. If True, a Categorical-type column called _merge is added to the output, taking the value left_only, right_only, or both, according to where each row's merge key was found.

Out[57]:
   col1 col_left  col_right     _merge
0     0        a        NaN  left_only
1     1        b        2.0       both
The indicator argument will also accept string arguments, in which case the indicator function
will use the value of the passed string as the name for the indicator column.
Out[58]:
   col1 col_left  col_right indicator_column
0     0        a        NaN        left_only
1     1        b        2.0             both
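A sketch consistent with the two outputs above (column names and contents assumed; with these inputs the outer merge also produces right_only rows for keys present only in df2):

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

# boolean form: adds a Categorical column named _merge
pd.merge(df1, df2, on='col1', how='outer', indicator=True)

# string form: same flags, custom column name
pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')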
Merge dtypes
Merging will preserve the dtype of the join keys.

In [59]: left = pd.DataFrame({'key': [1], 'v1': [10]})

In [60]: left
Out[60]:
key v1
0 1 10
In [61]: right = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})

In [62]: right
Out[62]:
key v1
0 1 20
1 2 30
In [63]: pd.merge(left, right, how='outer')
Out[63]:
key v1
0 1 10
1 1 20
2 2 30
In [64]: pd.merge(left, right, how='outer').dtypes
Out[64]:
key int64
v1 int64
dtype: object
Of course if you have missing values that are introduced, then the resulting dtype will be upcast.
In [65]: pd.merge(left, right, how='outer', on='key')
Out[65]:
   key  v1_x  v1_y
0    1  10.0    20
1    2   NaN    30
In [66]: pd.merge(left, right, how='outer', on='key').dtypes
Out[66]:
key int64
v1_x float64
v1_y int64
dtype: object
Merging will preserve category dtypes of the mergands. See also the section on categoricals.
In [67]: from pandas.api.types import CategoricalDtype

In [68]: X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))

In [69]: X = X.astype(CategoricalDtype(categories=['foo', 'bar']))

In [70]: left = pd.DataFrame({'X': X,
   ....:                      'Y': np.random.choice(['one', 'two', 'three'],
   ....:                                            size=(10,))})
In [71]: left
Out[71]:
X Y
0 bar one
1 foo one
2 foo three
3 bar three
4 foo one
5 bar one
6 bar three
7 bar three
8 bar three
9 foo three
In [72]: left.dtypes
Out[72]:
X category
Y object
dtype: object
In [73]: right = pd.DataFrame({'X': pd.Series(['foo', 'bar'],
   ....:                                      dtype=CategoricalDtype(['foo', 'bar'])),
   ....:                       'Z': [1, 2]})
In [74]: right
Out[74]:
X Z
0 foo 1
1 bar 2
In [75]: right.dtypes
Out[75]:
X category
Z int64
dtype: object
The merged result:
In [76]: result = pd.merge(left, right, how='outer')

In [77]: result
Out[77]:
X Y Z
0 bar one 2
1 bar three 2
2 bar one 2
3 bar three 2
4 bar three 2
5 bar three 2
6 foo one 1
7 foo three 1
8 foo one 1
9 foo three 1
In [78]: result.dtypes
Out[78]:
X category
Y object
Z int64
dtype: object
Note
The category dtypes must be exactly the same, meaning the same categories and the ordered
attribute. Otherwise the result will coerce to object dtype.
Note
Merging on category dtypes that are the same can be quite performant compared
to object dtype merging.
Joining on index
DataFrame.join() is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. join() also takes an optional on argument which may be a column or multiple column names that the passed DataFrame is to be aligned on. These two function calls are completely equivalent:

left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True,
         how='left', sort=False)
Obviously you can choose whichever form you find more convenient. For many-to-one joins (where one of the DataFrames is already indexed by the join key), using join may be more convenient. Here is a simple example:
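A sketch with assumed frame contents:

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key': ['K0', 'K1', 'K0', 'K1']})

right = pd.DataFrame({'C': ['C0', 'C1'],
                      'D': ['D0', 'D1']},
                     index=['K0', 'K1'])

# align left's 'key' column against right's index
result = left.join(right, on='key')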
To join on multiple keys, the passed DataFrame must have a MultiIndex:

In [90]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
   ....:                      'B': ['B0', 'B1', 'B2', 'B3'],
   ....:                      'key1': ['K0', 'K0', 'K1', 'K2'],
   ....:                      'key2': ['K0', 'K1', 'K0', 'K1']})

In [91]: index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
   ....:                                    ('K2', 'K0'), ('K2', 'K1')])

In [92]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
   ....:                       'D': ['D0', 'D1', 'D2', 'D3']},
   ....:                      index=index)
Now this can be joined by passing the two key column names:
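Presumably with the default left join:

result = left.join(right, on=['key1', 'key2'])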
The default for DataFrame.join is to perform a left join (essentially a “VLOOKUP” operation, for
Excel users), which uses only the keys found in the calling DataFrame. Other join types, for
example inner join, can be just as easily performed:
In [94]: result = left.join(right, on=['key1', 'key2'], how='inner')
As you can see, this drops any rows where there was no match.
You can also join a singly-indexed DataFrame with a level of a MultiIndexed DataFrame. The level will match on the name of the index of the singly-indexed frame against a level name of the MultiIndexed frame:
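A sketch with assumed contents:

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=pd.Index(['K0', 'K1', 'K2'], name='key'))

index = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
                                   ('K2', 'Y2'), ('K2', 'Y3')],
                                  names=['key', 'Y'])

right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']},
                     index=index)

# matches left's index name 'key' against the 'key' level of right
result = left.join(right, how='inner')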
This is equivalent to, but less verbose and more memory efficient / faster than, the reset_index plus merge alternative sketched below.
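A sketch of that more verbose alternative:

result = pd.merge(left.reset_index(), right.reset_index(),
                  on=['key'], how='inner').set_index(['key', 'Y'])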
Joining with two MultiIndexes
This is supported in a limited way, provided that the index for the right argument is completely
used in the join, and is a subset of the indices in the left argument, as in this example:
In [100]: leftindex = pd.MultiIndex.from_product([list('abc'), list('xy'), [1, 2]],
   .....:                                        names=['abc', 'xy', 'num'])

In [101]: left = pd.DataFrame({'v1': range(12)}, index=leftindex)
In [102]: left
Out[102]:
v1
abc xy num
a x 1 0
2 1
y 1 2
2 3
b x 1 4
2 5
y 1 6
2 7
c x 1 8
2 9
y 1 10
2 11
In [103]: rightindex = pd.MultiIndex.from_product([list('abc'), list('xy')],
   .....:                                         names=['abc', 'xy'])

In [104]: right = pd.DataFrame({'v2': [100 * i for i in range(1, 7)]},
   .....:                      index=rightindex)
In [105]: right
Out[105]:
v2
abc xy
a x 100
y 200
b x 300
y 400
c x 500
y 600
In [106]: left.join(right, on=['abc', 'xy'], how='inner')
Out[106]:
v1 v2
abc xy num
a x 1 0 100
2 1 100
y 1 2 200
2 3 200
b x 1 4 300
2 5 300
y 1 6 400
2 7 400
c x 1 8 500
2 9 500
y 1 10 600
2 11 600
If that condition is not satisfied, a join with two multi-indexes can be done using the following
code.
In [107]: leftindex = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
   .....:                                        ('K1', 'X2')],
   .....:                                       names=['key', 'X'])

In [108]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
   .....:                      'B': ['B0', 'B1', 'B2']},
   .....:                     index=leftindex)

In [109]: rightindex = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
   .....:                                         ('K2', 'Y2'), ('K2', 'Y3')],
   .....:                                        names=['key', 'Y'])

In [110]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
   .....:                       'D': ['D0', 'D1', 'D2', 'D3']},
   .....:                      index=rightindex)

In [111]: result = pd.merge(left.reset_index(), right.reset_index(),
   .....:                   on=['key'], how='inner').set_index(['key', 'X', 'Y'])
Merging on a combination of columns and index levels
Strings passed as the on, left_on, and right_on parameters may refer to either column names or index level names. This makes it possible to merge DataFrame instances on a combination of index levels and columns without resetting indexes.
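A sketch with assumed contents, where key1 is an index level and key2 is a column in both frames:

left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key2': ['K0', 'K1', 'K0', 'K1']},
                    index=left_index)

right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'key2': ['K0', 'K0', 'K0', 'K1']},
                     index=right_index)

# 'key1' resolves to an index level, 'key2' to a column
result = left.merge(right, on=['key1', 'key2'])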
Note
When DataFrames are merged on a string that matches an index level in both frames, the index
level is preserved as an index level in the resulting DataFrame.
Note
When DataFrames are merged using only some of the levels of a MultiIndex, the extra levels
will be dropped from the resulting merge. In order to preserve those levels, use reset_index on
those level names to move those levels to columns prior to doing the merge.
Note
If a string matches both a column name and an index level name, then a warning is issued and
the column takes precedence. This will result in an ambiguity error in a future version.
A list or tuple of DataFrames can also be passed to join() to join them together on their indexes.
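A sketch with assumed contents:

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'B': ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2'])
right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K2'])

# join a list of DataFrames on their indexes in one call
result = left.join([right, right2])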
Merging together values within Series or DataFrame columns
Another fairly common situation is to have two like-indexed (or similarly indexed) objects and to want to "patch" values in one object from values for matching indices in the other; this is what combine_first() does. Note that this method only takes values from the right DataFrame if they are missing in the left DataFrame. A related method, update(), alters non-NA values in place:
In [129]: df1.update(df2)
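A sketch contrasting the two methods, with assumed contents:

import numpy as np

df1 = pd.DataFrame([[np.nan, 3.0, 5.0],
                    [-4.6, np.nan, np.nan],
                    [np.nan, 7.0, np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2],
                    [-5.0, 1.6, 4.0]],
                   index=[1, 2])

# fill df1's NaNs from df2, leaving existing values alone
result = df1.combine_first(df2)

# overwrite df1 wherever df2 has non-NA values, in place
df1.update(df2)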
Timeseries friendly merging
A merge_ordered() function allows combining time series and other ordered data. In particular
it has an optional fill_method keyword to fill/interpolate missing data:
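A reconstruction of inputs consistent with Out[132] below:

left = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                     'lv': [1, 2, 3, 4],
                     's': ['a', 'b', 'c', 'd']})

right = pd.DataFrame({'k': ['K1', 'K2', 'K4'],
                      'rv': [1, 2, 3]})

# forward-fill missing left-hand values within each 's' group
pd.merge_ordered(left, right, fill_method='ffill', left_by='s')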
Out[132]:
k lv s rv
0 K0 1.0 a NaN
1 K1 1.0 a 1.0
2 K2 1.0 a 2.0
3 K4 1.0 a 3.0
4 K1 2.0 b 1.0
5 K2 2.0 b 2.0
6 K4 2.0 b 3.0
7 K1 3.0 c 1.0
8 K2 3.0 c 2.0
9 K4 3.0 c 3.0
10 K1 NaN d 1.0
11 K2 4.0 d 2.0
12 K4 4.0 d 3.0
Merging asof
A merge_asof() is similar to an ordered left-join except that we match on nearest key rather than equal keys. For each row in the left DataFrame, we select the last row in the right DataFrame whose on key is less than or equal to the left's key. Both DataFrames must be sorted by the key.
Optionally an asof merge can perform a group-wise merge. This matches the by key equally, in
addition to the nearest match on the on key.
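A minimal sketch of trades and quotes consistent with the calls below (times, tickers, and prices are assumed; both frames are sorted by time):

trades = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.038',
                            '20160525 13:30:00.048']),
    'ticker': ['MSFT', 'MSFT', 'GOOG'],
    'price': [51.95, 51.95, 720.77],
    'quantity': [75, 155, 100]})

quotes = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.030',
                            '20160525 13:30:00.041']),
    'ticker': ['GOOG', 'MSFT', 'MSFT'],
    'bid': [720.50, 51.97, 51.99],
    'ask': [720.93, 51.98, 52.00]})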
pd.merge_asof(trades, quotes,
              on='time',
              by='ticker')
We only asof within 2ms between the quote time and the trade time.
pd.merge_asof(trades, quotes,
              on='time',
              by='ticker',
              tolerance=pd.Timedelta('2ms'))
We only asof within 10ms between the quote time and the trade time and we exclude exact
matches on time. Note that though we exclude the exact matches (of the quotes), prior
quotes do propagate to that point in time.
pd.merge_asof(trades, quotes,
              on='time',
              by='ticker',
              tolerance=pd.Timedelta('10ms'),
              allow_exact_matches=False)