[BUG]: DataFrame.join inconsistent behavior, accepts overlapping columns provided suffixes is specified #13659

Open
dragonator4 opened this issue Jul 14, 2016 · 4 comments
Labels
Enhancement · Error Reporting (Incorrect or improved errors from pandas) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@dragonator4

dragonator4 commented Jul 14, 2016

Here is sample code to reproduce the behavior:

In [1]: import numpy as np
        import pandas as pd
        df1 = pd.DataFrame(np.random.rand(5, 2))
        df2 = pd.DataFrame(np.random.rand(5, 2))

In [2]: df2.join(df1)
---------------------------------------------------------------------------
ValueError: columns overlap but no suffix specified: RangeIndex(start=0, stop=2, step=1)

In [3]: df2.join(df1, lsuffix='_x', rsuffix='_x')
Out[3]:     0_x         1_x         0_x         1_x
        0   0.904888    0.491802    0.509346    0.367847
        1   0.282420    0.092652    0.672786    0.358450
        2   0.339018    0.318990    0.359977    0.640366
        3   0.775293    0.767872    0.820965    0.018728
        4   0.543648    0.412799    0.650457    0.712789

So ultimately one does get a merged DataFrame with overlapping column names. Then why raise an error in the first place?

Note: I am using the latest pandas, Python and NumPy.

@sinhrks sinhrks added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jul 15, 2016
@sinhrks
Member

sinhrks commented Jul 15, 2016

I think this is meant as a safeguard. People don't want an Index with duplicates because it is confusing and not very performant. One idea is to change the default behavior to add a suffix by some rule.
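
To illustrate why duplicate labels can be confusing, here is a small sketch. The frame and its values are made up for illustration and only mimic the shape of the result in the report above.

```python
import pandas as pd

# Hypothetical frame whose two columns share the label "0_x",
# mimicking the result of join(lsuffix='_x', rsuffix='_x').
dup = pd.DataFrame([[1, 2], [3, 4]], columns=["0_x", "0_x"])

# The column index is no longer unique.
print(dup.columns.is_unique)

# Selecting by label returns every matching column,
# so the result is a DataFrame rather than a Series.
selected = dup["0_x"]
print(type(selected).__name__, selected.shape)
```

Code that expects `dup["0_x"]` to be a single Series would behave differently here, which is one way duplicates can cause trouble downstream.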

@jorisvandenbossche
Member

@dragonator4 In any case, when passing such suffixes, it is the deliberate choice of the user to have duplicate names, so I think that justifies the difference in behaviour.

BTW, if you just want to join dataframes on the index without worrying about column names (because, e.g., in your example the column names go from int to string), you can also use concat.
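
A minimal sketch of the concat approach, assuming two frames shaped like those in the original report (the seeded random generator is only there to make the example reproducible):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((5, 2)))
df2 = pd.DataFrame(rng.random((5, 2)))

# axis=1 aligns the frames on their index and places the columns side by side.
# Column labels are kept as they are (integers here), duplicates included.
combined = pd.concat([df2, df1], axis=1)
print(combined.columns.tolist())
```

Note that the labels stay as the integers 0 and 1 on both sides, so the result still has duplicate column names; concat simply does not complain about them.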

@dragonator4
Author

@jorisvandenbossche I stumbled across this when I was trying to do a proper join where I cared about my indexes. It completely flummoxed me for all of 5 minutes, then I realised that I passed rsuffix = '_x' instead of '_y'. This is an issue because had I not checked my output, I could have run into some serious trouble.

Perhaps it should raise an error when suffixes are not passed and there are duplicate columns, as it does. But then perhaps it should warn if the passed suffixes also cause duplicate column names. That way you cater to the deliberate choice of the user as you put it.
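
The suggested warning could be sketched roughly as follows. Both `find_suffix_collisions` and `join_with_warning` are hypothetical helpers written for this illustration, not part of pandas, and the naming rule (label followed by suffix, applied only to overlapping labels) is an assumption about how the joined names are formed.

```python
import warnings

import pandas as pd


def find_suffix_collisions(left, right, lsuffix="", rsuffix=""):
    """Return labels that would appear on both sides after suffixing.

    Hypothetical helper, not part of pandas.
    """
    overlap = set(left.columns) & set(right.columns)
    left_names = [f"{c}{lsuffix}" if c in overlap else c for c in left.columns]
    right_names = [f"{c}{rsuffix}" if c in overlap else c for c in right.columns]
    right_set = set(right_names)
    return [name for name in left_names if name in right_set]


def join_with_warning(left, right, lsuffix="", rsuffix="", **kwargs):
    """Call DataFrame.join, warning first if the suffixes still collide."""
    collisions = find_suffix_collisions(left, right, lsuffix, rsuffix)
    if collisions and (lsuffix or rsuffix):
        warnings.warn(
            f"suffixes produce duplicate column names: {collisions}",
            UserWarning,
            stacklevel=2,
        )
    return left.join(right, lsuffix=lsuffix, rsuffix=rsuffix, **kwargs)


left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

print(find_suffix_collisions(left, right, "_x", "_x"))  # ['a_x', 'b_x']
print(find_suffix_collisions(left, right, "_x", "_y"))  # []
```

With this shape, a mistyped `rsuffix='_x'` would surface as a warning before the join runs, while a user who deliberately wants duplicate names can still get them.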

@jeswcollins

I found the error message confusing. columns has a specific meaning as a Pandas DataFrame parameter, but it also has a broader meaning. In the broadest sense of the word "column", we might interpret it to include the specific Pandas term index as a column. Indeed, the index appears in a columnar format in stdout.

So I expected the index "column" to overlap in my two dataframes. That's what I was trying to join on!

I'm not sure how to rephrase the error message, but either a clearer error message or joining with default suffixes, as in df.merge(df, right_index=True, left_index=True), would seem preferable.
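
For comparison, a short sketch of the merge call mentioned above. The two frames and their string column labels are assumptions chosen for illustration.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# Joining on the index via merge: overlapping labels receive the
# default suffixes ('_x', '_y') instead of raising an error.
merged = left.merge(right, left_index=True, right_index=True)
print(merged.columns.tolist())  # ['a_x', 'b_x', 'a_y', 'b_y']
```

The difference comes from the defaults: merge defaults to `suffixes=('_x', '_y')`, whereas join defaults to empty suffixes and therefore raises on overlap.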

5 participants