Segfault when modifying pandas.DataFrame in-place after creating from numpy recarray #6026

Closed
bburan-galenea opened this issue Jan 21, 2014 · 6 comments · Fixed by #6031
Labels
Bug Internals Related to non-user accessible pandas implementation

@bburan-galenea
Contributor

The following code generates a segfault when use_records is True. This segfault only occurs when the DataFrame is generated from a record array and then I attempt to modify the series in-place. This was not an issue in the previous release (v0.12) of Pandas. Tested this code against 0.13.0-268-g08c1302.

import numpy as np
import pandas

data = [('right', 'left', 'left', 'left', 'right', 'left', 'timeout')]

use_records = True
if use_records:
    recarray = np.rec.fromarrays(data, names=['response'])
    df = pandas.DataFrame(recarray)
else:
    df = pandas.DataFrame({'response': data[0]})
mask = df.response == 'timeout'
df.response[mask] = 'none'
@jreback
Contributor

jreback commented Jan 21, 2014

prob need to copy the rec array elements as they go in; numpy might be sharing it somehow; this doesn't happen with a regular numpy array, but maybe the sharing is different with a rec array.

want to do a PR for this?
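
A minimal sketch of the "copy the rec array elements as they go in" idea, done on the user side rather than inside pandas. It assumes nothing about how the eventual fix was implemented; the `columns` dict and the `source_last` / `frame_last` names are illustrative only.

```python
import numpy as np
import pandas as pd

data = [('right', 'left', 'left', 'left', 'right', 'left', 'timeout')]
recarray = np.rec.fromarrays(data, names='response')

# Copy every field explicitly so the DataFrame cannot share memory
# with the record array (illustrative workaround, not the pandas fix).
columns = {name: recarray[name].copy() for name in recarray.dtype.names}
df = pd.DataFrame(columns)

# Modify the frame with a single indexing call.
df.loc[df['response'] == 'timeout', 'response'] = 'none'

source_last = str(recarray['response'][-1])  # value still held by the record array
frame_last = df['response'].iloc[-1]         # value now held by the DataFrame
print(source_last, frame_last)
```

With the fields copied up front, later edits to the DataFrame leave the original record array untouched.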

@bburan-galenea
Contributor Author

I upgraded Numpy from 1.7.1 to 1.8 and the problem went away (BTW, the problem was present even for regular arrays, so it must have been a bug in Numpy 1.7.1 that was fixed).

@jreback
Contributor

jreback commented Jan 21, 2014

can you show what you were doing for a regular array? this shouldn't be the case....going to reopen (seg faults are bad...)

@jreback reopened this Jan 21, 2014
@bburan-galenea
Contributor Author

This will trigger the segfault as well (Numpy=1.7.1 & 1.7.2, Pandas=0.13.0):

import numpy as np
import pandas
data = ['right', 'left', 'left', 'left', 'right', 'left', 'timeout']
df = pandas.DataFrame({'response': np.array(data)})
mask = df.response == 'timeout'
df.response[mask] = 'none'
print df

Note that the print df is required to generate the segfault.

@jreback
Contributor

jreback commented Jan 21, 2014

You are assigning over a view, see here: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-view-versus-copy

Do this instead (you NEVER want to do this kind of chained assignment!).

In [5]: df.loc[mask,'response'] = 'none'

In [6]: df
Out[6]: 
  response
0    right
1     left
2     left
3     left
4    right
5     left
6     none

[7 rows x 1 columns]

It's exceedingly difficult for pandas to even detect this. I will look at this, but generally this is a bad idea.
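
For reference, a self-contained version of the recommended pattern, using the same data as the report. This is a sketch, not output from the thread; `result` is an illustrative name. The chained form `df.response[mask] = 'none'` is two separate indexing operations (a `__getitem__` that may return a view or a copy, then a `__setitem__` on that intermediate object), while `.loc` performs the whole assignment in one call on `df` itself.

```python
import numpy as np
import pandas as pd

data = ['right', 'left', 'left', 'left', 'right', 'left', 'timeout']
df = pd.DataFrame({'response': np.array(data)})

# Boolean mask selecting the rows to overwrite.
mask = df['response'] == 'timeout'

# One indexing call on the frame: rows chosen by mask, column by label.
df.loc[mask, 'response'] = 'none'

result = df['response'].tolist()
print(result)
```

Because the row selector and the column label go through a single `.loc` call, pandas never has to guess whether an intermediate object is a view or a copy.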

@bburan-galenea
Contributor Author

I understand. Thanks for the information! I wasn't aware of loc.
