
BUG: Replace in string series with NA #32621


Closed
albertotb opened this issue Mar 11, 2020 · 6 comments · Fixed by #32890
Labels
Bug; NA - MaskedArrays (Related to pd.NA and nullable extension arrays)
Milestone

Comments

@albertotb

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

rep = {'one': '1', 'two': '2'}
a = pd.Series(['one', 'two'], dtype='string')
b = pd.Series(['one', 'two', np.nan])
c = pd.Series(['one', 'two', np.nan], dtype='string')


# A: this works
a.replace(to_replace=rep)

# B: this also works
b.replace(to_replace=rep)

# C: this throws exception
c.replace(to_replace=rep)
# TypeError: Cannot compare types 'ndarray(dtype=object)' and 'str'

Problem description

pandas.Series.replace cannot be used with a dict of replacements on a Series of dtype string that contains <NA>.

Expected Output

I would expect C to output the same as B but with string dtype

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-112-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : es_ES.UTF-8
LOCALE : es_ES.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

@jorisvandenbossche
Member

@albertotb Thanks for the report!

Slightly related to #32075

It seems this works with scalars:

In [18]: c.replace('one', '1')
Out[18]: 
0       1
1     two
2    <NA>
dtype: string

but not with lists/dicts
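
On affected versions, one possible workaround builds on the scalar form: apply the replacements one pair at a time. This is only a minimal sketch, not a pandas API; `replace_each` is a hypothetical helper name. Because the pairs are applied sequentially, a replacement value that matches a later key would be replaced again, so it only matches the dict form when keys and values do not overlap.

```python
import numpy as np
import pandas as pd


def replace_each(series, mapping):
    """Hypothetical helper: apply scalar replacements one pair at a time."""
    result = series
    for old, new in mapping.items():
        # Scalar-to-scalar replace is the form reported to work with <NA>.
        result = result.replace(old, new)
    return result


rep = {'one': '1', 'two': '2'}
c = pd.Series(['one', 'two', np.nan], dtype='string')
result = replace_each(c, rep)
print(result)
```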

@jorisvandenbossche added the Bug and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) labels Mar 11, 2020
@jorisvandenbossche changed the title from "Replace in string series with NA" to "BUG: Replace in string series with NA" Mar 11, 2020
@chrispe
Contributor

chrispe commented Mar 12, 2020

Hi, can I try to pick this up?

@jorisvandenbossche
Member

Yes, that would be welcome!

@chrispe
Contributor

chrispe commented Mar 12, 2020

So, here are a few things I've noticed so far:

  • When the dtype of a pd.Series is set to string, the NaN values of the input array (which is stored in the Series) are converted to pd.NA.
  • When a dictionary/list is used to define the replacements (see variable rep in the first example), a different function is called underneath to apply them. That's why the scalar version shown by @jorisvandenbossche works fine. The replace code path is very different in that case (not sure yet why that is).
  • The function used to find where to apply the replacements uses operator.eq. That operator fails to make the element-wise comparison between a value and pd.NA and returns a single scalar (instead of an array of boolean values). That's why the TypeError exception is raised.
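
The first and last points can be illustrated at the scalar level. This is a rough sketch that only shows how pd.NA behaves in a comparison; it does not reproduce the internal replace code, and the array-level result of such a comparison depends on the NumPy version, so it is left out here.

```python
import operator

import numpy as np
import pandas as pd

c = pd.Series(['one', 'two', np.nan], dtype='string')

# With the nullable string dtype, the missing value is exposed as pd.NA.
missing = c.iloc[2]
print(missing is pd.NA)

# Comparing pd.NA with a string propagates NA instead of returning a bool.
comparison = operator.eq(pd.NA, 'one')
print(comparison is pd.NA)

# pd.NA has no truth value, so code that expects a boolean cannot use it.
try:
    bool(comparison)
    na_has_truth_value = True
except TypeError:
    na_has_truth_value = False
print(na_has_truth_value)
```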

I'm now investigating what's the best way of resolving this.

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020
The pd.NA values are replaced with np.nan before comparing the arrays/scalars
@chrispe
Contributor

chrispe commented Mar 21, 2020

I also noticed another issue. Have a look at this code:

import pandas as pd
import numpy as np

replacements = {'one': '1', 'two': '2'}

series_a = pd.Series(['one', 'two'], dtype='string')
for rep in replacements:
    series_a = series_a.replace(rep, replacements[rep])

series_b = pd.Series(['one', 'two'], dtype='string')
series_b = series_b.replace(to_replace=replacements)

It produces the following output:

In [4]: series_a
Out[4]:
0    1
1    2
dtype: string

In [5]: series_b
Out[5]:
0    1
1    2
dtype: object

Shouldn't they both return a pd.Series with a dtype of string?
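
On versions where the dict form comes back as object dtype, casting the result back is one way to keep the string dtype in the meantime. A minimal sketch using only `Series.astype`; it is a user-side workaround and not the fix adopted for this issue.

```python
import pandas as pd

replacements = {'one': '1', 'two': '2'}
series_b = pd.Series(['one', 'two'], dtype='string')

# Cast back explicitly in case replace() returned an object-dtype Series.
series_b = series_b.replace(to_replace=replacements).astype('string')
print(series_b.dtype)
```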

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020
Added condition for when to apply the na replacement
@chrispe
Contributor

chrispe commented Mar 21, 2020

Here's an explanation of my current solution (chrispe@47f6676). To enable comparison with arrays containing pd.NA, I replace all of the missing values with np.nan:

import numpy as np
from pandas import isna

# Replace every missing value (anything for which isna is True) with numpy.nan.
# x is an array of values.
x = np.where(isna(x), np.nan, x)
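
To see the effect of that substitution in isolation, here is a small self-contained sketch. The object array below is an assumed stand-in for the values behind a string Series; it is not taken from the actual patch.

```python
import numpy as np
import pandas as pd

# Object array holding a pd.NA, standing in for the values of a string Series.
x = np.array(['one', 'two', pd.NA], dtype=object)

# Swap every missing value for np.nan so each element compares to a plain bool.
x = np.where(pd.isna(x), np.nan, x)

# The element-wise comparison now yields one boolean per element.
mask = x == 'one'
print(mask)
```

After the substitution, the NaN entry simply compares as unequal to the string, so the mask has the same length as the input.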

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020
The pd.NA values are replaced with np.nan before comparing the arrays/scalars
chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020
Made improvements based on the tests which failed
chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020
Added change to resolve linting check
chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020
chrispe added a commit to chrispe/pandas that referenced this issue Mar 22, 2020
chrispe added a commit to chrispe/pandas that referenced this issue Mar 22, 2020
chrispe added a commit to chrispe/pandas that referenced this issue Mar 29, 2020
@jreback added this to the 1.1 milestone Apr 7, 2020
chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020
chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020
chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020
chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020
chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020

4 participants