BUG: Replace in `string` series with NA #32621

albertotb · 2020-03-11T12:54:56Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

rep = {'one': '1', 'two': '2'}
a = pd.Series(['one', 'two'], dtype='string')
b = pd.Series(['one', 'two', np.nan])
c = pd.Series(['one', 'two', np.nan], dtype='string')


# A: this works
a.replace(to_replace=rep)

# B: this also works
b.replace(to_replace=rep)

# C: this throws exception
c.replace(to_replace=rep)
# TypeError: Cannot compare types 'ndarray(dtype=object)' and 'str'

Problem description

pandas.Series.replace raises a TypeError when it is given a dict and the series has dtype string and contains <NA>.

Expected Output

I would expect C to output the same as B but with string dtype

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-112-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : es_ES.UTF-8
LOCALE : es_ES.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None


jorisvandenbossche · 2020-03-11T13:07:51Z

@albertotb Thanks for the report!

Slightly related to #32075

It seems this works with scalars:

In [18]: c.replace('one', '1')
Out[18]: 
0       1
1     two
2    <NA>
dtype: string

but not with lists/dicts

chrispe · 2020-03-12T08:47:58Z

Hi, can I try to pick this up?

jorisvandenbossche · 2020-03-12T08:50:18Z

Yes, that would be welcome!

chrispe · 2020-03-12T13:56:05Z

So, here's a few things I've noticed so far:

When a pd.Series is created with dtype='string', the NaN values of the input array (which is to be stored under the series object) are converted to pd.NA.
When a dictionary/list is used to define the replacements (see the variable rep in the first example), a different function is called underneath to apply those replacements. That's why the scalar version shown by @jorisvandenbossche works fine. Apparently, the replace code path is very different in that case (not sure yet why that is, though).
The function used to find where to apply the replacements uses operator.eq. That operator fails to make the element-wise comparison between a value and pd.NA and returns a single scalar (instead of an array of boolean values). That's why the TypeError exception is raised.
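
The NA behaviour behind the third point can be seen in isolation. This is a minimal sketch that uses only the public pd.NA singleton; the variable names are illustrative and this is not the internal pandas code path. Comparing pd.NA with a string yields pd.NA rather than True or False, and that result has no truth value, so it cannot be used directly as a boolean mask entry.

```python
import operator

import pandas as pd

# Comparing the NA singleton with a string propagates NA instead of giving a bool.
scalar_result = operator.eq(pd.NA, "one")
print(scalar_result is pd.NA)

# pd.NA has no truth value, so coercing the comparison result to bool fails.
try:
    bool(scalar_result)
    ambiguous = False
except TypeError:
    ambiguous = True
print(ambiguous)
```

Any comparison routine that expects a plain boolean per element therefore needs to handle the missing positions separately, for example with pd.isna.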

I'm now investigating what's the best way of resolving this.


chrispe · 2020-03-21T09:10:46Z

I also noticed another issue, have a look at this code:

import pandas as pd
import numpy as np

replacements = {'one': '1', 'two': '2'}

series_a = pd.Series(['one', 'two'], dtype='string')
for rep in replacements:
    series_a = series_a.replace(rep, replacements[rep])

series_b = pd.Series(['one', 'two'], dtype='string')
series_b = series_b.replace(to_replace=replacements)

It produces the following output:

In [4]: series_a
Out[4]:
0    1
1    2
dtype: string

In [5]: series_b
Out[5]:
0    1
1    2
dtype: object

Shouldn't they both return a pd.Series with a dtype of string?
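
Until the dict path preserves the dtype, one possible user-side workaround is to apply the replacements one scalar at a time and cast back to the original dtype. The helper name replace_keep_dtype below is hypothetical and not part of pandas; it is only a sketch built on the observation in this thread that scalar replacement works on string series.

```python
import numpy as np
import pandas as pd


def replace_keep_dtype(series, mapping):
    """Hypothetical helper: apply each replacement as a scalar call, then restore the dtype."""
    result = series
    for old, new in mapping.items():
        result = result.replace(old, new)
    # Casting back is a no-op when the dtype was already preserved.
    return result.astype(series.dtype)


s = pd.Series(['one', 'two', np.nan], dtype='string')
out = replace_keep_dtype(s, {'one': '1', 'two': '2'})
print(out.dtype)
```

Note that sequential scalar calls are not equivalent to a single dict call when one replacement value is also a key of the mapping (for example {'a': 'b', 'b': 'c'}), because a later step would then rewrite the output of an earlier one.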


chrispe · 2020-03-21T12:54:39Z

Here's an explanation of my current solution (chrispe@47f6676). In order to enable comparison with arrays containing pd.NA, I make sure to replace all of the missing values with np.nan:

# x is an array of values.
# Replace every missing value (wherever isna is True) with np.nan.
x = np.where(isna(x), np.nan, x)
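
To illustrate the idea outside of the pandas internals, here is a small standalone sketch. It assumes an object-dtype NumPy array as input, and the names x_filled and matches are illustrative only. Once pd.NA is swapped for np.nan, each element compares to the target string as a plain bool, so the result can serve as a mask.

```python
import numpy as np
import pandas as pd

# Object array holding strings and a pd.NA, as a string series would store them.
x = np.array(['one', 'two', pd.NA], dtype=object)

# Swap every missing value for np.nan; np.nan == 'one' is simply False.
x_filled = np.where(pd.isna(x), np.nan, x)

# The element-wise comparison now yields one bool per element.
matches = x_filled == 'one'
print(matches.tolist())
```

The missing position ends up as a non-match, which is the behaviour a replacement mask needs here.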


jorisvandenbossche added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Mar 11, 2020

jorisvandenbossche changed the title ~~Replace in string series with NA~~ BUG: Replace in string series with NA Mar 11, 2020

jorisvandenbossche mentioned this issue Mar 20, 2020

DataFrame.replace fails to replace value when columns are specified and only non-replacement columns contain pd.NA #32838

Closed

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020

BUG: Replace in string series with NA (pandas-dev#32621)

293a504

The pd.NA values are replaced with np.nan before comparing the arrays/scalars

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020

BUG: Replace in string series with NA (pandas-dev#32621)

5672626

Added condition for when to apply the na replacement

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

47f6676

The pd.NA values are replaced with np.nan before comparing the arrays/scalars

chrispe mentioned this issue Mar 21, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621) #32890

Merged

4 tasks

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

2b53200

Made improvements based on the tests which failed

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

7678495

Added change to resolve linting check

chrispe added a commit to chrispe/pandas that referenced this issue Mar 21, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

719369d

Added test for the reported bug

chrispe added a commit to chrispe/pandas that referenced this issue Mar 22, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

e98c7c9

chrispe added a commit to chrispe/pandas that referenced this issue Mar 22, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

fb8d143

chrispe added a commit to chrispe/pandas that referenced this issue Mar 29, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

ca81cb0

jreback added this to the 1.1 milestone Apr 7, 2020

chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

c32a2cc

chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

0a76844

chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

b62ad89

chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

a73e2eb

chrispe added a commit to chrispe/pandas that referenced this issue Apr 8, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

949accc

jreback closed this as completed in #32890 Apr 10, 2020

jreback pushed a commit that referenced this issue Apr 10, 2020

BUG: Fix replacing in string series with NA (#32621) (#32890)

3cca07c

TomAugspurger mentioned this issue May 1, 2020

Performance regression in replace.ReplaceDict.time_replace_series #33920

Closed

chrispe mentioned this issue Jul 11, 2020

Place the calculation of mask prior to the calls of comp in replace_list to improve performance #35229

Merged

5 tasks
