BUG: assigning Series.array / PandasArray to column fails #26390

Closed

jorisvandenbossche opened this issue May 14, 2019 · 6 comments · Fixed by #26417
Labels: Bug, ExtensionArray (Extending pandas with custom dtypes or arrays)
Milestone: 0.25.0

Comments
@jorisvandenbossche (Member)

Assigning a PandasArray (and thus also the result of df['a'].array) of the correct length as a new column fails:

In [1]: df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['a', 'b', 'c', 'd']})                                                                                     

In [2]: df['c'] = pd.array([1, 2, None, 3])                                                                                                                   
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/scipy/pandas/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2672             try:
-> 2673                 return self._engine.get_loc(key)
   2674             except KeyError:

~/scipy/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/scipy/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'c'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/scipy/pandas/pandas/core/internals/managers.py in set(self, item, value)
   1048         try:
-> 1049             loc = self.items.get_loc(item)
   1050         except KeyError:

~/scipy/pandas/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2674             except KeyError:
-> 2675                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2676         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

~/scipy/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/scipy/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'c'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-2-03925b585d9b> in <module>
----> 1 df['c'] = pd.array([1, 2, None, 3])

~/scipy/pandas/pandas/core/frame.py in __setitem__(self, key, value)
   3334         else:
   3335             # set column
-> 3336             self._set_item(key, value)
   3337 
   3338     def _setitem_slice(self, key, value):

~/scipy/pandas/pandas/core/frame.py in _set_item(self, key, value)
   3410         self._ensure_valid_index(value)
   3411         value = self._sanitize_column(key, value)
-> 3412         NDFrame._set_item(self, key, value)
   3413 
   3414         # check if we are modifying a copy

~/scipy/pandas/pandas/core/generic.py in _set_item(self, key, value)
   3232 
   3233     def _set_item(self, key, value):
-> 3234         self._data.set(key, value)
   3235         self._clear_item_cache()
   3236 

~/scipy/pandas/pandas/core/internals/managers.py in set(self, item, value)
   1050         except KeyError:
   1051             # This item wasn't present, just insert at end
-> 1052             self.insert(len(self.items), item, value)
   1053             return
   1054 

~/scipy/pandas/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1152 
   1153         block = make_block(values=value, ndim=self.ndim,
-> 1154                            placement=slice(loc, loc + 1))
   1155 
   1156         for blkno, count in _fast_count_smallints(self._blknos[loc:]):

~/scipy/pandas/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype, fastpath)
   3052         values = DatetimeArray._simple_new(values, dtype=dtype)
   3053 
-> 3054     return klass(values, ndim=ndim, placement=placement)
   3055 
   3056 

~/scipy/pandas/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
   2584             values = np.array(values, dtype=object)
   2585 
-> 2586         super().__init__(values, ndim=ndim, placement=placement)
   2587 
   2588     @property

~/scipy/pandas/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
     74 
     75     def __init__(self, values, placement, ndim=None):
---> 76         self.ndim = self._check_ndim(values, ndim)
     77         self.mgr_locs = placement
     78         self.values = values

~/scipy/pandas/pandas/core/internals/blocks.py in _check_ndim(self, values, ndim)
    111             msg = ("Wrong number of dimensions. values.ndim != ndim "
    112                    "[{} != {}]")
--> 113             raise ValueError(msg.format(values.ndim, ndim))
    114 
    115         return ndim

ValueError: Wrong number of dimensions. values.ndim != ndim [1 != 2]

Note this only fails for PandasArray values, i.e. when a FloatBlock, IntBlock, etc. is created (these expect 2D data), and not when an ExtensionBlock is created as is done for an "actual" ExtensionArray.
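As a contrast (a minimal sketch, not from the original report): assigning an "actual" ExtensionArray of the same values takes the ExtensionBlock path and works, while the default PandasArray result fails as shown above.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['a', 'b', 'c', 'd']})

# Nullable Int64 is a "real" ExtensionArray, stored in an ExtensionBlock,
# so this assignment succeeds:
df['c'] = pd.array([1, 2, None, 3], dtype='Int64')

# Without a dtype, pd.array returns a PandasArray here (on the development
# version being discussed), which gets unwrapped to a 1D ndarray and routed
# to an ObjectBlock, raising the ValueError from the traceback:
df['d'] = pd.array([1, 2, None, 3])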

@jorisvandenbossche jorisvandenbossche added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels May 14, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone May 14, 2019
@shantanu-gontia (Contributor)

This seems to work for me

In [1]: import pandas as pd                                                     

In [2]: pd.__version__                                                          
Out[2]: '0.24.2'

In [3]: df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['a', 'b', 'c', 'd']})       

In [4]: df['c'] = pd.array([1, 2, None, 3])                                     

In [5]: df                                                                      
Out[5]: 
   a  b     c
0  1  a     1
1  2  b     2
2  3  c  None
3  4  d     3

@jorisvandenbossche (Member Author)

Indeed, on 0.24.2 it 'seems' to work, but there we incorrectly store the PandasArray (which I think we have since fixed). It is on master that this is failing.

@shantanu-gontia (Contributor)

shantanu-gontia commented May 14, 2019

The current master implementation does seem to convert any PandasArray to a NumPy array:

if isinstance(values, ABCPandasArray):
    values = values.to_numpy()
if isinstance(dtype, PandasDtype):
    dtype = dtype.numpy_dtype

@jorisvandenbossche (Member Author)

Thanks for looking into it!
Yes, that's indeed what we want. But apparently something else still goes wrong.

@shantanu-gontia (Contributor)

shantanu-gontia commented May 14, 2019

Without converting the PandasArray to a NumPy array, the block type assigned is ExtensionBlock. After conversion to a NumPy array, however, the block type becomes ObjectBlock.

ExtensionBlock is a child of NonConsolidatableMixIn, which sets the _validate_ndim property to False, so no error is raised when the _check_ndim check is performed.

class NonConsolidatableMixIn:
    """ hold methods for the nonconsolidatable blocks """
    _can_consolidate = False
    _verify_integrity = False
    _validate_ndim = False

This is not true for ObjectBlock, which has its _validate_ndim property set to True. Hence, the error is raised.
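To make that concrete, here is a toy model of the check (reconstructed from the traceback above, not the actual pandas source):

import numpy as np

class MiniBlock:
    # ObjectBlock-like; an ExtensionBlock-like class would set this to False
    _validate_ndim = True

    def _check_ndim(self, values, ndim):
        if ndim is None:
            ndim = values.ndim
        if self._validate_ndim and values.ndim != ndim:
            msg = ("Wrong number of dimensions. values.ndim != ndim "
                   "[{} != {}]")
            raise ValueError(msg.format(values.ndim, ndim))
        return ndim

values = np.array([1, 2, None, 3], dtype=object)  # 1D, as returned by to_numpy()
MiniBlock()._check_ndim(values, ndim=2)  # raises the ValueError seen in the traceback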


If we pass a NumPy array instead of a PandasArray, then during the call to DataFrame._set_item,

pandas/pandas/core/frame.py

Lines 3400 to 3413 in e5d15b2

def _set_item(self, key, value):
    """
    Add series to DataFrame in specified column.
    If series is a numpy-array (not a Series/TimeSeries), it must be the
    same length as the DataFrames index or an error will be thrown.
    Series/TimeSeries will be conformed to the DataFrames index to
    ensure homogeneity.
    """
    self._ensure_valid_index(value)
    value = self._sanitize_column(key, value)
    NDFrame._set_item(self, key, value)

the _sanitize_column method, when given a NumPy array, explicitly converts it to 2 dimensions:

return np.atleast_2d(np.asarray(value))
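As a quick illustration of that reshaping (illustrative values, not pandas source):

import numpy as np

value = np.asarray([1, 2, None, 3], dtype=object)  # 1D, shape (4,)
reshaped = np.atleast_2d(value)                     # 2D, shape (1, 4)
print(value.shape, reshaped.shape)                  # (4,) (1, 4)

This (1, n) layout is what the consolidated block types expect, which is why the ndim check passes on the plain-ndarray path.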

This step is left out when we convert the PandasArray to a NumPy array in make_block. Perhaps we can add it after the to_numpy() conversion:

def make_block(values, placement, klass=None, ndim=None, dtype=None,
               fastpath=None):
    # Ensure that we don't allow PandasArray / PandasDtype in internals.
    # For now, blocks should be backed by ndarrays when possible.
    if isinstance(values, ABCPandasArray):
        values = values.to_numpy()
    if isinstance(dtype, PandasDtype):
        dtype = dtype.numpy_dtype

@shantanu-gontia (Contributor)

If we simply add a line values = np.atleast_2d(np.asarray(values)) after Line 3036, i.e. after

if isinstance(values, ABCPandasArray):

then the following test will fail:

def test_make_block_no_pandas_array():
    # https://github.com/pandas-dev/pandas/pull/24866
    arr = pd.array([1, 2])

    # PandasArray, no dtype
    result = make_block(arr, slice(len(arr)))
    assert result.is_integer is True
    assert result.is_extension is False

    # PandasArray, PandasDtype
    result = make_block(arr, slice(len(arr)), dtype=arr.dtype)
    assert result.is_integer is True
    assert result.is_extension is False

    # ndarray, PandasDtype
    result = make_block(arr.to_numpy(), slice(len(arr)), dtype=arr.dtype)
    assert result.is_integer is True
    assert result.is_extension is False

However, given the bug at hand, and the new implementation that handles a PandasArray by converting it to a NumPy array, is this test still valid?


Another solution could be to convert the PandasArray to a NumPy array in the _sanitize_column method, maybe here:

pandas/pandas/core/frame.py

Lines 3623 to 3625 in e5d15b2

# return internal types directly
if is_extension_type(value) or is_extension_array_dtype(value):
    return value

or add a special case for ABCPandasArray.
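For illustration only (a sketch of the effect such a conversion would have, not part of the original discussion): unwrapping the PandasArray to a plain ndarray at the call site already takes the regular ndarray path in _sanitize_column, and the assignment succeeds.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['a', 'b', 'c', 'd']})

# Unwrapping to a plain ndarray up front means _sanitize_column reshapes it
# with np.atleast_2d as usual, so no ndim error is raised:
df['c'] = np.asarray(pd.array([1, 2, None, 3]))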
