BUG/ENH: Bad columns dtype when creating empty DataFrame #22858

Closed
lowerthansound opened this issue Sep 27, 2018 · 3 comments · Fixed by #22963
Labels
Dtype Conversions Unexpected or buggy dtype conversions MultiIndex

lowerthansound commented Sep 27, 2018

Code Sample

>>> df = pd.DataFrame(columns=list('ABC'), dtype='int64')
>>> df
Empty DataFrame
Columns: [A, B, C]
Index: []
>>> df.dtypes
A    float64
B    float64
C    float64
dtype: object

Problem description

When creating a DataFrame with no rows, the presence of a dtype argument may convert the columns into float64. The problem does not happen if the DataFrame has one or more rows:

>>> df = pd.DataFrame([[1, 2, 3]], columns=list('ABC'), dtype='int64')
>>> df
   A  B  C
0  1  2  3
>>> df.dtypes
A    int64
B    int64
C    int64
dtype: object

Expected Output

>>> df = pd.DataFrame(columns=list('ABC'), dtype='int64')
>>> df.dtypes
A    int64
B    int64
C    int64
dtype: object

Output of pd.show_versions()

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.5-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.8.0
pip: 10.0.1
setuptools: 40.2.0
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.9.0
xarray: 0.10.8
IPython: 6.5.0
sphinx: 1.7.9
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.0
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

@sinhrks sinhrks added Dtype Conversions Unexpected or buggy dtype conversions MultiIndex labels Oct 1, 2018
JustinZhengBC (Contributor) commented

This seems to be intended behaviour, as demonstrated by the following test in pandas/tests/frame/test_constructor.py::TestDataFrameConstructors::test_constructor_corner:

df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=int)
assert df.values.dtype == np.dtype('float64')

The code responsible for this behaviour is found in pandas/core/dtypes/cast.py, on line 1223. Commenting out these two lines causes the above test, and no others, to fail in the pytest suite.

if is_integer_dtype(dtype) and isna(value):
    dtype = np.float64

lowerthansound (Author) commented Oct 3, 2018

I don't think this is intended behavior; it looks more like a rough edge produced by the code you mentioned.

In the issue sample, the columns are empty, so there is no need to upcast to float:

>>> df = pd.DataFrame(columns=list('ABC'), dtype='int64')
>>> df
Empty DataFrame
Columns: [A, B, C]
Index: []

In the test case you mentioned, though, the DataFrame must be filled with NaN and therefore float is needed:

>>> df = pd.DataFrame(index=range(10), columns=['a', 'b'], dtype=int)
>>> df
    a   b
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
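
For illustration, here is a minimal NumPy-only sketch of why a NaN-filled column needs a float dtype. It does not touch pandas internals; the variable names are made up for the example.

```python
import numpy as np

# An int64 array has no way to represent NaN, so the assignment is rejected.
int_col = np.zeros(3, dtype="int64")
try:
    int_col[0] = np.nan
    int_accepts_nan = True
except ValueError:
    int_accepts_nan = False

# A float64 array stores NaN without any problem.
float_col = np.zeros(3, dtype="float64")
float_col[0] = np.nan

# A zero-length int64 array has no slots to fill, so NaN never comes up.
empty_col = np.empty(0, dtype="int64")
print(int_accepts_nan, float_col, empty_col.dtype)
```

The last array is the situation in the issue sample: with no rows there is nothing to fill, so the requested integer dtype could be kept as-is.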

JustinZhengBC (Contributor) commented

Good point. In theory, it could be fixed by casting int to float only when an index (such as the lrange in the test above) is specified, so that there are rows to fill with NaN. I can try it out later and submit a PR if the tests pass.
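
A rough sketch of that idea, written as a standalone function rather than as a patch to pandas. The name `resolve_fill_dtype` and its `length` parameter are hypothetical, and this is not necessarily how the linked PR implements the fix; it only mirrors the two-line check quoted earlier and adds a guard on the number of rows.

```python
import numpy as np


def resolve_fill_dtype(dtype, value, length):
    """Hypothetical helper: pick the dtype for a column of `length` rows
    that will be filled with the scalar `value`."""
    dtype = np.dtype(dtype)
    value_is_nan = isinstance(value, float) and np.isnan(value)
    # Upcast to float64 only when at least one row must actually hold NaN.
    if length > 0 and np.issubdtype(dtype, np.integer) and value_is_nan:
        return np.dtype("float64")
    return dtype


# No rows: the requested integer dtype is kept.
print(resolve_fill_dtype("int64", np.nan, 0))
# Ten rows to fill with NaN: the dtype is upcast, as in the existing test.
print(resolve_fill_dtype(int, np.nan, 10))
```

With a guard like this, the existing `test_constructor_corner` expectation (float64 when an index of ten rows is given) would be unchanged, while the empty frame from the issue sample would keep int64.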
