BUG: OverflowError on to_json with numbers larger than sys.maxsize #34395

kinghuang · 2020-05-26T21:36:29Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import sys
from pandas.io.json import dumps

dumps(sys.maxsize)
dumps(sys.maxsize + 1)

Problem description

The pandas JSON dumper doesn't seem to handle integers larger than sys.maxsize (the largest value that fits in a signed machine word). I have a DataFrame that I'm trying to write with to_json, but it fails with OverflowError: int too big to convert. The frame contains some numbers larger than 9223372036854775807.
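To make the boundary concrete, here is a minimal sketch of the signed 64-bit limit that the failing value crosses. It uses only the standard library; `fits_in_int64` is a hypothetical helper name, not part of pandas or ujson, and the bounds are written out explicitly because sys.maxsize only equals 2**63 - 1 on a 64-bit build.

```python
import struct

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1  # same value as sys.maxsize on a 64-bit CPython build


def fits_in_int64(value: int) -> bool:
    """Return True if value can be packed into a signed 64-bit integer."""
    try:
        struct.pack("<q", value)  # "q" is a signed 8-byte integer
    except (struct.error, OverflowError):
        return False
    return True


at_limit = fits_in_int64(INT64_MAX)        # largest value a C long long holds
past_limit = fits_in_int64(INT64_MAX + 1)  # the value from the report
```

Any encoder that converts Python ints to a fixed-width C integer has to deal with the second case somehow; Python's own int type has no such limit.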

Passing a default_handler doesn't help. It doesn't get called for the error.

>>> dumps(sys.maxsize)
'9223372036854775807'
>>> dumps(sys.maxsize + 1, default_handler=str)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too big to convert

Expected Output

Python's built-in json module handles large numbers without issues.

>>> import json
>>> json.dumps(sys.maxsize)
'9223372036854775807'
>>> json.dumps(sys.maxsize+1)
'9223372036854775808'

I expect Pandas to be able to output large numbers to JSON. An option to use the built-in json module instead of ujson would be fine.
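As a stopgap along the lines of the last sentence, one could hand plain Python objects to the built-in json module directly. The sketch below assumes the rows are already plain dicts with Python ints, roughly the shape that DataFrame.to_dict(orient="records") produces; the records and the `records_to_json` helper are hypothetical and only illustrate the idea.

```python
import json

# Hypothetical rows; values must be plain Python ints for this to apply.
records = [
    {"id": 1, "value": 9223372036854775807},
    {"id": 2, "value": 9223372036854775808},  # one past the 64-bit limit
]


def records_to_json(rows):
    """Serialize a list of row dicts with the standard-library encoder."""
    return json.dumps(rows, separators=(",", ":"))


payload = records_to_json(records)
restored = json.loads(payload)  # round-trips without losing precision
```

This bypasses the pandas-specific handling in to_json (dates, NaN, orient options), so it is only a rough substitute for simple frames.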

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.76-linuxkit
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : None
pyxlsb : None
s3fs : 0.4.2
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


arw2019 · 2020-05-27T03:03:58Z

I checked that this bug exists in the master version.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 62c7dd3
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-101-generic
Version : #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+1681.g62c7dd3e7
numpy : 1.17.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0.post20200518
Cython : 0.29.19
pytest : 5.4.2
hypothesis : 5.15.1
sphinx : 3.0.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.49.1

arw2019 · 2020-05-27T04:13:11Z

I dug a little and tracked the problem down to the version of dumps specified in pandas._libs. The following reproduces the same bug as above:

import sys
import pandas as pd

pd._libs.json.dumps(sys.maxsize)
pd._libs.json.dumps(sys.maxsize + 1)

I'm stuck on finding the actual code for dumps inside _libs. I'm happy to keep going with this, though, if somebody can give me a prod in the right direction!

kinghuang · 2020-05-27T04:24:47Z

I think the implementation comes from the embedded version of ultrajson in pandas/_libs/src/ujson. I'm not sure how it's vendored or gets linked up, though.

arw2019 · 2020-05-27T04:57:46Z

Thanks!

It looks to me like the code which does the encoding is in pandas/_libs/src/ujson/lib/ultrajsonenc.c and it gets linked up in pandas/_libs/src/ujson/python/objToJSON.c.

arw2019 · 2020-05-27T05:19:15Z

I guess that there is no way to fix the problem without messing with the ultrajson source code?

In pandas/io/json/_json.py, dumps is just an alias for ultrajson's dumps function, so I think resolving the current bug requires changes to ultrajson.

import pandas._libs.json as json                       # line 10
dumps = json.dumps                                     # line 28

WillAyd · 2020-05-27T08:07:32Z

Related to #20599. This isn't really feasible to do in the ujson source, so we would probably have to catch the error and coerce to a serializable type.

arw2019 · 2020-05-27T18:21:04Z

@WillAyd Thanks for this!

Reading through that thread it seems like a solution to this issue would be to wrap ultrajson's dumps and catch the OverflowError inside pandas/io/json/_json.py.

So, instead of:

dumps = json.dumps   # line 28

we would do something like this:

def dumps(obj, default_handler=str, **kwargs):
    try:
        return json.dumps(obj, **kwargs)
    except OverflowError:
        return json.dumps(default_handler(obj), **kwargs)

This fixes the original error. I checked that with this change the code still passes the unit test in pandas/tests/io/json/test_ujson.py - so the rewrite doesn't seem to break anything.
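To see what a wrapper of this shape actually returns, here is a self-contained sketch. `fast_dumps` is a hypothetical stand-in for the C encoder (it is not the pandas function): it rejects any int outside the signed 64-bit range and otherwise defers to the standard library.

```python
import json

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def _has_oversized_int(obj):
    """Return True if obj contains an int outside the signed 64-bit range."""
    if isinstance(obj, bool):
        return False
    if isinstance(obj, int):
        return not (INT64_MIN <= obj <= INT64_MAX)
    if isinstance(obj, dict):
        return any(_has_oversized_int(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return any(_has_oversized_int(v) for v in obj)
    return False


def fast_dumps(obj):
    """Hypothetical stand-in for the C encoder; mimics its overflow failure."""
    if _has_oversized_int(obj):
        raise OverflowError("int too big to convert")
    return json.dumps(obj, separators=(",", ":"))


def dumps(obj, default_handler=str):
    """Wrapper in the shape proposed above: retry with default_handler(obj)."""
    try:
        return fast_dumps(obj)
    except OverflowError:
        return fast_dumps(default_handler(obj))


scalar = dumps(2**63)       # '"9223372036854775808"', a quoted JSON string
nested = dumps([1, 2**63])  # '"[1, 9223372036854775808]"', whole list as text
```

Two things stand out under these assumptions: the large value comes back as a JSON string rather than a number, and a container holding one oversized int is stringified wholesale. The commit list further down records the wrapper being removed in favor of a JT_BIGNUM case in the encoder.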

I'm happy to keep working on this if this solution isn't quite right!

Once we've settled on the fix, would the next steps be these?

add the testcase to pandas/tests/io/json/test_ujson.py
submit a pull request

WillAyd · 2020-05-28T02:09:16Z

Yeah, if you want to add a test case and submit a pull request, we can go from there. You will also want to check the performance benchmarks for JSON, which you'll find more info on here:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

…-dev#34395)

* BUG: overflow on to_json with numbers larger than sys.maxsize
* TST: overflow on to_json with numbers larger than sys.maxsize (#34395)
* DOC: update with issue #34395
* TST: removed unused import
* ENH: added case JT_BIGNUM to encode
* ENH: added JT_BIGNUM to JSTYPES
* BUG: changed error for ints>sys.maxsize into JT_BIGNUM
* ENH: removed debug statements
* BUG: removed dumps wrapper
* removed bigNum from TypeContext
* TST: fixed bug in the test
* added pointer to string rep converter for BigNum
* TST: removed ujson.loads from the test
* added getBigNumStringValue
* added code to JT_BIGNUM handler by analogy with JT_UTF8
* TST: update pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd
* added Object_getBigNumStringValue to pyEncoder
* added skeletal code for Object_GetBigNumStringValue
* completed Object_getBigNumStringValue using PyObject_Repr
* BUG: changed Object_getBigNumStringValue
* improved Object_getBigNumStringValue some more
* update getBigNumStringValue argument
* corrected Object_getBigNumStringValue
* more fixes to Object_getBigNumStringValue
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c
* Update pandas/_libs/src/ujson/python/objToJSON.c
* updated pyEncoder for JT_BIGNUM
* updated pyEncoder
* moved getBigNumStringValue to pyEncoder
* fixed declaration of Object_getBigNumStringValue
* fixed Object_getBigNumStringValue
* catch overflow error with PyLong_AsLongLongAndOverflow
* remove unnecessary error check
* added shortcircuit for error check
* simplify int overflow error catching Co-authored-by: William Ayd
* Update long int test in pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd
* removed tests expecting numeric overflow
* remove underscore from overflow Co-authored-by: William Ayd
* removed underscores from _overflow everywhere
* fixed small typo
* fix type of exc
* deleted numeric overflow tests
* remove extraneous condition in if statement Co-authored-by: William Ayd
* remove extraneous condition in if statement Co-authored-by: William Ayd
* change _Bool into int Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/lib/ultrajsonenc.c Co-authored-by: William Ayd
* allocate an extra byte in Object_getBigNumStringValue Co-authored-by: William Ayd
* allocate an extra byte in Object_getBigNumStringValue Co-authored-by: William Ayd
* reinstate RESERVE_STRING(szlen) in JT_BIGNUM case
* replaced (private) with (public) in whatnew
* release bytes in Object_endTypeContext
* in JT_BIGNUM change if+if into if+else if
* added reallocation of bigNum_bytes
* removed bigNum_bytes
* added to_json test for ints>sys.maxsize
* Use python malloc to match PyObject_Free in endTypeContext Co-authored-by: William Ayd
* TST: added manually constructed strs to compare encodings
* fixed styling to minimize diff with master
* fixed styling
* fixed conflicts with master
* fix styling to minimize diff
* fix styling to minimize diff
* fixed styling
* added negative nigNum to test_to_json_large_numers
* added negative nigNum to test_to_json_large_numers
* Update pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd
* fixe test_to_json_for_large_nums for -ve
* TST: added xfail for ujson.encode with long int input
* TST: fixed variable names in test_to_json_large_numbers
* TST: added xfail test for json.decode Series with long int
* TST: added xfail test for json.decode DataFrame with long int
* BENCH: added benchmarks for long ints Co-authored-by: William Ayd
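The commit messages above suggest the eventual fix detects the overflow when converting the Python int (PyLong_AsLongLongAndOverflow) and then writes the number's string representation under a new JT_BIGNUM type. The following is only a Python model of that idea, not the C code from the pull request; `classify_int` and `encode` are hypothetical names, and JT_LONG is assumed here as the tag for the ordinary 64-bit path.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def classify_int(value):
    """Pick an encoder tag for an int (JT_BIGNUM is named in the commit list)."""
    if INT64_MIN <= value <= INT64_MAX:
        return "JT_LONG", value
    # Too wide for a C long long: carry the decimal digits as text instead.
    return "JT_BIGNUM", repr(value)


def encode(obj):
    """Toy encoder for ints and lists only; anything else is rejected."""
    if isinstance(obj, bool):
        raise TypeError("bool is outside this sketch")
    if isinstance(obj, int):
        tag, payload = classify_int(obj)
        # Both paths emit bare digits, so the JSON value stays a number.
        return str(payload) if tag == "JT_LONG" else payload
    if isinstance(obj, list):
        return "[" + ",".join(encode(item) for item in obj) + "]"
    raise TypeError(f"unsupported type: {type(obj).__name__}")


encoded = encode([1, 2**63, -(2**63) - 1])
```

Unlike a catch-and-stringify wrapper, this keeps oversized values unquoted and handles them element by element, including negative values below the 64-bit minimum.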


kinghuang added the Bug and Needs Triage labels May 26, 2020

dsaxton added the IO JSON label May 27, 2020

mroeschke removed the Needs Triage label May 27, 2020

arw2019 added a commit to arw2019/pandas that referenced this issue May 30, 2020

TST: overflow on to_json with numbers larger than sys.maxsize (pandas…

6d2f8bd

…-dev#34395)

arw2019 added a commit to arw2019/pandas that referenced this issue May 30, 2020

DOC: update with issue pandas-dev#34395

4fc5b87

arw2019 mentioned this issue May 30, 2020

fix to_json for numbers larger than sys.maxsize #34473

Merged

6 tasks

jreback added this to the 1.1 milestone Jun 24, 2020

WillAyd closed this as completed in #34473 Jun 24, 2020
