BUG: fixing to allow unpickling of PY3 pickles from PY2 #14275
It looks like the Azure checks are complaining about code that wasn't modified? Am I reading this wrong?
I just checked the Travis failure. Maybe one or so is, but it is failing due to compiler warnings/errors that need to be fixed up.
Thanks! I updated the code and now I am seeing this problem in the
Any ideas about that one?
That build uses a nightly PyPy build, and last night's build did not succeed. That failure can be ignored.
This needs a test. You can save the target with python3 and add the binary bytes to the test. Maybe I didn't understand the issue, but I think the example at the top of the PR will not work, since numpy now stores object arrays with pickle protocol 3, which is not available on python2. The original issue was about unicode field names; is that what you wanted to fix?
Sure, will add a test. About the
Seems there was an issue with the PR check
Ahh, now the example is much clearer, thanks.
I kicked this off to 1.16.6.
I modified the example to use a context manager.
5cb5f63
to
e19333c
Compare
Just pushed the refactored C code. It seems there are some issues with the checks again, unless I am reading them wrong.
@mattip Hi! Sorry to insist, is there a way for me to restart the checks? I believe the failures are not related to my changes.
This looks good to me. Anyone else want to take a look?
codecov is complaining since we only collect statistics on python3, and this is a fix for python2
    else if (PyUnicode_Check(name)) {
        // The field names of a structured dtype were pickled in PY3 as unicode strings
        // so, to unpickle them in PY2, we need to convert them to PY2 strings
        new_name = PyUnicode_AsEncodedString(name, "utf-8", "strict");
Doesn't this result in an un-round-trippable field name? I think this encoding needs to match line 2795, else the following will fail:
- Write in py3 with non-ascii field names
- Read in py2
- Write back out in py2
- Read in py3
Choosing `ascii` as an encoding might be safest here, simply because it's the intersection of latin1 and utf8, both of which we use elsewhere in compatibility code.
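To make the round-trip concern concrete, here is a minimal Python 3 sketch. The `roundtrip` helper is hypothetical: it only imitates encoding a field name on the py2 side and decoding it again on a later py3 load, and it is not NumPy's actual code path.

```python
def roundtrip(name, encode_with, decode_with):
    """Encode a unicode field name, then decode it again.

    Hypothetical helper for illustration only, not NumPy code.
    """
    return name.encode(encode_with).decode(decode_with)


# ASCII-only names survive any mix of ascii, latin1 and utf-8.
for enc in ("ascii", "latin1", "utf-8"):
    for dec in ("ascii", "latin1", "utf-8"):
        assert roundtrip("SPOT", enc, dec) == "SPOT"

# A non-ASCII name is only safe when both sides use the same codec.
name = "caf\u00e9"
assert roundtrip(name, "utf-8", "utf-8") == name
assert roundtrip(name, "utf-8", "latin1") != name  # silently corrupted

# Encoding with ascii rejects non-ASCII names up front instead.
try:
    name.encode("ascii")
except UnicodeEncodeError:
    ascii_rejected = True
else:
    ascii_rejected = False
```

With `ascii`, a mismatch between the write and read codecs cannot silently change a name; the cost is that non-ASCII field names raise an error rather than being converted.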
@fersarr - thoughts? Can we construct another test that will
- add another bytestring result from the result of step 3
- on python3 load the bytestring and compare it to the original
- on python2 check that the result of the current test matches this second string
Hey Hey. Sure, I will give that a go as soon as I am back from holiday!
@eric-wieser @mattip I just pushed the new changes, hopefully now we are okay to merge. Thanks for all the help.
numpy/tests/test_reloading.py
Outdated
    '''

    loaded = pickle.loads(saved_pickle_from_py2)
    assert loaded == expected_data
Instead of running this conditionally on py3 only, should we change the test to the following?
    if py3:
        assert loads(saved_pickle_from_py2) == expected_data
    else:
        # check that our string above is what we claim on py2
        assert dumps(expected_data) == saved_pickle_from_py2
done! thanks. I used the idea for both tests
Hmm, @eric-wieser, it seems that approach does not work on Windows and PyPy:
PyPy error
https://dev.azure.com/numpy/numpy/_build/results?buildId=5698&view=logs
=================================== FAILURES ===================================
_____________ test_py3_can_load_py2_pickle_with_dtype_field_names ______________
def test_py3_can_load_py2_pickle_with_dtype_field_names():
# gh-2407 and PR #14275
# Roundtrip: Py3 should be able to load a pickle that was created in PY2
# after loading the saved_pickle (from PY3) in the test named
# 'test_py2_can_load_py3_pickle_with_dtype_field_names'
import numpy as np
expected_dtype = np.dtype([('SPOT', np.float64)])
expected_data = np.array([(6.0)], dtype=expected_dtype)
# Pickled under Python 2.7.16 with protocol=2 after it was loaded
# by test 'test_py2_can_load_py3_pickle_with_dtype_field_names'
saved_pickle_from_py2 = b'''\
\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x01cnumpy\nndarray\n\
q\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnumpy\ndtype\nq\x04U\x02\
V8K\x00K\x01\x87Rq\x05(K\x03U\x01|NU\x04SPOTq\x06\x85q\x07}q\x08h\x06h\
\x04U\x02f8K\x00K\x01\x87Rq\t(K\x03U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\
\xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\
\x00\x18@tb.\
'''
if sys.version_info[0] >= 3: # PY3
assert pickle.loads(saved_pickle_from_py2) == expected_data
else:
# check that the string above is what we claim on PY2
> assert pickle.dumps(expected_data, protocol=2) == saved_pickle_from_py2
E AssertionError: assert '\x80\x02cnum...q\x17tq\x18b.' == '\x80\x02cnump...0\x00\x18@tb.'
E \x80\x02cnumpy.core.multiarray
E _reconstruct
E q\x01cnumpy
E ndarray
E - q\x02K\x00\x85q\x03U\x01bq\x04\x87q\x05Rq\x06(K\x01K\x01\x85q\x07cnumpy
E ? ----- ----- ----- ^ -----
E + q\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnumpy...
E
E ...Full output truncated (8 lines hidden), use '-vv' to show
expected_data = array([(6.,)], dtype=[('SPOT', '<f8')])
expected_dtype = dtype([('SPOT', '<f8')])
np = <module 'numpy' from '/home/vsts/work/1/s/build/testenv/site-packages/numpy/__init__.pyc'>
saved_pickle_from_py2 = '\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x01cnumpy\nndarray\nq\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnum...U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\x00\x18@tb.'
numpy/tests/test_reloading.py:96: AssertionError
- generated xml file: /home/vsts/work/1/s/build/testenv/site-packages/junit/test-results.xml -
Windows error
https://dev.azure.com/numpy/numpy/_build/results?buildId=5698&view=logs
_____________ test_py3_can_load_py2_pickle_with_dtype_field_names _____________
def test_py3_can_load_py2_pickle_with_dtype_field_names():
# gh-2407 and PR #14275
# Roundtrip: Py3 should be able to load a pickle that was created in PY2
# after loading the saved_pickle (from PY3) in the test named
# 'test_py2_can_load_py3_pickle_with_dtype_field_names'
import numpy as np
expected_dtype = np.dtype([('SPOT', np.float64)])
expected_data = np.array([(6.0)], dtype=expected_dtype)
# Pickled under Python 2.7.16 with protocol=2 after it was loaded
# by test 'test_py2_can_load_py3_pickle_with_dtype_field_names'
saved_pickle_from_py2 = b'''\
\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x01cnumpy\nndarray\n\
q\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnumpy\ndtype\nq\x04U\x02\
V8K\x00K\x01\x87Rq\x05(K\x03U\x01|NU\x04SPOTq\x06\x85q\x07}q\x08h\x06h\
\x04U\x02f8K\x00K\x01\x87Rq\t(K\x03U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\
\xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\
\x00\x18@tb.\
'''
if sys.version_info[0] >= 3: # PY3
assert pickle.loads(saved_pickle_from_py2) == expected_data
else:
# check that the string above is what we claim on PY2
> assert pickle.dumps(expected_data, protocol=2) == saved_pickle_from_py2
E AssertionError: assert '\x80\x02cnum...0\x00\x18@tb.' == '\x80\x02cnump...0\x00\x18@tb.'
E Skipping 90 identical leading characters in diff, use -v to show
E - \x03(K\x01\x8a\x01\x01\x85cnumpy
E ? ^^^^^^^^
E + \x03(K\x01K\x01\x85cnumpy
E ? ^
E dtype
E q\x04U\x02V8K\x00K\x01\x87Rq\x05(K\x03U\x01|NU\x04SPOTq\x06\x85q\x07}q\x08h\x06h\x04U\x02f8K\x00K\x01\x87Rq\t(K\x03U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\x00\x18@tb.
expected_data = array([(6.,)], dtype=[('SPOT', '<f8')])
expected_dtype = dtype([('SPOT', '<f8')])
np = <module 'numpy' from 'C:\hostedtoolcache\windows\Python\2.7.16\x64\lib\site-packages\numpy\__init__.pyc'>
saved_pickle_from_py2 = '\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x01cnumpy\nndarray\nq\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnum...U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\x00\x18@tb.'
C:\hostedtoolcache\windows\Python\2.7.16\x64\lib\site-packages\numpy\tests\test_reloading.py:96: AssertionError
------- generated xml file: D:\a\1\s\build\test\junit\test-results.xml --------
=========================== short test summary info ===========================
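In the Windows diff above, the two streams differ where one has `K\x01` and the other has `\x8a\x01\x01`. In pickle protocol 2 these are the `BININT1` and `LONG1` opcodes, and both can encode the integer 1, so two pickles can differ byte for byte and still load to equal values. The sketch below uses hand-written pickles of a bare integer (not the actual array pickles from the test) and the standard-library `pickletools` module to list the opcodes. Python 2 writes `int` with `BININT1` and `long` with `LONG1`, so one plausible but unverified explanation is that the value is a `long` on that platform.

```python
import pickle
import pickletools

# Hand-written protocol-2 pickles of the integer 1 (illustrative only).
as_binint1 = b"\x80\x02K\x01."       # BININT1: a single unsigned byte
as_long1 = b"\x80\x02\x8a\x01\x01."  # LONG1: a length byte, then the value


def opcode_names(data):
    """Return the opcode names found in a pickle stream, in order."""
    return [op.name for op, _arg, _pos in pickletools.genops(data)]


# The byte streams differ, yet both load to the same value.
assert as_binint1 != as_long1
assert pickle.loads(as_binint1) == pickle.loads(as_long1) == 1

print(opcode_names(as_binint1))  # ['PROTO', 'BININT1', 'STOP']
print(opcode_names(as_long1))    # ['PROTO', 'LONG1', 'STOP']
```

Running `pickletools.dis` on the two real streams would show the same kind of opcode-level difference without decoding the bytes by hand.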
1028f4c
to
4418998
Compare
numpy/tests/test_reloading.py
Outdated
        assert pickle.loads(saved_pickle_from_py2) == expected_data
    else:
        # check that the string above is what we claim on PY2
        assert pickle.dumps(expected_data, protocol=2) == saved_pickle_from_py2
It seems the actual byte stream produced on CPython 2.7 on Linux is not the same as the one produced on Windows 2.7, nor the one produced on PyPy 2.7. Rather than try to diagnose exactly why the stream is different, perhaps we could only assert they are identical when `sys.platform.startswith('linux') and not IS_PYPY`, after `from numpy.testing import IS_PYPY`. I verified that `saved_pickle_from_py2` loads correctly on those platforms, even though the roundtrip is not byte-for-byte accurate. I don't want to dive into decoding the byte stream to find the difference.
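A rough sketch of that gating, simplified from the test under review: the `check_saved_pickle` helper is hypothetical, and `IS_PYPY` is computed locally as a stand-in for `numpy.testing.IS_PYPY` so the snippet runs without NumPy installed.

```python
import pickle
import platform
import sys

# Stand-in for `from numpy.testing import IS_PYPY` (an assumption made so
# that this sketch does not need NumPy).
IS_PYPY = platform.python_implementation() == "PyPy"


def check_saved_pickle(saved, expected, protocol=2):
    """Hypothetical helper: always check loading, compare bytes only where stable."""
    # Loading must work on every platform.
    assert pickle.loads(saved) == expected
    # The exact byte stream is only compared on CPython under Linux.
    if sys.platform.startswith("linux") and not IS_PYPY:
        assert pickle.dumps(expected, protocol=protocol) == saved


# Plain Python data stands in for the structured array used in the real test.
expected = ("SPOT", 6.0)
check_saved_pickle(pickle.dumps(expected, protocol=2), expected)
```

The load check still runs everywhere, so the platform condition narrows only the byte-for-byte comparison.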
Sounds good. Done!
@mattip @eric-wieser any other comments? can we merge?
Fixing to allow unpickling of PY3 pickles from PY2, particularly when the pickle contains structured dtypes with field names. See gh-2407.
@eric-wieser All your suggestions are done now :)
Thanks @fersarr
Fixing to allow unpickling of PY3 pickles from PY2, particularly when the pickle contains structured dtypes with field names. Only applying to branch `1.16`, as `1.17` and above are PY3 only.
Example from gh-2407 by jmlarson1:
Write the pickle in `PY3`
Reading the pickle from `PY2`: