BUG: fixing to allow unpickling of PY3 pickles from PY2 #14275

Merged
merged 1 commit into from
Oct 20, 2019

@fersarr fersarr commented Aug 14, 2019

This fix allows PY2 to unpickle pickles written by PY3, particularly when the pickle contains structured dtypes with field names. It applies only to the 1.16 branch, since 1.17 and above are PY3 only.

Example from gh-2407 by jmlarson1:

Write the pickle in PY3

import pickle
import numpy as np
my_datatype = np.dtype([('SPOT', np.float64)])
my_data = np.array([(6.0)], dtype=my_datatype)

# Pickle the data using maximum supported protocol in Py2
with open('py3_out', 'wb') as fid:
    pickle.dump(my_data, fid, protocol=2)

Reading the pickle from PY2:

import pickle
with open('py3_out', 'rb') as fid:
    loaded = pickle.load(fid)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 393, in load
    return format.read_array(fid)
  File "/path/lib/python2.7/dist-packages/numpy/lib/format.py", line 602, in read_array
    array = pickle.load(fp)
ValueError: non-string names in Numpy dtype unpickling
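For context, here is a minimal sketch of why the field name ends up as a non-native string on PY2. It uses only the standard library (no NumPy), and the helper name `string_opcodes` is hypothetical. Run on Python 3, it shows that a text name is written with a unicode opcode under protocol 2, whereas a PY2 native `str` is written with a byte-string opcode (the hand-written stream below mirrors the `U\x04SPOT` fragment that appears in the PY2 pickle quoted later in this thread).

```python
import pickle
import pickletools

# Opcodes that carry string payloads in protocols 0-4.
STRING_OPCODES = {
    "STRING", "BINSTRING", "SHORT_BINSTRING",
    "UNICODE", "BINUNICODE", "SHORT_BINUNICODE",
}


def string_opcodes(stream):
    """Return the names of the string-carrying opcodes in a pickle stream."""
    return [op.name for op, arg, pos in pickletools.genops(stream)
            if op.name in STRING_OPCODES]


# A field name pickled by Python 3 with protocol 2.
py3_stream = pickle.dumps("SPOT", protocol=2)

# The same name as a Python 2 native str would be written
# (hand-written for illustration, not produced by this script).
py2_stream = b"\x80\x02U\x04SPOTq\x00."

print(string_opcodes(py3_stream))
print(string_opcodes(py2_stream))
```

On PY2 the unicode opcode unpickles to a `unicode` object rather than a `str`, which is presumably what the dtype unpickling code rejects with the error above.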

@fersarr
Author

fersarr commented Aug 14, 2019

It looks like the azure checks are complaining about code that wasn't modified? Am I reading this wrong?

@seberg
Member

seberg commented Aug 14, 2019

I just checked the Travis failure. Maybe one or so of the complaints is unrelated, but the build is failing due to compiler warnings/errors that need to be fixed.

@fersarr
Author

fersarr commented Aug 15, 2019

Thanks! I updated the code, and now I am seeing this problem in the Linux_PyPy checks:

++ tar --strip-components=1 -xf ../pypy.tar.bz2
tar: This does not look like a tar archive

bzip2: Compressed file ends unexpectedly;
	perhaps it is corrupted?  *Possible* reason follows.
bzip2: Inappropriate ioctl for device
	Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

tar: Child returned status 2
tar: Error is not recoverable: exiting now
##[error]Bash exited with code '1'.

Any ideas about that one?

@mattip
Member

mattip commented Aug 15, 2019

That build uses a nightly PyPy build, and last night's build did not succeed. That failure can be ignored.

@mattip
Member

mattip commented Aug 15, 2019

This needs a test. You can save the target with python3, add the binary bytes to the test, s = b'\x93NUMPY\x01\x00v\x00{\'descr\': \'|O\', \' ... then wrap the bytes in an np.load(io.BytesIO(s))

Maybe I didn't understand the issue, but I think the example at the top of the PR will not work since numpy now stores object arrays with pickle protocol 3, which is not available on python2. The original issue was about unicode field names, is that what you wanted to fix?
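The test pattern suggested above can be sketched roughly as follows. This is a stand-in that uses plain `pickle` instead of `np.load` so it runs without NumPy; the `record` payload is hypothetical, and a real test would paste the literal bytes rather than generate them inline. It also checks the protocol header, since Python 2 only understands protocols 0 through 2.

```python
import io
import pickle

# Hypothetical payload standing in for the array saved on Python 3.
record = {"SPOT": 6.0}

# In a real test these bytes would be a pasted literal (s = b'...').
saved = pickle.dumps(record, protocol=2)

# Protocol 2 and later streams begin with the PROTO opcode (0x80)
# followed by the protocol number.
header = saved[:2]

# Wrap the bytes in a file-like object so a file-based loader can read them.
loaded = pickle.load(io.BytesIO(saved))
```

With NumPy, the same idea would wrap the saved `.npy` bytes in `io.BytesIO` and pass that to `np.load`, as described in the comment.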

@fersarr
Author

fersarr commented Aug 19, 2019

Sure, will add a test.

About the protocol, you are right, I updated the example. I had copied it from gh-2407 but now I added the actual example I used for testing.

@fersarr
Author

fersarr commented Aug 19, 2019

Seems there was an issue with the PR check continuous-integration/travis-ci/pr. How can I start it again?
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

@mattip
Member

mattip commented Aug 20, 2019

Ahh, now the example is much clearer, thanks. Note that it will not close the file descriptors. You should always use open inside a context manager (with open(...) as fid:), since people tend to indiscriminately copy-paste code from the internet.
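A minimal sketch of the context-manager pattern recommended here, using only the standard library. The temporary directory and the `payload` dict are illustrative assumptions; the file name `py3_out` is taken from the example at the top of the PR.

```python
import os
import pickle
import tempfile

payload = {"SPOT": 6.0}  # hypothetical stand-in for the array
path = os.path.join(tempfile.mkdtemp(), "py3_out")

# The file is closed automatically when the with-block exits,
# even if pickle.dump raises.
with open(path, "wb") as fid:
    pickle.dump(payload, fid, protocol=2)
write_handle_closed = fid.closed

with open(path, "rb") as fid:
    loaded = pickle.load(fid)
read_handle_closed = fid.closed
```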

@charris charris modified the milestones: 1.16.5 release, 1.16.6 Aug 22, 2019
@charris
Member

charris commented Aug 22, 2019

I kicked this off to 1.16.6.

@mattip
Member

mattip commented Aug 22, 2019

I modified the example to use a context manager.

@fersarr fersarr force-pushed the 1.16_2 branch 2 times, most recently from 5cb5f63 to e19333c Compare August 23, 2019 14:06
@fersarr
Author

fersarr commented Aug 23, 2019

Just pushed the refactored C code. It seems there are some issues with the checks again, unless I am not reading it properly.

@fersarr
Author

fersarr commented Aug 30, 2019

@mattip Hi! Sorry to insist, do I have a way to re-start the checks? The failures are not related to my changes I believe

@mattip
Member

mattip commented Aug 30, 2019

This looks good to me. Anyone else want to take a look?

@mattip
Member

mattip commented Aug 30, 2019

codecov is complaining since we only collect statistics on python3, and this is a fix for python2

else if (PyUnicode_Check(name)) {
    // The field names of a structured dtype were pickled in PY3 as unicode strings
    // so, to unpickle them in PY2, we need to convert them to PY2 strings
    new_name = PyUnicode_AsEncodedString(name, "utf-8", "strict");
Member

Doesn't this result in an un-round-trippable field name? I think this encoding needs to match line 2795, else the following will fail:

  • Write in py3 with non-ascii field names
  • Read in py2
  • Write back out in py2
  • Read in py3

Choosing ascii as an encoding might be safest here, simply because it's the intersection of latin1 and utf8, both of which we use elsewhere in compatibility code.
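The "intersection" argument can be illustrated with a short sketch. The helper `encodings_agree` and the sample names are hypothetical, and which codec line 2795 actually uses is not shown in this thread, so the utf-8/latin1 mismatch below is only an assumed scenario.

```python
def encodings_agree(name):
    """True if ascii, latin1 and utf-8 all produce the same bytes for name."""
    try:
        as_ascii = name.encode("ascii")
    except UnicodeEncodeError:
        # Not representable in ascii, so the three codecs cannot all agree.
        return False
    return as_ascii == name.encode("latin1") == name.encode("utf-8")


ascii_name = "SPOT"
non_ascii_name = "caf\u00e9"

# Encoding with one codec and decoding with another corrupts
# non-ascii names, so such a field name would not round-trip.
mismatched = non_ascii_name.encode("utf-8").decode("latin1")
```

For pure-ascii names every step of the write/read/write/read sequence above yields the same bytes regardless of which of the three codecs is used, which is the safety property being argued for.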

Member

@fersarr - thoughts? Can we construct another test that will

  • add another bytestring result from the result of step 3
  • on python3 load the bytestring and compare it to the original
  • on python2 check that the result of the current test matches this second string

Author

Hey Hey. Sure, I will give that a go as soon as I am back from holiday!

Author

@eric-wieser @mattip I just pushed the new changes, hopefully now we are okay to merge. Thanks for all the help.

@mattip mattip requested a review from eric-wieser September 21, 2019 17:05
'''

loaded = pickle.loads(saved_pickle_from_py2)
assert loaded == expected_data
Member

@eric-wieser eric-wieser Sep 22, 2019

Instead of running this conditionally on py3 only, should we change the test to the following?

if py3:
    assert loads(saved_pickle_from_py2) == expected_data
else:
    # check that our string above is what we claim on py2
    assert dumps(expected_data) == saved_pickle_from_py2

Author

@fersarr fersarr Sep 30, 2019

done! thanks. I used the idea for both tests

Author

@fersarr fersarr Sep 30, 2019

Hmm @eric-wieser, it seems doing that does not work on Windows and PyPy:

PyPy error
https://dev.azure.com/numpy/numpy/_build/results?buildId=5698&view=logs

=================================== FAILURES ===================================
_____________ test_py3_can_load_py2_pickle_with_dtype_field_names ______________

    def test_py3_can_load_py2_pickle_with_dtype_field_names():
        # gh-2407 and PR #14275
        # Roundtrip: Py3 should be able to load a pickle that was created in PY2
        # after loading the saved_pickle (from PY3) in the test named
        # 'test_py2_can_load_py3_pickle_with_dtype_field_names'
        import numpy as np
    
        expected_dtype = np.dtype([('SPOT', np.float64)])
        expected_data = np.array([(6.0)], dtype=expected_dtype)
        # Pickled under Python 2.7.16 with protocol=2 after it was loaded
        # by test 'test_py2_can_load_py3_pickle_with_dtype_field_names'
        saved_pickle_from_py2 = b'''\
    \x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x01cnumpy\nndarray\n\
    q\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnumpy\ndtype\nq\x04U\x02\
    V8K\x00K\x01\x87Rq\x05(K\x03U\x01|NU\x04SPOTq\x06\x85q\x07}q\x08h\x06h\
    \x04U\x02f8K\x00K\x01\x87Rq\t(K\x03U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\
    \xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\
    \x00\x18@tb.\
    '''
    
        if sys.version_info[0] >= 3:  # PY3
            assert pickle.loads(saved_pickle_from_py2) == expected_data
        else:
            # check that the string above is what we claim on PY2
>           assert pickle.dumps(expected_data, protocol=2) == saved_pickle_from_py2
E           AssertionError: assert '\x80\x02cnum...q\x17tq\x18b.' == '\x80\x02cnump...0\x00\x18@tb.'
E               \x80\x02cnumpy.core.multiarray
E               _reconstruct
E               q\x01cnumpy
E               ndarray
E             - q\x02K\x00\x85q\x03U\x01bq\x04\x87q\x05Rq\x06(K\x01K\x01\x85q\x07cnumpy
E             ?               -----      -----    -----     ^               -----
E             + q\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnumpy...
E             
E             ...Full output truncated (8 lines hidden), use '-vv' to show

expected_data = array([(6.,)], dtype=[('SPOT', '<f8')])
expected_dtype = dtype([('SPOT', '<f8')])
np         = <module 'numpy' from '/home/vsts/work/1/s/build/testenv/site-packages/numpy/__init__.pyc'>
saved_pickle_from_py2 = '\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x01cnumpy\nndarray\nq\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnum...U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\x00\x18@tb.'

numpy/tests/test_reloading.py:96: AssertionError
- generated xml file: /home/vsts/work/1/s/build/testenv/site-packages/junit/test-results.xml -

Windows error
https://dev.azure.com/numpy/numpy/_build/results?buildId=5698&view=logs

_____________ test_py3_can_load_py2_pickle_with_dtype_field_names _____________

    def test_py3_can_load_py2_pickle_with_dtype_field_names():
        # gh-2407 and PR #14275
        # Roundtrip: Py3 should be able to load a pickle that was created in PY2
        # after loading the saved_pickle (from PY3) in the test named
        # 'test_py2_can_load_py3_pickle_with_dtype_field_names'
        import numpy as np
    
        expected_dtype = np.dtype([('SPOT', np.float64)])
        expected_data = np.array([(6.0)], dtype=expected_dtype)
        # Pickled under Python 2.7.16 with protocol=2 after it was loaded
        # by test 'test_py2_can_load_py3_pickle_with_dtype_field_names'
        saved_pickle_from_py2 = b'''\
    \x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x01cnumpy\nndarray\n\
    q\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnumpy\ndtype\nq\x04U\x02\
    V8K\x00K\x01\x87Rq\x05(K\x03U\x01|NU\x04SPOTq\x06\x85q\x07}q\x08h\x06h\
    \x04U\x02f8K\x00K\x01\x87Rq\t(K\x03U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\
    \xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\
    \x00\x18@tb.\
    '''
    
        if sys.version_info[0] >= 3:  # PY3
            assert pickle.loads(saved_pickle_from_py2) == expected_data
        else:
            # check that the string above is what we claim on PY2
>           assert pickle.dumps(expected_data, protocol=2) == saved_pickle_from_py2
E           AssertionError: assert '\x80\x02cnum...0\x00\x18@tb.' == '\x80\x02cnump...0\x00\x18@tb.'
E             Skipping 90 identical leading characters in diff, use -v to show
E             - \x03(K\x01\x8a\x01\x01\x85cnumpy
E             ?           ^^^^^^^^
E             + \x03(K\x01K\x01\x85cnumpy
E             ?           ^
E               dtype
E               q\x04U\x02V8K\x00K\x01\x87Rq\x05(K\x03U\x01|NU\x04SPOTq\x06\x85q\x07}q\x08h\x06h\x04U\x02f8K\x00K\x01\x87Rq\t(K\x03U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\x00\x18@tb.

expected_data = array([(6.,)], dtype=[('SPOT', '<f8')])
expected_dtype = dtype([('SPOT', '<f8')])
np         = <module 'numpy' from 'C:\hostedtoolcache\windows\Python\2.7.16\x64\lib\site-packages\numpy\__init__.pyc'>
saved_pickle_from_py2 = '\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x01cnumpy\nndarray\nq\x02K\x00\x85U\x01b\x87Rq\x03(K\x01K\x01\x85cnum...U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tbK\x00\x86sK\x08K\x01K\x10tb\x89U\x08\x00\x00\x00\x00\x00\x00\x18@tb.'

C:\hostedtoolcache\windows\Python\2.7.16\x64\lib\site-packages\numpy\tests\test_reloading.py:96: AssertionError
------- generated xml file: D:\a\1\s\build\test\junit\test-results.xml --------
=========================== short test summary info ===========================
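As a reading aid for the two diffs above: the PyPy stream appears to contain extra `q\x..` entries (BINPUT memo opcodes), and the Windows stream has `\x8a\x01\x01` (LONG1) where the expected stream has `K\x01` (BININT1). The sketch below, standard library only and with a hypothetical `data` payload and `count_puts` helper, shows that two byte streams can differ in exactly this kind of detail and still unpickle to equal objects. It uses `pickletools.optimize`, which strips unused memo opcodes, to produce the second stream; it does not reproduce the platform differences themselves.

```python
import pickle
import pickletools

data = ([1, 2], "SPOT")  # hypothetical payload

raw = pickle.dumps(data, protocol=2)
# optimize() removes PUT opcodes whose memo entries are never fetched.
compact = pickletools.optimize(raw)


def count_puts(stream):
    """Count the memo-store opcodes (PUT, BINPUT, LONG_BINPUT) in a stream."""
    return sum(1 for op, arg, pos in pickletools.genops(stream)
               if "PUT" in op.name)


same_bytes = raw == compact
same_objects = pickle.loads(raw) == pickle.loads(compact)
```

Comparing pickled bytes is therefore a stricter check than comparing the unpickled objects.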

@fersarr fersarr force-pushed the 1.16_2 branch 3 times, most recently from 1028f4c to 4418998 Compare September 30, 2019 10:20
assert pickle.loads(saved_pickle_from_py2) == expected_data
else:
# check that the string above is what we claim on PY2
assert pickle.dumps(expected_data, protocol=2) == saved_pickle_from_py2
Member

It seems the actual byte stream produced on cpython linux 2.7 is not the same as the one produced on windows 2.7 nor the one produced on pypy 2.7. Rather than try to diagnose exactly why the stream is different, perhaps we could only assert they are identical when sys.platform.startswith('linux') and not IS_PYPY, after from numpy.testing import IS_PYPY. I verified that the saved_pickle_from_py2 correctly imports on those platforms, even though the roundtrip is not accurate. I don't want to dive into decoding the byte stream to find the difference.
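The guard proposed here could look roughly like the sketch below. The helper `should_compare_bytes` is hypothetical, and `IS_PYPY` is recomputed from the standard library as a stand-in for the flag that the comment imports from `numpy.testing`, so that the snippet runs without NumPy.

```python
import platform
import sys


def should_compare_bytes(platform_name, implementation):
    """Only compare exact pickle bytes on CPython running on Linux."""
    # startswith() matters because Python 2 reports 'linux2', not 'linux'.
    return platform_name.startswith("linux") and implementation != "PyPy"


# Stand-in for numpy.testing.IS_PYPY (an assumption for this sketch).
IS_PYPY = platform.python_implementation() == "PyPy"

compare_here = should_compare_bytes(sys.platform,
                                    platform.python_implementation())
```

On platforms where the guard is false, the test would still load `saved_pickle_from_py2` and compare the resulting array, skipping only the byte-for-byte comparison of `pickle.dumps` output.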

Author

@fersarr fersarr Oct 4, 2019

Sounds good. Done!

Author

@mattip @eric-wieser any other comments? can we merge?

fixing to allow unpickling of PY3 pickles from PY2, particularly when
the pickle contains structured dtypes with field names. see numpy gh-2407
@fersarr
Author

fersarr commented Oct 18, 2019

@eric-wieser All your suggestions are done now :)

@mattip mattip requested a review from eric-wieser October 20, 2019 09:04
@mattip mattip merged commit e9322e8 into numpy:maintenance/1.16.x Oct 20, 2019
@mattip
Member

mattip commented Oct 20, 2019

Thanks @fersarr
