BUG: python 3 compression and read_fwf #3963

TomAugspurger · 2013-06-19T21:12:19Z

Getting a TypeError: Type str doesn't support the buffer API when using read_fwf on a compressed file. This happens with Python 3 only; it works in Python 2.7.

Example:

with a file fwf_bug.txt like

import pandas as pd

print(pd.__version__)

widths = [5, 5]

!gzip -d fwf_bug.txt

df = pd.read_fwf('fwf_bug.txt', widths=widths, names=['one', 'two'])
print(df)

!gzip fwf_bug.txt

# python 3 throws an error here.
df2 = pd.read_fwf('fwf_bug.txt.gz', widths=widths,
                  names=['one', 'two'], compression='gzip')

print(df2)

Versions:
python3: 0.11.1.dev-4d06037 (should be the most recent)
python2: 0.11.1.dev-3ebfef9

I can paste the full traceback if you'd like.

TomAugspurger · 2013-06-20T04:02:03Z

Doing a bit of digging around, looks like it's a unicode thing.

The error is coming here, line 1919 in pandas/build/lib.macosx-10.8-x86_64-3.3/pandas/io/parsers.py:

    def next(self):
        line = next(self.f)
        # Note: 'colspecs' is a sequence of half-open intervals.
        return [line[fromm:to].strip(self.filler or ' ')
                for (fromm, to) in self.colspecs]

Here, line is a bytes object rather than a str:

ipdb> line
b'1111111111\n'

I'm not sure what the preferred way of dealing with this is, but

ipdb> line.decode('utf-8')[fromm:to].strip(' ')
'11111'

works.
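A minimal sketch of that decode-before-slice idea, written as a standalone function. The name split_fixed_width and its parameters are made up for illustration and are not part of pandas. It only helps when the byte-level line split is already correct, which holds for UTF-8 and other ASCII-compatible encodings.

```python
def split_fixed_width(line, colspecs, filler=" ", encoding="utf-8"):
    """Hypothetical helper: slice one line into fixed-width fields."""
    # Decode first, so that slicing and strip() both operate on str.
    if isinstance(line, bytes):
        line = line.decode(encoding)
    # 'colspecs' is a sequence of half-open (start, stop) intervals.
    return [line[start:stop].strip(filler) for start, stop in colspecs]


fields = split_fixed_width(b"1111111111\n", [(0, 5), (5, 10)])
print(fields)  # ['11111', '11111']
```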

cpcloud · 2013-06-20T04:07:15Z

should probably be

import pandas.core.common as com
com.pprint_thing(line[fromm:to]).strip(self.filler or ' ')

jtratner · 2013-09-09T02:36:51Z

basically, the problem is that you can't mix bytes and str in Python 3:

In [12]: b'abcd'.strip('')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-e487f65c9b72> in <module>()
----> 1 b'abcd'.strip('')

TypeError: Type str doesn't support the buffer API

In [13]: # Whereas it works with bytes

In [14]: b'adbc'.strip(b'')
Out[14]: b'adbc'

ghost · 2013-09-09T13:30:10Z

I'm reopening, since after staring at #4784 for a bit I think it (and my +1 of it) is wrong.

The use of next(f) when f is a BytesIO object seems dodgy,
and since next(f) doesn't strip the newline (it can't reliably), the decoding
may fail. I'm also not sure that the use of strip() is well-defined here either.

An example to illustrate some of this:

#!/usr/bin/env python3.3
from io import BytesIO
from encodings.aliases import aliases

for enc in set(aliases.values()):
    try:
        # print(enc, next(BytesIO("1234\nabcd".encode(enc)))[:-1].decode(enc)=='1234')
        bs=BytesIO("1234\nabcd".encode(enc))
        line=next(bs)
        res=line.strip().decode(enc)
        if res!='1234':
            print(enc, res)
    except LookupError:
        pass
    except UnicodeDecodeError:
        print("%s failed" % enc)

I think TextIOWrapper is the correct solution here: you can't get lines until you have text.
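A rough sketch of the TextIOWrapper approach on a gzip file, so that decoding happens before any line splitting. The helper read_fixed_width and the demo file name are invented for illustration; this is not how pandas implements read_fwf.

```python
import gzip
import io
import os
import tempfile


def read_fixed_width(path, colspecs, encoding="utf-8", filler=" "):
    """Hypothetical helper: parse a gzip-compressed fixed-width file."""
    rows = []
    # Wrap the binary gzip stream so that iteration yields decoded text lines.
    with io.TextIOWrapper(gzip.open(path, "rb"), encoding=encoding) as text:
        for line in text:
            line = line.rstrip("\r\n")
            rows.append([line[start:stop].strip(filler)
                         for start, stop in colspecs])
    return rows


# Demo with UTF-16, an encoding where splitting the raw bytes on b"\n"
# would cut a character in half.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fwf_demo.txt.gz")
    with gzip.open(path, "wb") as fh:
        fh.write("1111122222\n3333344444\n".encode("utf-16"))
    rows = read_fixed_width(path, [(0, 5), (5, 10)], encoding="utf-16")

print(rows)
```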

jtratner · 2013-09-09T16:35:40Z

@y-p TextIOWrapper is what I tried first (in #4783) and it works perfectly in 3.3 (though we'd need to edit it to use the specified encoding if and only if one is provided). However, bz2 doesn't play nice with it because it doesn't support a read1() method. (gzip is fine).

Two options:

1. Make a subclass of io.TextIOWrapper that detects whether the passed buffer defines read1() and calls read() instead if it's not defined.
2. Special case bz2 and create a wrapper class that proxies everything to the internal bz2 reader, but calls read() when read1() is asked for.

Either way would work - (1) is probably more explicit though. If bz2 is the only place where we'll need this, probably makes more sense to do (2)...
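Option (2) could look roughly like the sketch below. Read1Proxy and ReadOnlyStream are made-up names: ReadOnlyStream only stands in for a reader that lacks read1() and is not the real bz2 class, so treat this as an illustration under that assumption rather than the code that was merged.

```python
import io


class Read1Proxy:
    """Hypothetical wrapper: forwards everything to `raw`, but answers read1() via read()."""

    def __init__(self, raw):
        self._raw = raw

    def read1(self, size=-1):
        # read1() may return fewer bytes than requested, so a plain
        # read() call is an acceptable substitute here.
        return self._raw.read(size)

    def __getattr__(self, name):
        # Only reached for attributes the proxy does not define itself.
        return getattr(self._raw, name)


class ReadOnlyStream:
    """Stand-in for a binary reader without read1() (assumption for the demo)."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.closed = False

    def read(self, size=-1):
        return self._buf.read(size)

    def readable(self):
        return True

    def writable(self):
        return False

    def seekable(self):
        return False

    def flush(self):
        pass

    def close(self):
        self.closed = True


stream = Read1Proxy(ReadOnlyStream(b"1111122222\n3333344444\n"))
with io.TextIOWrapper(stream, encoding="utf-8") as text:
    lines = [line.rstrip("\n") for line in text]

print(lines)
```

Keeping the workaround in a small proxy leaves io.TextIOWrapper itself untouched, which fits the case where bz2 is the only reader that needs it.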

ghost · 2013-09-09T17:29:48Z

So you did, didn't see it earlier.

I would be fine with correct code that only works on 3.3 and raising an error (or just blowing up) otherwise.
Building a compat layer for 3.2 seems like wasted effort.

ghost assigned jtratner Sep 5, 2013

This was referenced Sep 9, 2013

BUG: Fix input bytes conversion in Py3 to return str #4783

Merged

BUG: Fix read_fwf with compressed files. #4784

Merged

jtratner closed this as completed in #4784 Sep 9, 2013

ghost reopened this Sep 9, 2013

ghost mentioned this issue Sep 9, 2013

read_fwf/table on py3 has trouble with BytesIO #4785

Closed

jtratner closed this as completed in #4783 Sep 14, 2013

wesm unassigned jtratner Oct 12, 2016
