
read_csv(engine='c') can insert spurious rows full of NaNs #10022

Closed
jblackburne opened this issue Apr 29, 2015 · 9 comments · Fixed by #10023
Labels: IO CSV read_csv, to_csv

Comments

@jblackburne (Contributor) commented Apr 29, 2015

I have a well-formed CSV file with about 70k lines that parses into a DataFrame with about 170k rows, where the extra rows are entirely NaN. It only happens with the 'c' engine.

This is with git master.

I won't bother to upload the CSV file in question, because I think I've tracked down the problem: tokenize_delimited() processes the data in chunks, and the spurious rows appear when a chunk happens to start with '\n'. I'll send a pull request.

@jreback (Contributor) commented Apr 29, 2015

skip_blank_lines=True already handles this (and is the default). Please show the version you are using.

jreback added the IO CSV read_csv, to_csv label Apr 29, 2015
@jblackburne (Contributor, Author)

skip_blank_lines=True does not fix the problem for this CSV file. I will put together a self-contained example for reproducibility.

Here are my versions:

>>> import pandas
>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.5-100.fc20.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.0-222-g845cec9
nose: 1.3.0
Cython: 0.21.1
numpy: 1.9.2
scipy: 0.12.1
statsmodels: None
IPython: 2.3.1
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jblackburne (Contributor, Author)

Here is a self-contained test case. With my patch, the assert no longer fails.

from __future__ import print_function
from StringIO import StringIO
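# (Python 2 era; on Python 3, use `from io import StringIO` instead)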

import numpy as np
import pandas as pd


# Create the input data array
csv_buf = StringIO()
data_in = np.random.random((25000, 8)) * 1000

# This next line is necessary to line up a newline with a chunk boundary
# Yes, it's a corner case
print('a'*55, file=csv_buf, end='\r\n')

# Now write out the data in CSV format
for row in data_in:
    print('{:8.0f},{:8.0f},{:8.0f},{:8.0f},{:8.0f},{:15.7f},{:15.7f},'
          '{:25.16f}'.format(*row), file=csv_buf, end='\r\n')

# Parse the csv
csv_buf.seek(0)
data_out = pd.read_csv(csv_buf, skiprows=1, header=None)

# This fails in pandas HEAD
assert(data_out.shape == data_in.shape)

@evanpw (Contributor) commented Apr 30, 2015

I think there's still an issue even after your fix: we can't backtrack past the beginning of the chunk, so if we cross a chunk boundary while in the WHITESPACE_LINE state and then find a non-whitespace character, we've already thrown away data that we need. Example:

text = 'a' * (1024 * 256 - 2) + '\n ' + 'hello'
data_out = pd.read_csv(StringIO(text), skiprows=1, header=None)
assert data_out.at[0, 0] == ' hello' # Fails

@jblackburne (Contributor, Author)

Good catch @evanpw. Perhaps a better strategy than backtracking would be to cache line-leading whitespace until we hit either data (keep the cached whitespace) or a newline (discard it); see the sketch below.

I think it would be best to spin yours off into a separate issue: my proposed change fixes my problem without exacerbating yours, and it is much less work than a full fix for your case.
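
Here is a rough sketch of that caching idea in Python (illustrative pseudocode of the strategy only, not the actual C tokenizer, and it ignores quoting):

def tokenize(chars):
    # Cache line-leading whitespace in parser state instead of backtracking
    # into the input buffer; state survives chunk boundaries, buffers don't.
    rows, row, field = [], [], []
    pending_ws = []  # whitespace at the start of a line, fate not yet known
    for ch in chars:
        if ch in ' \t' and not field and not row:
            pending_ws.append(ch)        # might be a blank line: hold on to it
        elif ch == '\n':
            if row or field:
                row.append(''.join(field))
                rows.append(row)
                row, field = [], []
            pending_ws = []              # whitespace-only line: discard cache
        else:
            field = pending_ws + field   # hit data: cached whitespace is kept
            pending_ws = []
            if ch == ',':
                row.append(''.join(field))
                field = []
            else:
                field.append(ch)
    if row or field:                     # flush a final record with no newline
        row.append(''.join(field))
        rows.append(row)
    return rows

With this approach, tokenize('\n hello') yields [[' hello']] and a whitespace-only line is dropped, with no backtracking required.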

@jblackburne (Contributor, Author)

I've been looking through the unit tests, and I haven't found any that exercise the parsers or read_csv. Am I missing something, or are these parts of the code not covered?

@evanpw (Contributor) commented Apr 30, 2015

They're just in a different place: look at pandas/io/tests.

@jblackburne (Contributor, Author)

Ah, I see them now. Thanks.

jreback modified the milestones: 0.17.0, 0.16.1 May 7, 2015
@havardox commented Nov 23, 2024

Still an issue in Pandas 2.2.2.

import urllib.request

import pandas as pd


def text_cleanup(df: pd.DataFrame):
    df = df.replace(r"\n", "", regex=True)
    df = df.replace(r'  +', ' ', regex=True)
    df = df.apply(strip_chars)
    return df

# Function to strip specific characters
def strip_chars(s: pd.Series):
    if s.dtype == "object":
        s = s.str.strip()
        s = s.str.strip('"')
        s = s.str.strip("'")
        s = s.str.lower()
        s = s.str.strip()
    return s
    
url = "https://raw.githubusercontent.com/havardox/example_json_file/refs/heads/master/example_data.json"

response = urllib.request.urlopen(url)
example_data_1 = response.read()

with open("example_data.json", 'wb') as f:
    f.write(example_data_1)

example_data_1 = pd.read_json("example_data.json")
example_data_1 = text_cleanup(example_data_1)
print(len(example_data_1))

example_data_1.to_csv("example_data.csv", index=False)

example_data_2 = pd.read_csv("example_data.csv", engine="c")
print(len(example_data_2))

This outputs:

62758
73020
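
If the extra rows here are the same all-NaN kind described in the original report, a quick check like this should show where they land (plain pandas calls; a diagnostic sketch, not from the original report):

mask = example_data_2.isna().all(axis=1)
print(mask.sum(), "rows that are entirely NaN")
print(example_data_2.index[mask][:10])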

Interestingly, removing the text_cleanup call makes the DataFrames equal in length:

import urllib.request

import pandas as pd

url = "https://raw.githubusercontent.com/havardox/example_json_file/refs/heads/master/example_data.json"

response = urllib.request.urlopen(url)
example_data_1 = response.read()

with open("example_data.json", 'wb') as f:
    f.write(example_data_1)

example_data_1 = pd.read_json("example_data.json")
print(len(example_data_1))

example_data_1.to_csv("example_data.csv", index=False)

example_data_2 = pd.read_csv("example_data.csv", engine="c")
print(len(example_data_2))

This outputs:

62758
62758

Anyone know what's going on here?
