
read_csv(engine='c') can insert spurious rows full of NaNs #10022

Closed
jblackburne opened this issue Apr 29, 2015 · 9 comments · Fixed by #10023
Labels: IO CSV read_csv, to_csv

Comments

@jblackburne (Contributor) commented Apr 29, 2015

I have a well-formed CSV file with about 70k lines that parses into a DataFrame with about 170k rows, where the extra rows are entirely NaN. It only happens with the 'c' engine.

This is with git master.

I won't bother to upload the CSV file in question, because I think I've tracked down the problem: tokenize_delimited() processes the data in chunks, and the spurious rows appear when a chunk happens to start with '\n'. I'll send a pull request.

@jreback (Contributor) commented Apr 29, 2015

skip_blank_lines=True already handles this (and is the default). Please show the version you are using.

jreback added the IO CSV read_csv, to_csv label Apr 29, 2015
@jblackburne (Contributor, Author)

skip_blank_lines=True does not fix the problem for this CSV file. I will put together a self-contained example for reproducibility.

Here are my versions:

>>> import pandas
>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.5-100.fc20.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.0-222-g845cec9
nose: 1.3.0
Cython: 0.21.1
numpy: 1.9.2
scipy: 0.12.1
statsmodels: None
IPython: 2.3.1
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jblackburne (Contributor, Author)

Here is a self-contained test case. With my patch, the assert no longer fails.

from __future__ import print_function
from StringIO import StringIO
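# (Python 2 era; on Python 3, use `from io import StringIO` instead)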

import numpy as np
import pandas as pd


# Create the input data array
csv_buf = StringIO()
data_in = np.random.random((25000, 8)) * 1000

# This next line is necessary to line up a newline with a chunk boundary
# Yes, it's a corner case
print('a'*55, file=csv_buf, end='\r\n')

# Now write out the data in CSV format
for row in data_in:
    print('{:8.0f},{:8.0f},{:8.0f},{:8.0f},{:8.0f},{:15.7f},{:15.7f},'
          '{:25.16f}'.format(*row), file=csv_buf, end='\r\n')

# Parse the csv
csv_buf.seek(0)
data_out = pd.read_csv(csv_buf, skiprows=1, header=None)

# This fails in pandas HEAD
assert(data_out.shape == data_in.shape)

@evanpw (Contributor) commented Apr 30, 2015

I think there's still an issue even after your fix: we can't backtrack past the beginning of the chunk, so if we cross a chunk boundary while in the WHITESPACE_LINE state and then find a non-whitespace character, we've already thrown away data that we need. Example:

text = 'a' * (1024 * 256 - 2) + '\n ' + 'hello'
data_out = pd.read_csv(StringIO(text), skiprows=1, header=None)
assert data_out.at[0, 0] == ' hello' # Fails

@jblackburne (Contributor, Author)

Good catch @evanpw. Perhaps a better strategy than backtracking would be to cache line-leading whitespace until we hit either data (keep the cached whitespace) or a newline (discard it); see the sketch below.

I think it would be best to spin yours off into a separate issue: my proposed change fixes my problem without exacerbating yours, and it is much less work than a full fix for your case.
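
Here is a rough sketch of that caching idea in Python (illustrative pseudocode of the strategy only, not the actual C tokenizer, and it ignores quoting):

def tokenize(chars):
    # Cache line-leading whitespace in parser state instead of backtracking
    # into the input buffer; state survives chunk boundaries, buffers don't.
    rows, row, field = [], [], []
    pending_ws = []  # whitespace at the start of a line, fate not yet known
    for ch in chars:
        if ch in ' \t' and not field and not row:
            pending_ws.append(ch)        # might be a blank line: hold on to it
        elif ch == '\n':
            if row or field:
                row.append(''.join(field))
                rows.append(row)
                row, field = [], []
            pending_ws = []              # whitespace-only line: discard cache
        else:
            field = pending_ws + field   # hit data: cached whitespace is kept
            pending_ws = []
            if ch == ',':
                row.append(''.join(field))
                field = []
            else:
                field.append(ch)
    if row or field:                     # flush a final record with no newline
        row.append(''.join(field))
        rows.append(row)
    return rows

With this approach, tokenize('\n hello') yields [[' hello']] and a whitespace-only line is dropped, with no backtracking required.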

@jblackburne (Contributor, Author)

I've been looking through the unit tests, and I haven't found any that exercise the parsers or read_csv. Am I missing something, or are these parts of the code not covered?

@evanpw (Contributor) commented Apr 30, 2015

They're just in a different place: look at pandas/io/tests.

@jblackburne (Contributor, Author)

Ah, I see them now. Thanks.

jreback modified the milestones: 0.17.0, 0.16.1 May 7, 2015
@havardox commented Nov 23, 2024

Still an issue in Pandas 2.2.2.

import urllib.request

import pandas as pd


def text_cleanup(df: pd.DataFrame):
    df = df.replace(r"\n", "", regex=True)
    df = df.replace(r'  +', ' ', regex=True)
    df = df.apply(strip_chars)
    return df

# Function to strip specific characters
def strip_chars(s: pd.Series):
    if s.dtype == "object":
        s = s.str.strip()
        s = s.str.strip('"')
        s = s.str.strip("'")
        s = s.str.lower()
        s = s.str.strip()
    return s
    
url = "https://raw.githubusercontent.com/havardox/example_json_file/refs/heads/master/example_data.json"

response = urllib.request.urlopen(url)
example_data_1 = response.read()

with open("example_data.json", 'wb') as f:
    f.write(example_data_1)

example_data_1 = pd.read_json("example_data.json")
example_data_1 = text_cleanup(example_data_1)
print(len(example_data_1))

example_data_1.to_csv("example_data.csv", index=False)

example_data_2 = pd.read_csv("example_data.csv", engine="c")
print(len(example_data_2))

This outputs:

62758
73020
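
If the extra rows here are the same all-NaN kind described in the original report, a quick check like this should show where they land (plain pandas calls; a diagnostic sketch, not from the original report):

mask = example_data_2.isna().all(axis=1)
print(mask.sum(), "rows that are entirely NaN")
print(example_data_2.index[mask][:10])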

Interestingly, removing the text_cleanup call makes the DataFrames equal in length:

import urllib.request

import pandas as pd

url = "https://raw.githubusercontent.com/havardox/example_json_file/refs/heads/master/example_data.json"

response = urllib.request.urlopen(url)
example_data_1 = response.read()

with open("example_data.json", 'wb') as f:
    f.write(example_data_1)

example_data_1 = pd.read_json("example_data.json")
print(len(example_data_1))

example_data_1.to_csv("example_data.csv", index=False)

example_data_2 = pd.read_csv("example_data.csv", engine="c")
print(len(example_data_2))

This outputs:

62758
62758

Anyone know what's going on here?
