read_csv(engine='c') can insert spurious rows full of NaNs #10022
Here are my versions:

>>> import pandas
>>> pandas.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.5-100.fc20.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.0-222-g845cec9
nose: 1.3.0
Cython: 0.21.1
numpy: 1.9.2
scipy: 0.12.1
statsmodels: None
IPython: 2.3.1
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
Here is a self-contained test case. With my patch, the assert no longer fails.

from __future__ import print_function
from StringIO import StringIO
import numpy as np
import pandas as pd
# Create the input data array
csv_buf = StringIO()
data_in = np.random.random((25000, 8)) * 1000
# This next line is necessary to line up a newline with a chunk boundary
# Yes, it's a corner case
print('a'*55, file=csv_buf, end='\r\n')
# Now write out the data in CSV format
for row in data_in:
    print('{:8.0f},{:8.0f},{:8.0f},{:8.0f},{:8.0f},{:15.7f},{:15.7f},'
          '{:25.16f}'.format(*row), file=csv_buf, end='\r\n')
# Parse the csv
csv_buf.seek(0)
data_out = pd.read_csv(csv_buf, skiprows=1, header=None)
# This fails in pandas HEAD
assert(data_out.shape == data_in.shape)
I think there's still an issue even after your fix: we can't backtrack past the beginning of the chunk, so if we cross a chunk boundary while in the state
Good catch @evanpw. Perhaps a better strategy than backtracking would be to cache whitespace until hitting either data (save the whitespace) or a newline (discard it). I think it would be best to spin yours off into a separate issue: my proposed change fixes my problem without exacerbating yours, and it is much less work than a fix for your issue would take.
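To make that caching idea concrete, here is a minimal Python sketch of a toy delimited tokenizer (toy_tokenize is an illustrative name only, not how the pandas C tokenizer is actually written): whitespace is buffered rather than committed, real data flushes the buffer into the current field, and a newline throws the buffer away, so no backtracking is needed.

def toy_tokenize(text):
    rows, row, field, ws_buf = [], [], [], []
    for ch in text:
        if ch in (' ', '\t'):
            ws_buf.append(ch)          # cache whitespace; don't commit it yet
        elif ch in ('\r', '\n'):
            ws_buf = []                # newline: discard the cached whitespace
            if field or row:
                row.append(''.join(field))
                rows.append(row)
                row, field = [], []
        elif ch == ',':
            ws_buf = []                # simplification: drop whitespace before a delimiter too
            row.append(''.join(field))
            field = []
        else:
            field.extend(ws_buf)       # real data: flush the cached whitespace into the field
            ws_buf = []
            field.append(ch)
    if field or row:                   # flush a final row that has no trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows

# Trailing whitespace before the newline is discarded, so no empty row appears:
print(toy_tokenize('1,2 \r\n3,4\r\n'))   # [['1', '2'], ['3', '4']]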
I've been looking through the unit tests, and I haven't found any that exercise the parsers or read_csv. Am I missing something, or are these parts of the code not covered?
They're just in a different place: look at pandas/io/tests
Ah, I see them now. Thanks.
Still an issue in Pandas 2.2.2.

import urllib.request
import pandas as pd
def text_cleanup(df: pd.DataFrame):
    df = df.replace(r"\n", "", regex=True)
    df = df.replace(r' +', ' ', regex=True)
    df = df.apply(strip_chars)
    return df

# Function to strip specific characters
def strip_chars(s: pd.Series):
    if s.dtype == "object":
        s = s.str.strip()
        s = s.str.strip('"')
        s = s.str.strip("'")
        s = s.str.lower()
        s = s.str.strip()
    return s
url = "https://raw.githubusercontent.com/havardox/example_json_file/refs/heads/master/example_data.json"
response = urllib.request.urlopen(url)
example_data_1 = response.read()
with open("example_data.json", 'wb') as f:
    f.write(example_data_1)
example_data_1 = pd.read_json("example_data.json")
example_data_1 = text_cleanup(example_data_1)
print(len(example_data_1))
example_data_1.to_csv("example_data.csv", index=False)
example_data_2 = pd.read_csv("example_data.csv", engine="c")
print(len(example_data_2))

This outputs:
Interestingly, removing the text_cleanup step makes the dataframes equal in length.

import urllib.request
import pandas as pd
def text_cleanup(df: pd.DataFrame):
    df = df.replace(r"\n", "", regex=True)
    df = df.replace(r' +', ' ', regex=True)
    df = df.apply(strip_chars)
    return df

# Function to strip specific characters
def strip_chars(s: pd.Series):
    if s.dtype == "object":
        s = s.str.strip()
        s = s.str.strip('"')
        s = s.str.strip("'")
        s = s.str.lower()
        s = s.str.strip()
    return s
url = "https://raw.githubusercontent.com/havardox/example_json_file/refs/heads/master/example_data.json"
response = urllib.request.urlopen(url)
example_data_1 = response.read()
with open("example_data.json", 'wb') as f:
    f.write(example_data_1)
example_data_1 = pd.read_json("example_data.json")
print(len(example_data_1))
example_data_1.to_csv("example_data.csv", index=False)
example_data_2 = pd.read_csv("example_data.csv", engine="c")
print(len(example_data_2))

This outputs:
Anyone know what's going on here?
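One way to narrow this down is a quick diagnostic sketch (assuming the example_data.csv written by the snippet above): parse the same file with both engines and inspect any rows that come back entirely NaN. If only the C engine produces extra all-NaN rows, that points at the C tokenizer rather than at the data itself.

import pandas as pd

# Assumes the example_data.csv produced by the snippet above.
c_df = pd.read_csv("example_data.csv", engine="c")
py_df = pd.read_csv("example_data.csv", engine="python")
print(len(c_df), len(py_df))   # a length mismatch between engines implicates the parser

# Spurious rows are typically entirely NaN; look at them directly.
print(c_df[c_df.isna().all(axis=1)].head())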
I have a well-formed CSV file with about 70k lines that parses out to a DataFrame with about 170k rows, where the extra rows are just full of NaNs. It only happens with the 'c' engine.
This is with git master.
I won't bother to upload the CSV file in question because I think I've tracked down the problem. It seems that tokenize_delimited() runs on chunks of data at a time, and the problem occurs when a chunk happens to start with '\n'. I'll send a pull request.
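Until the tokenizer itself is fixed, a pragmatic guard (not a fix, and only safe when genuinely empty rows cannot occur in the real data) is to drop all-NaN rows after parsing; the file name below is a placeholder, not a file from this issue.

import pandas as pd

df = pd.read_csv("my_file.csv", engine="c")   # placeholder path
df = df.dropna(how="all")                     # drop the rows that came back entirely NaN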