BUG: read_csv with engine pyarrow parsing multiple date columns #50056
Conversation
@@ -107,7 +107,7 @@ def _finalize_pandas_output(self, frame: DataFrame) -> DataFrame:
                 multi_index_named = False
             frame.columns = self.names
         # we only need the frame not the names
-        frame.columns, frame = self._do_date_conversions(frame.columns, frame)
+        _, frame = self._do_date_conversions(frame.columns, frame)
This gives us back the frame with already changed column names?
Yeah, _do_date_conversions changes the names in the data_dict/frame too.
I'm actually not too sure why names is returned from this function again (I guess it might have been necessary before dicts were ordered?).
EDIT: It's probably related to building a MultiIndex from the columns for the other engines. _do_date_conversions can always fix the frame directly, so the returned names aren't relevant for the pyarrow engine.
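A rough, hypothetical illustration of why updated names come back alongside the frame (this is not the real base_parser code, just the shape of the contract): a dict-style parse_dates both rewrites the data and changes the column names, and the other engines reuse those names when building a (Multi)Index.

import pandas as pd

# Illustrative only -- not the actual _do_date_conversions implementation.
# A dict-style parse_dates like {"a": ["index"]} converts the data and also
# renames the column, so the caller gets the updated names back as well.
frame = pd.DataFrame({"index": ["2000-01-03", "2000-01-04"], "A": [0.98, 1.05]})
new_name, source_cols = "a", ["index"]

frame[new_name] = pd.to_datetime(frame[source_cols[0]])
frame = frame.drop(columns=source_cols)

names = list(frame.columns)  # what the C/python engines reuse for index building
print(names)                 # ['A', 'a']
print(frame.dtypes["a"])     # datetime64[ns] (resolution may vary by pandas version)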
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
# In case of dict, we don't want to propagate through, so
# just set to pyarrow default of None

# Ideally, in future we disable pyarrow dtype inference (read in as string)
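One possible way to get that "read in as string" behavior from pyarrow directly (sketch only; the file path and the idea of forcing every column to pa.string() via ConvertOptions.column_types are assumptions, not what this PR implements):

import pyarrow as pa
from pyarrow import csv

path = "pandas/tests/io/data/csv/test1.csv"  # any CSV with a header row

# Force every column to string so pyarrow performs no timestamp/date
# inference; pandas' own parse_dates / to_datetime logic would then stay
# in control of date parsing.
with open(path) as f:
    header = f.readline().rstrip("\n").split(",")

opts = csv.ConvertOptions(column_types={name: pa.string() for name in header})
table = csv.read_csv(path, convert_options=opts)
print(table.schema)  # all fields read as string; nothing inferred as timestamp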
We have to create a conversion option that is arrow-only for this; otherwise we incur a big performance penalty.
Since there's no way to disable pyarrow's own parsing, the column only gets parsed once, as a pyarrow timestamp/date.
I think in your other PR you set it up so that pandas' date parsing is bypassed for Arrow timestamp columns, so we won't double-parse the input.
So there won't be a perf penalty, just a wrong result if you didn't want pyarrow to parse the date.
Maybe I am misunderstanding this, but don't we have to convert to NumPy to parse with to_datetime? That's what I meant by slow.
What happens if dtype_backend is set to pyarrow in this case?
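A small sketch of the round trip that concern describes (assumed example data, not from this PR): the Arrow-backed string column is materialized as a NumPy object array before to_datetime parses it, which is the slow path.

import pandas as pd

# Arrow-backed string column, roughly as the pyarrow engine would hand it over.
ser = pd.Series(["2000-01-03", "2000-01-04"], dtype="string[pyarrow]")

# Materialize to NumPy (object dtype), then parse -- this extra conversion
# is the performance concern being raised here.
parsed = pd.to_datetime(ser.to_numpy())
print(parsed.dtype)  # datetime64[ns], i.e. no longer Arrow-backed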
OK, I went back and checked the output, and it looks right except for the one case where parse_dates is a dict that does a no-op, i.e. it maps the column to itself (see the illustrative snippet after the REPL output below). This is a very uncommon case, though (I'm not sure why you'd want to do that instead of passing a list).
Do you think it'd be better to fix the root cause (#52545) than to special-case here? (I can try to take a look at that sometime soon.)
Here's what I get in the REPL, btw.
>>> import pandas as pd
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow")
index A B C D
0 2000-01-03 0.980269 3.685731 -0.364217 -1.159738
1 2000-01-04 1.047916 -0.041232 -0.161812 0.212549
2 2000-01-05 0.498581 0.731168 -0.537677 1.346270
3 2000-01-06 1.120202 1.567621 0.003641 0.675253
4 2000-01-07 -0.487094 0.571455 -1.611639 0.103469
5 2000-01-10 0.836649 0.246462 0.588543 1.062782
6 2000-01-11 -0.157161 1.340307 1.195778 -1.097007
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow")
index A B C D
0 2000-01-03 00:00:00 0.980269 3.685731 -0.364217 -1.159738
1 2000-01-04 00:00:00 1.047916 -0.041232 -0.161812 0.212549
2 2000-01-05 00:00:00 0.498581 0.731168 -0.537677 1.346270
3 2000-01-06 00:00:00 1.120202 1.567621 0.003641 0.675253
4 2000-01-07 00:00:00 -0.487094 0.571455 -1.611639 0.103469
5 2000-01-10 00:00:00 0.836649 0.246462 0.588543 1.062782
6 2000-01-11 00:00:00 -0.157161 1.340307 1.195778 -1.097007
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow").dtypes # Default, OK
index timestamp[s][pyarrow]
A double[pyarrow]
B double[pyarrow]
C double[pyarrow]
D double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates=["index"]).dtypes # Parse dates as list OK
index timestamp[s][pyarrow]
A double[pyarrow]
B double[pyarrow]
C double[pyarrow]
D double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates=False).dtypes # The bug I was talking about, no way to disable dates from being parsed
index timestamp[s][pyarrow]
A double[pyarrow]
B double[pyarrow]
C double[pyarrow]
D double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates={"a": ["index"]}) # Buggy, returns datetime64[ns]
a datetime64[ns]
A double[pyarrow]
B double[pyarrow]
C double[pyarrow]
D double[pyarrow]
dtype: object
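For completeness, this is the "dict that maps the column to itself" shape mentioned above (hypothetical call against the same test file; output not verified here):

import pandas as pd

# No-op dict mapping: the key equals the single source column, so nothing is
# combined or renamed -- the uncommon shape that, per the buggy case above,
# would still go through the datetime64[ns] path instead of staying
# Arrow-backed.
df = pd.read_csv(
    "pandas/tests/io/data/csv/test1.csv",
    engine="pyarrow",
    dtype_backend="pyarrow",
    parse_dates={"index": ["index"]},
)
print(df.dtypes)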
@phofl OK to punt on this and fix the remaining issues in a follow-up?
thx @lithomas1 can you open an issue about the remaining case?
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.