BUG: read_csv with engine pyarrow parsing multiple date columns #50056
Conversation
@@ -107,7 +107,7 @@ def _finalize_pandas_output(self, frame: DataFrame) -> DataFrame:
                 multi_index_named = False
             frame.columns = self.names
         # we only need the frame not the names
-        frame.columns, frame = self._do_date_conversions(frame.columns, frame)
+        _, frame = self._do_date_conversions(frame.columns, frame)
This gives us back the frame with already changed column names?
Yeah, _do_date_conversions changes the names in the data_dict/frame too.
I'm actually not too sure why names is returned from this function again (I guess it might have been necessary before dicts were ordered?).
EDIT: It's probably related to building a MultiIndex from the columns for the other engines. _do_date_conversions can always fix the frame directly, so the returned names aren't relevant for the pyarrow engine.
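A rough, hypothetical illustration of why updated names come back alongside the frame (this is not the real base_parser code, just the shape of the contract): a dict-style parse_dates both rewrites the data and changes the column names, and the other engines reuse those names when building a (Multi)Index.

import pandas as pd

# Illustrative only -- not the actual _do_date_conversions implementation.
# A dict-style parse_dates like {"a": ["index"]} converts the data and also
# renames the column, so the caller gets the updated names back as well.
frame = pd.DataFrame({"index": ["2000-01-03", "2000-01-04"], "A": [0.98, 1.05]})
new_name, source_cols = "a", ["index"]

frame[new_name] = pd.to_datetime(frame[source_cols[0]])
frame = frame.drop(columns=source_cols)

names = list(frame.columns)  # what the C/python engines reuse for index building
print(names)                 # ['A', 'a']
print(frame.dtypes["a"])     # datetime64[ns] (resolution may vary by pandas version)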
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
# In case of dict, we don't want to propagate through, so
# just set to pyarrow default of None

# Ideally, in future we disable pyarrow dtype inference (read in as string)
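One possible way to get that "read in as string" behavior from pyarrow directly (sketch only; the file path and the idea of forcing every column to pa.string() via ConvertOptions.column_types are assumptions, not what this PR implements):

import pyarrow as pa
from pyarrow import csv

path = "pandas/tests/io/data/csv/test1.csv"  # any CSV with a header row

# Force every column to string so pyarrow performs no timestamp/date
# inference; pandas' own parse_dates / to_datetime logic would then stay
# in control of date parsing.
with open(path) as f:
    header = f.readline().rstrip("\n").split(",")

opts = csv.ConvertOptions(column_types={name: pa.string() for name in header})
table = csv.read_csv(path, convert_options=opts)
print(table.schema)  # all fields read as string; nothing inferred as timestamp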
We have to create a conversion option that is arrow-only for this; otherwise we incur a big performance penalty.
Since there's no way to disable pyarrow's own parsing, the column only gets parsed once, as a pyarrow timestamp/date.
I think in your other PR you set it up so that pandas' date parsing is bypassed for Arrow timestamp columns, so we won't double-parse the input.
So there won't be a perf penalty, just a wrong result if you didn't want pyarrow to parse the date.
Maybe I am misunderstanding this, but don't we have to convert to NumPy to parse with to_datetime? That's what I meant by slow.
What happens if dtype_backend is set to pyarrow in this case?
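A small sketch of the round trip that concern describes (assumed example data, not from this PR): the Arrow-backed string column is materialized as a NumPy object array before to_datetime parses it, which is the slow path.

import pandas as pd

# Arrow-backed string column, roughly as the pyarrow engine would hand it over.
ser = pd.Series(["2000-01-03", "2000-01-04"], dtype="string[pyarrow]")

# Materialize to NumPy (object dtype), then parse -- this extra conversion
# is the performance concern being raised here.
parsed = pd.to_datetime(ser.to_numpy())
print(parsed.dtype)  # datetime64[ns], i.e. no longer Arrow-backed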
OK, I went back and checked the output, and it looks right except for the one case where parse_dates is a dict that does a no-op, i.e. it maps the column to itself (see the illustrative snippet after the REPL output below). This is a very uncommon case, though (I'm not sure why you'd want to do that instead of passing a list).
Do you think it'd be better to fix the root cause (#52545) than to special-case here? (I can try to take a look at that sometime soon.)
Here's what I get in the REPL, btw.
>>> import pandas as pd
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow")
index A B C D
0 2000-01-03 0.980269 3.685731 -0.364217 -1.159738
1 2000-01-04 1.047916 -0.041232 -0.161812 0.212549
2 2000-01-05 0.498581 0.731168 -0.537677 1.346270
3 2000-01-06 1.120202 1.567621 0.003641 0.675253
4 2000-01-07 -0.487094 0.571455 -1.611639 0.103469
5 2000-01-10 0.836649 0.246462 0.588543 1.062782
6 2000-01-11 -0.157161 1.340307 1.195778 -1.097007
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow")
index A B C D
0 2000-01-03 00:00:00 0.980269 3.685731 -0.364217 -1.159738
1 2000-01-04 00:00:00 1.047916 -0.041232 -0.161812 0.212549
2 2000-01-05 00:00:00 0.498581 0.731168 -0.537677 1.346270
3 2000-01-06 00:00:00 1.120202 1.567621 0.003641 0.675253
4 2000-01-07 00:00:00 -0.487094 0.571455 -1.611639 0.103469
5 2000-01-10 00:00:00 0.836649 0.246462 0.588543 1.062782
6 2000-01-11 00:00:00 -0.157161 1.340307 1.195778 -1.097007
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow").dtypes # Default, OK
index timestamp[s][pyarrow]
A double[pyarrow]
B double[pyarrow]
C double[pyarrow]
D double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates=["index"]).dtypes # Parse dates as list OK
index timestamp[s][pyarrow]
A double[pyarrow]
B double[pyarrow]
C double[pyarrow]
D double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates=False).dtypes # The bug I was talking about, no way to disable dates from being parsed
index timestamp[s][pyarrow]
A double[pyarrow]
B double[pyarrow]
C double[pyarrow]
D double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates={"a": ["index"]}) # Buggy, returns datetime64[ns]
a datetime64[ns]
A double[pyarrow]
B double[pyarrow]
C double[pyarrow]
D double[pyarrow]
dtype: object
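For completeness, this is the "dict that maps the column to itself" shape mentioned above (hypothetical call against the same test file; output not verified here):

import pandas as pd

# No-op dict mapping: the key equals the single source column, so nothing is
# combined or renamed -- the uncommon shape that, per the buggy case above,
# would still go through the datetime64[ns] path instead of staying
# Arrow-backed.
df = pd.read_csv(
    "pandas/tests/io/data/csv/test1.csv",
    engine="pyarrow",
    dtype_backend="pyarrow",
    parse_dates={"index": ["index"]},
)
print(df.dtypes)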
@phofl OK to punt on this and fix the remaining issues in a follow-up?
thx @lithomas1 can you open an issue about the remaining case?
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.