Add support for fetching and ingesting Arrow Table/RecordBatch data #375
Comments
Thanks, @mauropagano. We will look into this and get back to you.
Tagging @aosingh who is exploring the space (e.g. he created apache/arrow-nanoarrow#502) and working with us on what python-oracledb could do. @mauropagano if you have pseudocode of end-to-end uses of this datatype, and how it would plug into Pandas, SQLAlchemy etc. (or suggestions of what would need to be done to such libraries), please share.
Thanks @cjbj. These are a couple of quick pseudo-code examples that come to mind, but really this feature would save time and memory in almost all use cases where data starts or lands in Oracle and any of the modern Python data processing/science libraries are in the mix. The main goal of this ER, in my mind, is to bypass going through Python types only to immediately convert into Arrow (or vice versa); that round trip takes a significant toll on performance and memory for no real benefit. Some of the examples will look a bit clunky because, as of now, there is not yet a zero-copy way to create a pandas df from Arrow(1), I believe.

(1) Lots of ways have already been added in pandas 2 to go from at-rest formats to pandas with an Arrow backend; you just can't yet "map" a df on top of that Arrow table.

The use cases I have in mind (sketched in the snippet below):
- Extracting data from Oracle to pandas because you need to apply fancy lib X that requires pandas: in batches, or all at once.
- Ingesting data into Oracle: in batches (e.g. from Parquet), or all at once.
- Ingesting from other libraries with an Arrow backend, e.g. polars.
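Rough sketches of what these could look like (the fetch_as_arrow() and Arrow-aware executemany() calls are the proposed API from this ER, not something that exists today; table names, bind counts and the process() helper are placeholders):

```python
import pyarrow.parquet as pq
import polars as pl

# Extract: Oracle -> Arrow -> pandas, all at once (proposed fetch_as_arrow() API)
arrow_table = cursor.execute("select * from sales").fetch_as_arrow()
df = arrow_table.to_pandas()            # hand the data to a pandas-only library

# Extract in batches, to keep memory bounded (batch_size is hypothetical)
for batch in cursor.execute("select * from sales").fetch_as_arrow(batch_size=100_000):
    process(batch.to_pandas())          # process() is a placeholder

# Ingest: Parquet -> Arrow -> Oracle, batch by batch
for batch in pq.ParquetFile("sales.parquet").iter_batches():
    cursor.executemany("insert into sales values (:1, :2, :3)", batch)

# Ingest from another Arrow-backed library, e.g. polars
pl_df = pl.DataFrame({"id": [1, 2], "amount": [10.5, 20.0]})
cursor.executemany("insert into t values (:1, :2)", pl_df.to_arrow())
```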
Hope it helps; if there is anything else I can provide, just let me know.
Note: Arrow is not the only format for storing big tables in columnar form. I think what would generally be more useful is a SQL*Plus (and/or sqlcli) command to upload or download files as BLOB data between the client and the DB.
@cjbj I was wondering if you have made a decision about supporting Arrow or not.
@mauropagano work is ongoing. There are some technical and legal things to be done before we can release anything.
An update on this project: thanks to @aosingh, we are well on track for the next release of python-oracledb to have support for querying data into Python DataFrame Interchange Protocol format. (Inserting into the DB is a future task.) Comments welcome. (Tagging @mauropagano) Examples might be like the following (subject to last minute changes!). To use data in Pandas:
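A sketch of roughly what that looks like with the connection.fetch_df_all() API discussed later in the thread (connection details and the SQL are placeholders):

```python
import oracledb
import pyarrow

# credentials and DSN are placeholders
connection = oracledb.connect(user="hr", password="welcome", dsn="localhost/orclpdb1")

# Fetch the query results directly into an OracleDataFrame
odf = connection.fetch_df_all(statement="select * from employees", arraysize=1000)

# Build a pyarrow Table from the data frame's columns, then convert to pandas
table = pyarrow.Table.from_arrays(odf.column_arrays(), names=odf.column_names())
df = table.to_pandas()
```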
and to write in Parquet file format
We'll do some more testing before pushing anything to GitHub.
Thanks @cjbj, this is really exciting! A few questions based on the examples:
Again, this is really exciting, thanks for making it happen!
@mauropagano They return OracleDataFrame, which is an Oracle implementation of the DataFrame class shown in https://data-apis.org/dataframe-protocol/latest/API.html. The current docstrings describe fetch_df_all() and fetch_df_batches().
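Roughly, the signatures look like this (an approximation of the released API, not the verbatim docstrings; the parameter defaults are assumptions):

```python
def fetch_df_all(self, statement, parameters=None, arraysize=None):
    """Fetch all rows of the query as a single OracleDataFrame."""

def fetch_df_batches(self, statement, parameters=None, size=None):
    """Fetch rows of the query in batches, yielding an OracleDataFrame per batch."""
```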
We'd be very happy to hear suggestions. I'm tempted to designate the support as "preview", which will give us scope to break the API if better ideas come up during the early days. I don't have a timeframe for INSERT support. I'd like it to be soon, but let us get this release out before we prioritize all the tasks to do.
Thanks for sharing the docstrings. I believe these two operations cover what most people would like to do here. Is this going to work in both thin and thick mode? Out of curiosity, do you have any very-rough-and-subject-to-change numbers around speedup and memory footprint changes?
It will work in Thin & Thick. Performance is being checked; we have one area of improvement that needs further investigation post initial release. I will update my previous comments and change them to use the current API doc.
zero copy interchange with popular data frame libraries (#375).
@mauropagano, the code has been uploaded. I have initiated a build which you can use for testing once it completes. Let us know any feedback you have!
Quick thing: PEP 604 didn't make it in until 3.10, so annotations using the new union syntax crash on import for versions below 3.10 (illustration below). Reporting it since the wheels go all the way back to 3.8.
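For illustration (the exact annotation in python-oracledb may differ), the failing pattern is a PEP 604 union in an annotation that gets evaluated at definition time:

```python
# Works on Python 3.10+, but raises at import time on 3.8/3.9 with:
#   TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
def fetch_df_all(statement: str, parameters: list | None = None):
    ...

# Typical fixes for older versions: typing.Optional[list], a string
# annotation, or "from __future__ import annotations" at module top.
```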
Thanks, @mauropagano. I changed most of those but apparently missed one. I'll get that corrected!
Thanks @anthony-tuininga, just let me know when you get a build going so I can follow that (I can test more on older versions).
The patches have been uploaded (including support for asyncio with data frames) and a new build has been initiated.
Thanks @anthony-tuininga, will test as soon as it's out. Out of curiosity, are the fetch_df_* methods on the connection rather than on the cursor for a particular reason?
The primary reason is that in my understanding it doesn't make sense to have a cursor which allows for fetching rows natively, then swaps over to fetching Arrow data, then swaps back to fetching rows natively. My understanding is that if you want a dataframe, you want a dataframe containing all of the rows and not just some of them. Is there a legitimate need for swapping back and forth between data frames and native rows on the same cursor?
IMO no, and I agree with you: removing the option to even open that door is a good decision.
Just tested and it works great! Btw, there is a comment in 5.1.9.2 that mentions zero copy, but from memory_profiler (the SQL is just a simple select) the conversion doesn't look zero copy for me.
I vaguely recall that there are circumstances under which zero copy does not happen -- such as when nulls exist, if I remember correctly. @aosingh would know better but he is away at the moment. Glad to hear it is working well for you, though!
@mauropagano what were your memory & perf stats like compared with getting the same dataframe format in python-oracledb 2.5.1?
This sounds cool. At the SQL*Net level, the data is probably still transferred the old-fashioned way. I wonder if a future SQL*Net version could support a column-based protocol as well.
I will see what the current thoughts are around this, particularly re VECTORS. The current protocol is very efficient for what it does now. Overall the data size would presumably be roughly the same, since it's the same data. But maybe there could be time benefits in some cases, e.g. when the data is stored in columns. There are a lot of DB layers, disk formats, Exadata storage cell magic things, etc. going on past the SQL*Net layer.
About double the memory trying to get to pandas off … (as in my previous post, I believe). Perf wise it's about 50% slower, 0.6 secs for …. Btw, outputs are not equivalent in the case of NULLs, ….
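For context, the two paths being compared are roughly these (a sketch; the actual SQL and the measurement code are not shown here):

```python
import pandas
import pyarrow

# Old path (python-oracledb 2.5.1): fetch native Python rows, then build the DataFrame
cursor.execute("select * from big_table")
old_df = pandas.DataFrame(cursor.fetchall(),
                          columns=[d[0] for d in cursor.description])

# New path (3.0): fetch an OracleDataFrame and hand its columns to pyarrow/pandas
odf = connection.fetch_df_all(statement="select * from big_table")
new_df = pyarrow.Table.from_arrays(odf.column_arrays(),
                                   names=odf.column_names()).to_pandas()
```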
@mauropagano Thank you for trying out the new data frame support. I would like to answer some questions raised in the thread. Hopefully, this helps:
import pyarrow

# odf is the OracleDataFrame returned by connection.fetch_df_all()
pyarrow_table = pyarrow.Table.from_arrays(
    arrays=odf.column_arrays(), names=odf.column_names()
)
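From there, converting onward to pandas is a one-liner, though as noted above it may copy depending on dtypes and nulls:

```python
df = pyarrow_table.to_pandas()   # pyarrow_table built as in the snippet above
```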
@aosingh thanks so much for sharing additional info on the implementation!
This is really exciting!
Why …?
From the pandas API (here), I would expect the DF to be converted via Arrow.
NVM the question on ….
@aosingh asking only since you mentioned it, how much do you want to support Polars here? Straight conversion from the OracleDataFrame via polars' from_dataframe doesn't seem to work for me.
You are right, ….
Yes, correct. It seems polars' implementation of from_dataframe doesn't support the ….
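A route that should work in the meantime is going through pyarrow and polars.from_arrow() (a sketch, reusing the pyarrow_table built in the earlier snippet):

```python
import polars as pl

pl_df = pl.from_arrow(pyarrow_table)   # polars DataFrame from the Arrow table
```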
We should review & update the examples.
This has been included in python-oracledb 3.0.0, which was just released.
Although we closed the issue, work will continue in this area.
And some of that data frame optimization work has already been pushed to GitHub. A package build has been initiated here. Anyone playing with Connection.fetch_df_all(), Connection.fetch_df_batches(), AsyncConnection.fetch_df_all(), or AsyncConnection.fetch_df_batches() might want to grab this new development build.
Thanks @cjbj, @anthony-tuininga and @aosingh for making it happen! I've already pulled 3.0 and started using it! These new changes will go in 3.0.1, right?
Yes.
Just want to say THANK YOU to the team who worked on this!! I'm seeing, on average, a 60% speed up in my extract scripts compared with the previous approach. With ….
Glad to hear that it is working well for you! Can you share the method you use for converting to string with …?
@dev-named-jay can you open a new issue (as an enhancement) so we can track & discuss your type mapping request? Is there anything else you think would enhance interaction with data frame libraries?
Reporting some results now that I was able to test this out in the wild:
In general, this change enabled some serious performance improvements for me. |
@mauropagano thanks for sharing. Those are impressive improvements!
It's hard to deny that Arrow has become the de-facto standard for in-memory representation of tabular data.
Multiple competing products (Snowflake, BigQuery, etc.) as well as libraries (ADBC, TurboODBC) enable extracting data as Arrow, unlocking better performance, lower memory usage and increased consistency/interoperability between platforms.
In the Python space, there are a number of libraries, especially around data processing, data science and ML (pandas, polars, duckdb, etc.), that work on Arrow data natively, normally via zero-copy.
It would be extremely beneficial to have python-oracledb be able to:

- Fetch query results directly as Arrow, e.g. cursor.fetch_as_arrow(), returning either an Arrow Table or RecordBatch. This method could bypass a Python representation, speed up data extraction and ultimately keep Oracle closer to where some of the processing occurs.
- Ingest Arrow data directly, e.g. cursor.executemany("...", arrow_object), which could skip the Python representation, use less memory and ultimately entice users to rely more on Oracle, both for processing that works better in-db and for storing data produced elsewhere.