Allow querying the bigframes dataframe with custom BigQuery SQL like pandasql #647
It is an interesting request. Maybe we could just inject the df table into the SQL statements? But one counterargument: if users want to use SQL, why don't they just use SQL directly with BigQuery? @tswast @TrevorBergeron WDYT?
I was just talking with @shanecglass and @aliciawilliams about the potential for a similar feature. I was thinking of providing a SQL engine to our equivalent of https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eval.html, but mimicking the pandasql interface could be even more natural. It looks like pandasql is MIT licensed (https://github.com/yhat/pandasql/blob/master/LICENSE.txt), so we could pull some of that in.
As an alternative, I've just mailed #1293, which will add a `sql_scalar` function. Usage example:

```python
import bigframes.pandas as bpd
import bigframes.bigquery as bbq
import pandas as pd
import pyarrow as pa

s = bpd.Series(["1.5", "2.5", "3.5"])
s = s.astype(pd.ArrowDtype(pa.decimal128(38, 9)))
bbq.sql_scalar("ROUND({0}, 0, 'ROUND_HALF_EVEN')", [s])
```
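For reference, `'ROUND_HALF_EVEN'` is banker's rounding, and Python's standard-library `decimal` module implements the same mode, so the three values above should round to 2, 2 and 4 (a plain-Python illustration, not the actual `sql_scalar` output):

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Same inputs as the bigframes Series above, rounded half-to-even in pure Python.
vals = [Decimal("1.5"), Decimal("2.5"), Decimal("3.5")]
rounded = [v.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) for v in vals]
print(rounded)  # [Decimal('2'), Decimal('2'), Decimal('4')]
```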
I'm hoping to have this feature in our next release. I suspect this will provide a sufficient "escape hatch" for most, but not all, use cases that a pandasql-like API would provide. @NiloFreitas, would that work for you?

Aside: regarding the JSON use case, `bbq.json_extract` is available. Usage example:

```python
import bigframes.pandas as bpd
import bigframes.bigquery as bbq

s = bpd.Series(['{"class": {"students": [{"id": 5}, {"id": 12}]}}'])
bbq.json_extract(s, json_path="$.class")
```
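For comparison, the same JSONPath selection can be mimicked in plain Python with the standard library (illustrative only; BigQuery's JSON functions return a JSON-typed or string result rather than a Python dict):

```python
import json

s = '{"class": {"students": [{"id": 5}, {"id": 12}]}}'
# JSONPath "$.class" selects the top-level "class" member of the document.
extracted = json.loads(s)["class"]
print(extracted)  # {'students': [{'id': 5}, {'id': 12}]}
```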
Somewhat inspired by https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.sql.html#pyspark.sql.SparkSession.sql, I'm thinking about something like a `sql` method on the session. In the first iteration, we can call …
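One way such a first iteration could work, sketched below as plain string substitution: materialize each referenced DataFrame to a (temporary) BigQuery table and splice its table id into the query text before sending it to BigQuery. The `render_sql` helper and the table ids are hypothetical, not an actual bigframes API:

```python
# Hypothetical sketch: substitute each DataFrame's materialized BigQuery
# table id for its bare name in the query's FROM clauses.
def render_sql(query: str, tables: dict) -> str:
    for name, table_id in tables.items():
        query = query.replace(f"FROM {name}", f"FROM `{table_id}`")
    return query

rendered = render_sql(
    "SELECT * FROM df",
    {"df": "my-project.my_dataset._temp_df"},  # illustrative table id
)
print(rendered)  # SELECT * FROM `my-project.my_dataset._temp_df`
```

A real implementation would need proper SQL parsing rather than string replacement, but this shows the intended data flow.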
Feature Request
Suppose I used bigframes and processed my data like this:
```python
res_df = df.assign(pred=input_remote_function.apply(generate_text))
```
And got this output DataFrame `res_df`:
As you can see, the `pred` column is a JSON string.
Describe the solution you'd like
I would like to use BigQuery JSON Functions to explode this column into a column for each JSON attribute.
For example, I could write custom BigQuery SQL to run against this dataframe, like this:

```python
res_df = df.sqldf("SELECT *, JSON_EXTRACT(pred, '$.interpretation') AS interpretation FROM df")
```

And get the following `res_df` as a result:
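For context, pandasql itself handles queries like this by copying the DataFrame into an in-memory SQLite database and running the SQL there. A minimal stdlib-only sketch of that round trip, with the JSON attribute pulled out in Python (the rows and the `interpretation` values are illustrative, not the real `res_df`):

```python
import json
import sqlite3

# Hypothetical rows standing in for res_df; the JSON string mimics the pred column.
rows = [(1, '{"interpretation": "positive"}'), (2, '{"interpretation": "negative"}')]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (id INTEGER, pred TEXT)")
con.executemany("INSERT INTO df VALUES (?, ?)", rows)

# Select via SQL, then extract the attribute the JSON_EXTRACT path would target.
out = [
    (rid, json.loads(pred)["interpretation"])
    for rid, pred in con.execute("SELECT id, pred FROM df")
]
print(out)  # [(1, 'positive'), (2, 'negative')]
```

The feature request is essentially this workflow, but with the query pushed down to BigQuery instead of a local SQLite copy.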
Describe alternatives you've considered
As an alternative, I need to use remote functions, like this one, to explode the column:
However, I cannot pass multiple parameters to the remote function because of #646.
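Because #646 blocks passing a second parameter, the JSON path has to be baked into each remote function's body, one function per attribute. A minimal sketch of such a single-parameter extractor (plain Python; the function name is illustrative, and in practice it would be registered through bigframes' remote function mechanism):

```python
import json

# Hypothetical single-parameter extractor: the JSON path is hard-coded
# because a second argument (the path) cannot be passed alongside the column.
def extract_interpretation(pred: str) -> str:
    return json.loads(pred)["interpretation"]

print(extract_interpretation('{"interpretation": "positive"}'))  # positive
```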
Additional context
pandasql allows us to query pandas DataFrames using SQL syntax.
Having bigframes support this, with the query pushed down to BigQuery, would be helpful.
https://pypi.org/project/pandasql/