Skip to content

feat: add support for pandas series & data frames as inputs for ml models. #1088

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Oct 23, 2024

Conversation

sycai
Copy link
Contributor

@sycai sycai commented Oct 15, 2024

No description provided.

@sycai sycai requested review from a team as code owners October 15, 2024 22:06
@sycai sycai requested a review from chelsea-lin October 15, 2024 22:06
@product-auto-label product-auto-label bot added the size: l Pull request size is large. label Oct 15, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Oct 15, 2024


def convert_to_dataframe(*input: ArrayType) -> Generator[bpd.DataFrame, None, None]:
def convert_to_dataframe(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we can merge the logic in this file into the core/convert module logic? Ideally we don't have pandas->bigframes logic in two places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Though I think it's not very straightforward to do so. This is mainly because this function returns a Generator, while the one in the core package returns a single value. We will need some extra effort to make everything consistent (function names, parameter types, return types, etc).

Considering that this RP is already not trivial, I think we can for now only focus on the feature delivery. I will migrate the conversion logic in another PR. Does it sound good to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b/373716095 for reference

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can split up the work of course, as long as each step stands alone as an improvement. Not sure how much the generator aspect matters - but unifying the two approaches I think will be a good exercise and result in some improvements to both.

Comment on lines 105 to 106
X: Union[bpd.DataFrame, bpd.Series, pd.DataFrame, pd.Series],
y: Optional[Union[bpd.DataFrame, bpd.Series, pd.DataFrame, pd.Series]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we can define a single annotation representing this set of types and use it everywhere. This will make it easier to accomodate additional types in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is that type aliases may not be expanded/resolved when generating docs, and thus confuse our end users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all to use type alias because Sphinx is able to resolve them

return (_convert_to_dataframe(frame) for frame in input)


def _convert_to_dataframe(frame: ArrayType) -> bpd.DataFrame:
def _convert_to_dataframe(frame: InputArrayType) -> bpd.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also handle array-like data like numpy arrays or even plain python list/tuples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the effort should be trivial if we are to consolidate the conversion functions from the ml package to the core package. I can handle this in another CL as what is proposed above. Let me know your thoughts.

@sycai sycai requested a review from TrevorBergeron October 15, 2024 22:48
@@ -55,44 +63,16 @@ def _convert_to_series(frame: ArrayType) -> bpd.Series:
return typing.cast(bpd.Series, frame[label])
if isinstance(frame, bpd.Series):
return frame
if isinstance(frame, pd.DataFrame):
# Recursively call this method to re-use the length-checking logic
return _convert_to_series(bpd.read_pandas(frame))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might not always want the default session, if the other argument is a bigframes object with a non-default session. the core version uses the session from the co-argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I decided to use the sessions provided from the bqml_model whenever possible. The global session acts as a default.

@sycai sycai requested a review from TrevorBergeron October 16, 2024 21:36
"""Predict the result from input DataFrame.

Args:
X (bigframes.dataframe.DataFrame or bigframes.series.Series):
X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that fully enumerating the accepted types will be too much once we further extend

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. The Google style prefers type hints over type documents, and it makes more sense. Here I'm just keeping the style consistent.

@sycai sycai enabled auto-merge (squash) October 23, 2024 17:30
@sycai sycai merged commit 30c8883 into main Oct 23, 2024
17 of 23 checks passed
@sycai sycai deleted the b362723869 branch October 23, 2024 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants