-
Notifications
You must be signed in to change notification settings - Fork 49
feat: add support for pandas series & data frames as inputs for ml models. #1088
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
|
||
def convert_to_dataframe(*input: ArrayType) -> Generator[bpd.DataFrame, None, None]: | ||
def convert_to_dataframe( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we can merge the logic in this file into the core/convert module logic? Ideally we don't have pandas->bigframes logic in two places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Though I think it's not very straightforward to do so. This is mainly because this function returns a Generator, while the one in the core
package returns a single value. We will need some extra effort to make everything consistent (function names, parameter types, return types, etc).
Considering that this RP is already not trivial, I think we can for now only focus on the feature delivery. I will migrate the conversion logic in another PR. Does it sound good to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b/373716095 for reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can split up the work of course, as long as each step stands alone as an improvement. Not sure how much the generator aspect matters - but unifying the two approaches I think will be a good exercise and result in some improvements to both.
bigframes/ml/pipeline.py
Outdated
X: Union[bpd.DataFrame, bpd.Series, pd.DataFrame, pd.Series], | ||
y: Optional[Union[bpd.DataFrame, bpd.Series, pd.DataFrame, pd.Series]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we can define a single annotation representing this set of types and use it everywhere. This will make it easier to accomodate additional types in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern is that type aliases may not be expanded/resolved when generating docs, and thus confuse our end users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that sklearn doesn't try to define a set of types in its docs: https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated all to use type alias because Sphinx is able to resolve them
bigframes/ml/utils.py
Outdated
return (_convert_to_dataframe(frame) for frame in input) | ||
|
||
|
||
def _convert_to_dataframe(frame: ArrayType) -> bpd.DataFrame: | ||
def _convert_to_dataframe(frame: InputArrayType) -> bpd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also handle array-like data like numpy arrays or even plain python list/tuples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the effort should be trivial if we are to consolidate the conversion functions from the ml package to the core package. I can handle this in another CL as what is proposed above. Let me know your thoughts.
bigframes/ml/utils.py
Outdated
@@ -55,44 +63,16 @@ def _convert_to_series(frame: ArrayType) -> bpd.Series: | |||
return typing.cast(bpd.Series, frame[label]) | |||
if isinstance(frame, bpd.Series): | |||
return frame | |||
if isinstance(frame, pd.DataFrame): | |||
# Recursively call this method to re-use the length-checking logic | |||
return _convert_to_series(bpd.read_pandas(frame)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might not always want the default session, if the other argument is a bigframes object with a non-default session. the core version uses the session from the co-argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I decided to use the sessions provided from the bqml_model whenever possible. The global session acts as a default.
"""Predict the result from input DataFrame. | ||
|
||
Args: | ||
X (bigframes.dataframe.DataFrame or bigframes.series.Series): | ||
X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that fully enumerating the accepted types will be too much once we further extend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. The Google style prefers type hints over type documents, and it makes more sense. Here I'm just keeping the style consistent.
No description provided.