Skip to content

Preserving dtype for numpy.float32 in Least Angle Regression #20155

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
May 31, 2021

Conversation

takoika
Copy link
Contributor

@takoika takoika commented May 28, 2021

Reference Issues/PRs

This PR is part of #11000 .

What does this implement/fix? Explain your changes.

This PR makes trained coefficients of Least Angle Regression to be numpy.float32 if input data is numpy.float32.

Conventionally when you pass numpy.float32 to Least Angle Regression, trained coefficients turn to be numpy.float64. This is inconsistent.

Any other comments?

We can pass training data x and target y to Least Angle Regressor. Test cases in this PR are that both are numpy.flaot32 or numpy.float64, because it is unclear which type is appropriate when x's type is different from y's type. This is the same with #13243 .

Further test cases may be required because this PR does not cover argument variations.

I used #13303 and #13243 as references to make this.

Copy link
Contributor

@cmarmo cmarmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @takoika for your pull request. A first comment about tests.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. Just a small suggestion for an alternative strategy to set return_dtype.

Also I wonder if we should document this change in the changelog or not. I think we should, despite that it seems like a detail for most users but no strong opinion.

Comment on lines 480 to 483
for input_array in (X, y, Xy, Gram):
if input_array is not None:
return_dtype = input_array.dtype
break
Copy link
Member

@ogrisel ogrisel May 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can pass training data x and target y to Least Angle Regressor. Test cases in this PR are that both are numpy.flaot32 or numpy.float64, because it is unclear which type is appropriate when x's type is different from y's type. This is the same with #13243 .

Maybe the following would feel less arbitrary?

dtypes = set(a.dtype for a in (X, y, Xy, Gram) if a is not None)
if len(dtypes) == 1:
     # use the precision level of input data if it is consistent
     return_dtype = next(iter(dtypes))
else:
     # fallback to double precision otherwise
     return_dtype = np.float64

I am not so sure. Maybe giving the priority to X and Gram over y and Xy would make more sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I which case, I think the tests should be extended to test the case where X.dtype is f32 but y.dtype is f64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this case could happen because when the data is validated, X and y dtypes are matched

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion is better. I have incorporated the suggestion. Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding a case where calling fitmethod with x.dtype=np.float32 and y.dtypes=np.float64, trained coef_.dtype is np.float32. It may not be desirable. The mismatched case would be addressed later PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's the expected outcome. In all estimators where both dtypes are supported, the dtype of y is always matched to the dtype of X.

@thomasjpfan
Copy link
Member

Also I wonder if we should document this change in the changelog or not. I think we should, despite that it seems like a detail for most users but no strong opinion.

I think this PR needs a change log because it does change the behavior of least angle regression.

@takoika Please add an entry to the change log at doc/whats_new/v1.0.rst with tag |Enhancement|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @takoika !

@jeremiedbb jeremiedbb merged commit c8753d4 into scikit-learn:main May 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants