[MRG] Better error message for MiniBatchKMeans partial_fit when number of features changes #12431

Closed

Conversation

jeremiedbb
Member

Fixes #12430.
It's all in the title. I don't think a test is needed for this.
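
The fix is roughly of the following shape (an illustrative sketch, not the actual diff from this PR; the toy class and the `cluster_centers_`-based guard are assumptions about how such a check can be written):

```python
import numpy as np

class _PartialFitSketch:
    """Toy stand-in for the guard added to MiniBatchKMeans.partial_fit."""

    def partial_fit(self, X):
        X = np.asarray(X)
        # On any call after the first, the number of columns must match
        # what the already-fitted centers expect; fail with an explicit
        # message instead of letting euclidean_distances complain later.
        if hasattr(self, "cluster_centers_") and \
                X.shape[1] != self.cluster_centers_.shape[1]:
            raise ValueError(
                "Number of features %d does not match previous data %d."
                % (X.shape[1], self.cluster_centers_.shape[1])
            )
        # Stand-in for the real fitting logic.
        self.cluster_centers_ = np.zeros((3, X.shape[1]))
        return self
```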

@eamanu
Contributor

eamanu commented Oct 22, 2018

Aren't there any tests that assert this error is raised?

@jeremiedbb
Member Author

Yes, they are in the common tests.

@amueller
Member

I don't understand the statement about tests. If this is tested, how did the tests pass before?

@jeremiedbb
Member Author

There is a common test that checks that any estimator with a partial_fit method raises an error when the number of features changes between calls.
An error is actually raised, so the test passes (see #12430). My point was just that the message could be clearer, because the actual message comes either from numpy (a broadcasting failure) or from a function called inside partial_fit, with an error message that refers to the parameters of that function.

My comment about not needing a test was unclear: I meant that we don't need a new test for the change I made, because this behavior is already covered by the common tests.
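
For context, the common test being referred to follows roughly this pattern (an editor's sketch; the real check lives in sklearn/utils/estimator_checks.py and is more involved, and check_partial_fit_n_features is a made-up name):

```python
import numpy as np
import pytest
from sklearn.cluster import MiniBatchKMeans

def check_partial_fit_n_features(estimator):
    rng = np.random.RandomState(0)
    X = rng.rand(50, 10)
    estimator.partial_fit(X)  # first call establishes 10 features
    with pytest.raises(ValueError):
        # A second call with one column dropped must raise *some* error;
        # the message itself is not checked here.
        estimator.partial_fit(X[:, :-1])

check_partial_fit_n_features(MiniBatchKMeans(n_clusters=3))
```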

@amueller
Member

Can we change the common test to also check for a good error message?

@jeremiedbb
Member Author

jeremiedbb commented Oct 23, 2018

> Can we change the common test to also check for a good error message?

I totally agree. Below are the error messages we get when the number of features changes between calls to partial_fit:

MultinomialNB:  "Number of features 9 does not match previous data 10."
BernoulliNB:  "Number of features 9 does not match previous data 10."
Perceptron:  "Number of features 9 does not match previous data 10."
SGDClassifier:  "Number of features 9 does not match previous data 10."
PassiveAggressiveClassifier:  "Number of features 9 does not match previous data 10."
MLPClassifier:  "shapes (100,9) and (10,100) not aligned: 9 (dim 1) != 10 (dim 0)"
SGDRegressor:  "Number of features 9 does not match previous data 10."
MLPRegressor:  "shapes (100,9) and (10,100) not aligned: 9 (dim 1) != 10 (dim 0)"
PassiveAggressiveRegressor:  "Number of features 9 does not match previous data 10."
MiniBatchKMeans:  "Incompatible dimension for X and Y matrices: X.shape[1] == 9 while Y.shape[1] == 10"
Birch:  "Training data and predicted data do not have same number of features."
MiniBatchDictionaryLearning:  "shapes (9,10) and (9,100) not aligned: 10 (dim 1) != 9 (dim 0)"
IncrementalPCA:  "operands could not be broadcast together with shapes (10,) (9,)"
LatentDirichletAllocation:  "The provided data has 9 dimensions while the model was trained with feature size 10."
StandardScaler:  "operands could not be broadcast together with shapes (10,) (9,)"
MinMaxScaler:  "operands could not be broadcast together with shapes (10,) (9,)"
MaxAbsScaler:  "operands could not be broadcast together with shapes (10,) (9,)"

About two thirds have an informative message, because it comes directly from the partial_fit method. For the others, the message comes from a function called inside partial_fit (e.g. MiniBatchKMeans's partial_fit calls euclidean_distances) or even from a numpy broadcasting error.

It would be nice to have a consistent message across all estimators. The way I see it is to add a function in utils/validation, something like check_X_partial_fit, that checks that the number of features is consistent between calls, and to use that check inside all partial_fit methods.

Do you think that's the right way to do it?
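
A rough sketch of what such a helper could look like (check_X_partial_fit is only the name proposed above; this exact function never landed in scikit-learn, and the n_features_ bookkeeping attribute below is hypothetical):

```python
def check_X_partial_fit(X, estimator):
    """Illustrative sketch: raise a consistent error if the number of
    features differs from previous calls to partial_fit."""
    n_features = X.shape[1]
    expected = getattr(estimator, "n_features_", None)  # hypothetical attribute
    if expected is not None and n_features != expected:
        raise ValueError(
            "Number of features %d does not match previous data %d."
            % (n_features, expected)
        )
    estimator.n_features_ = n_features  # remember for the next call
```

Each partial_fit method would then call check_X_partial_fit(X, self) right after validating X.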

@amueller
Member

Yes, consistent messages would be nice and would allow us to test for them.
I'm not entirely sure a helper function is the right way to go if it's a single line, but it would be a way to make sure the message stays consistent.
So maybe implement the helper function and change the common tests to check for it.
It would also be nice if Birch actually showed the numbers in its message.
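
The corresponding change to the common test could then pin the wording, not just the exception type (a sketch assuming the message produced by the helper above; check_partial_fit_n_features_message is a made-up name):

```python
import numpy as np
import pytest

def check_partial_fit_n_features_message(estimator):
    rng = np.random.RandomState(0)
    X = rng.rand(50, 10)
    estimator.partial_fit(X)
    # Estimators that have not adopted the consistent message would
    # fail here, which is exactly what the test is meant to enforce.
    with pytest.raises(ValueError, match="does not match previous data"):
        estimator.partial_fit(X[:, :-1])
```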

@jeremiedbb
Member Author

jeremiedbb commented Oct 26, 2018

@amueller I implemented that in #12465. We can move the discussion there.
I'll close this PR if the other one turns out to be a good solution.

@amueller added the Superseded (PR has been replaced by a newer PR) label on Aug 5, 2019
@jeremiedbb
Member Author

Closing. Superseded by n_features_in.
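
For reference on the superseding mechanism: since scikit-learn 0.23, estimators record the number of features seen at the first fit or partial_fit in an n_features_in_ attribute and validate later calls against it, which yields the consistent message this PR was after. A quick illustration (exact wording depends on the version):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
km = MiniBatchKMeans(n_clusters=3)
km.partial_fit(rng.rand(50, 10))
print(km.n_features_in_)         # 10, recorded at the first call
km.partial_fit(rng.rand(50, 9))  # ValueError: X has 9 features, but
                                 # MiniBatchKMeans is expecting 10 ...
```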

@jeremiedbb jeremiedbb closed this Jul 20, 2020
Labels: module:cluster, Superseded

Linked issue: MiniBatchKMeans partial_fit uninformative error message when number of features changes (#12430)