Skip to content

FIX initialization build on kmedoids #91

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 7, 2021

Conversation

TimotheeMathieu
Copy link
Contributor

Fix typo in cython build initialization.

Answer to #90

Result :

import sklearn, numpy
import sklearn_extra.cluster

# Data set 20news
import sklearn.datasets
X, y = sklearn.datasets.fetch_20newsgroups_vectorized(return_X_y=True)
X, y = sklearn.utils.shuffle((X, y), random_state=1)

# Precompute cosine distance matrix
import sklearn.metrics.pairwise
diss = sklearn.metrics.pairwise.cosine_distances(X)

# run PAM from scikit-learn-extra
ske = sklearn_extra.cluster.KMedoids(20, "precomputed", method="pam", init="build")
ske.fit(diss)
print("n_iter: ", ske.n_iter_)
print("inertia: ",ske.inertia_)
> n_iter:  1
> inertia:  4994.733982868267

@TimotheeMathieu TimotheeMathieu changed the title fix build fix initialization build on kmedoids Feb 3, 2021
@TimotheeMathieu
Copy link
Contributor Author

If there is no remark, I will merge this to fix the bug.

Copy link
Contributor

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @TimotheeMathieu !

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a regression test to test this fix?

@TimotheeMathieu
Copy link
Contributor Author

Noob question: what is a regression test ?
I imagine that you are not talking about linear regression.

@rth
Copy link
Contributor

rth commented Feb 17, 2021

It's a unit test that failed on master and that passes in this PR as a consequence of the underlying issue being fixed. To avoid this issue coming back in the future. For instance the example from the PR desciption would do. In this case, we can try to add it, but it kind of relies on the overall convergence behavior so it might not be very stable with future code changes.

@TimotheeMathieu
Copy link
Contributor Author

Ok, I will add a test with the same example but only the initialization because the fix is only on the build initialization and this is deterministic and should always return the same number.

@TimotheeMathieu
Copy link
Contributor Author

I sub-sampled to 500 to avoid long computational time.

test on master gave inertia >= 250 and 2 unique labels.

@adrinjalali
Copy link
Member

Is the failing documentation build related to this PR? It seems to be taking too long.

@TimotheeMathieu
Copy link
Contributor Author

I don't understand the failing documentation.
Locally on my computer it runs alright, I constructed the doc in under a minute.

@TimotheeMathieu
Copy link
Contributor Author

Update : I just reran the doc in PR #78 , this PR was fine and the doc was alright before and now that I just reran the test without changing the code, the doc failed. I think there is something that changed with circleci that makes it fail but I don't know what.
Do you have any idea @rth ?

@TimotheeMathieu
Copy link
Contributor Author

If no one object, I will merge this PR as it has been approved.
The failing doc is for me an enigma, as it fails on a PR that succeeded previously. Maybe it is a problem with circle ci.

@adrinjalali
Copy link
Member

well, this PR is touching something on the circleci side, and one of those tests is failing, so we should figure that out before merging. One way to diagnose the issue is to replicate the environments locally, and see if some of the tests run with a different speed on different environments (since the error is a timeout)

@TimotheeMathieu
Copy link
Contributor Author

I tried to set up a docker image (docker image circleci/python:3.8) but for a reason or another my docker image does not has access to the internet so I didn't succeed in checking what is going on with circle ci, and I didn't succeed in installing the circleci cli tool meant to run cicleci locally (this tool is quite new from what I gathered).

@rth
Copy link
Contributor

rth commented Apr 7, 2021

Thanks for continuing to work on this package @TimotheeMathieu !

I didn't succeed in installing the circleci cli tool meant to run cicleci locally

yeah that tools is quite cumbersome to use last time I tried.

With CircleCI the easiest is to click on "Restart Job with SSH" https://circleci.com/docs/2.0/ssh-access-jobs/ Then you need to activate the same conda env that's used for tests.

Something like,

~/miniconda/bin/conda init bash
# disconnect, then restart a new SSH connection to take changes into account
conda activate testenv

Here it looks like examples/eigenpro/plot_eigenpro_synthetic.py got a major performance regression, likely due to a change in dependencies,

  • on master in the example we had,
    EigenPro Classification with 20 features in 0.89 seconds. Error: 8.6
    SupportVector Classification with 20 features in 0.13 seconds. Error: 8.0
    
  • in this PR running that examples shows,
    EigenPro Classification with 20 features in 51.09 seconds. Error: 8.6
    SupportVector Classification with 20 features in 0.20 seconds. Error: 8.0
    

Which is likely due to CPU oversubscription: something (likely BLAS, since scikit-learn should already handle this correctly I think) detects 32 CPU available, but only 2 we can actually used, leading to lots of unnecessary thread creation overhead. Setting OMP_NUM_THREADS to 2 should fix it.

@rth
Copy link
Contributor

rth commented Apr 7, 2021

CI is green now, merging. Thanks!

@rth rth changed the title fix initialization build on kmedoids FIX initialization build on kmedoids Apr 7, 2021
@rth rth merged commit 44ac430 into scikit-learn-contrib:master Apr 7, 2021
@TimotheeMathieu TimotheeMathieu deleted the bug_build_pam branch April 8, 2021 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants