Skip to content

ENH Add retry mechanism to fetch_xx functions. #28160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

fkdosilovic
Copy link
Contributor

Reference Issues/PRs

Supersedes #21691. (Addressed the remaining comments #21691 (review))

What does this implement/fix? Explain your changes.

This PR adds n_retries and delay argument to the following functions as mentioned in #21178:

  • fetch_20newsgroups
  • fetch_20newsgroups_vectorized
  • fetch_california_housing
  • fetch_covtype
  • fetch_kddcup99
  • fetch_lfw_pairs
  • fetch_lfw_people
  • fetch_olivetti_faces
  • fetch_rcv1
  • fetch_species_distributions

The only exception is fetch_openml, which was covered in #21397.

Any other comments?

None.

@fkdosilovic fkdosilovic changed the title ENH Add arguments to fetch_xx functions. ENH Add retry download to fetch_xx functions. Jan 17, 2024
Copy link

github-actions bot commented Jan 17, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6547842. Link to the linter CI: here

@fkdosilovic fkdosilovic changed the title ENH Add retry download to fetch_xx functions. ENH Add retry to fetch_xx functions. Jan 17, 2024
@fkdosilovic fkdosilovic changed the title ENH Add retry to fetch_xx functions. ENH Add retry mechanism to fetch_xx functions. Jan 18, 2024
@fkdosilovic
Copy link
Contributor Author

Ping @thomasjpfan and @Micky774 for review.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment otherwise LGTM.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Feb 14, 2024
@lesteve
Copy link
Member

lesteve commented Feb 22, 2024

One inconsistency I see compared to the similar functionality that was added in #21901 for fetch_openml:

I think it would make sense to have delay be a float in this PR.

Also not sure whether that was discussed in related issues/PRs (I looked quickly but could not find any discussion about this), but were there some thoughts about reusing the same code both in this PR and for fetch_openml? I would be completely fine if this was done in a further PR.

@fkdosilovic
Copy link
Contributor Author

I think it would make sense to have delay be a float in this PR.

I'll update the delay parameter to float.

Also not sure whether that was discussed in related issues/PRs (I looked quickly but could not find any discussion about this), but were there some thoughts about reusing the same code both in this PR and for fetch_openml? I would be completely fine if this was done in a further PR.

I do not think there was any discussion about reusing the same code. After finishing this PR I can open another one with the decoupled retry mechanism into a separate function.

cc @lesteve

@fkdosilovic
Copy link
Contributor Author

fkdosilovic commented Feb 22, 2024

Also not sure whether that was discussed in related issues/PRs (I looked quickly but could not find any discussion about this), but were there some thoughts about reusing the same code both in this PR and for fetch_openml? I would be completely fine if this was done in a further PR.

I do not think there was any discussion about reusing the same code. After finishing this PR I can open another one with the decoupled retry mechanism into a separate function.

I jumped the here. What did you mean when you say "reusing the same code both in this PR and for fetch_openml."? Did you mean reusing the _fetch_remote function, where retry and delay are implemented, in fetch_openml?

Also, please look at the fetch_20newsgroups_vectorized because I managed to miss it the first time!

cc @lesteve

@fkdosilovic
Copy link
Contributor Author

fkdosilovic commented Feb 22, 2024

Also pointing out another inconsistency:

In this PR the functions use the following constraints on delay parameter: (Real, 1.0, None, closed="left"), while the fetch_openml function uses (Real, 0.0, None, closed="right"). Should I switch to the latter?

@lesteve
Copy link
Member

lesteve commented Feb 22, 2024

It makes sense to use (Real, 0.0, None, closed="right") I think, since the delay can be between 0 and 1.

What did you mean when you say "reusing the same code both in this PR and for fetch_openml."? Did you mean reusing the _fetch_remote function, where retry and delay are implemented, in fetch_openml?

This was a rather vague idea since I did not look carefully at the code, but it feels like maybe the retry code is very similar so there could be a function that is reused between the two implementations.

@fkdosilovic
Copy link
Contributor Author

fkdosilovic commented Feb 22, 2024

It makes sense to use (Real, 0.0, None, closed="right") I think, since the delay can be between 0 and 1.

Do not see why a delay time between retries should be limited to a maximum of 1 second, or confined to a range between 0 and 1 seconds.

@lesteve
Copy link
Member

lesteve commented Feb 23, 2024

Actually can you use Interval(Real, 0.0, None, closed="neither")?

Sorry I was not explicit enough in my previous message Interval(Real, 0.0, None, closed="neither") means between 0 and +infinity with both 0 and infinity excluded.

The current code [Interval(Real, 1.0, None, closed="left")] means between 1 and +infinity, 1 included. We do want to accept a delay between 0 and +infinity.

@fkdosilovic
Copy link
Contributor Author

Actually can you use Interval(Real, 0.0, None, closed="neither")?

Yes, I believe this is the most sensible solution.

@lesteve
Copy link
Member

lesteve commented Feb 24, 2024

LGTM, thanks!

I fixed a conflict in the changelog and set auto-merge: this will automatically be merged when CI is green.

@lesteve lesteve enabled auto-merge (squash) February 24, 2024 08:39
@lesteve
Copy link
Member

lesteve commented Feb 24, 2024

So the failures are in main too, see #27700 (comment)

@lesteve lesteve merged commit 74d1307 into scikit-learn:main Feb 26, 2024
@glemaitre
Copy link
Member

Apparently we broke something. Since we skip the test for the fetcher we did not catch it.
I'll make a PR to fix it.

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 27, 2024
adrinjalali pushed a commit that referenced this pull request Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:datasets Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants