Skip to content

Conversation

fkdosilovic
Copy link
Contributor

Reference Issues/PRs

Supersedes #21691. (Addressed the remaining comments #21691 (review))

What does this implement/fix? Explain your changes.

This PR adds n_retries and delay argument to the following functions as mentioned in #21178:

  • fetch_20newsgroups
  • fetch_20newsgroups_vectorized
  • fetch_california_housing
  • fetch_covtype
  • fetch_kddcup99
  • fetch_lfw_pairs
  • fetch_lfw_people
  • fetch_olivetti_faces
  • fetch_rcv1
  • fetch_species_distributions

The only exception is fetch_openml, which was covered in #21397.

Any other comments?

None.

@fkdosilovic fkdosilovic changed the title ENH Add arguments to fetch_xx functions. ENH Add retry download to fetch_xx functions. Jan 17, 2024
Copy link

github-actions bot commented Jan 17, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6547842. Link to the linter CI: here

@fkdosilovic fkdosilovic changed the title ENH Add retry download to fetch_xx functions. ENH Add retry to fetch_xx functions. Jan 17, 2024
@fkdosilovic fkdosilovic changed the title ENH Add retry to fetch_xx functions. ENH Add retry mechanism to fetch_xx functions. Jan 18, 2024
@fkdosilovic
Copy link
Contributor Author

Ping @thomasjpfan and @Micky774 for review.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment otherwise LGTM.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Feb 14, 2024
@lesteve
Copy link
Member

lesteve commented Feb 22, 2024

One inconsistency I see compared to the similar functionality that was added in #21901 for fetch_openml:

I think it would make sense to have delay be a float in this PR.

Also not sure whether that was discussed in related issues/PRs (I looked quickly but could not find any discussion about this), but were there some thoughts about reusing the same code both in this PR and for fetch_openml? I would be completely fine if this was done in a further PR.

@fkdosilovic
Copy link
Contributor Author

I think it would make sense to have delay be a float in this PR.

I'll update the delay parameter to float.

Also not sure whether that was discussed in related issues/PRs (I looked quickly but could not find any discussion about this), but were there some thoughts about reusing the same code both in this PR and for fetch_openml? I would be completely fine if this was done in a further PR.

I do not think there was any discussion about reusing the same code. After finishing this PR I can open another one with the decoupled retry mechanism into a separate function.

cc @lesteve

@fkdosilovic
Copy link
Contributor Author

fkdosilovic commented Feb 22, 2024

Also not sure whether that was discussed in related issues/PRs (I looked quickly but could not find any discussion about this), but were there some thoughts about reusing the same code both in this PR and for fetch_openml? I would be completely fine if this was done in a further PR.

I do not think there was any discussion about reusing the same code. After finishing this PR I can open another one with the decoupled retry mechanism into a separate function.

I jumped the here. What did you mean when you say "reusing the same code both in this PR and for fetch_openml."? Did you mean reusing the _fetch_remote function, where retry and delay are implemented, in fetch_openml?

Also, please look at the fetch_20newsgroups_vectorized because I managed to miss it the first time!

cc @lesteve

@fkdosilovic
Copy link
Contributor Author

fkdosilovic commented Feb 22, 2024

Also pointing out another inconsistency:

In this PR the functions use the following constraints on delay parameter: (Real, 1.0, None, closed="left"), while the fetch_openml function uses (Real, 0.0, None, closed="right"). Should I switch to the latter?

@lesteve
Copy link
Member

lesteve commented Feb 22, 2024

It makes sense to use (Real, 0.0, None, closed="right") I think, since the delay can be between 0 and 1.

What did you mean when you say "reusing the same code both in this PR and for fetch_openml."? Did you mean reusing the _fetch_remote function, where retry and delay are implemented, in fetch_openml?

This was a rather vague idea since I did not look carefully at the code, but it feels like maybe the retry code is very similar so there could be a function that is reused between the two implementations.

@fkdosilovic
Copy link
Contributor Author

fkdosilovic commented Feb 22, 2024

It makes sense to use (Real, 0.0, None, closed="right") I think, since the delay can be between 0 and 1.

Do not see why a delay time between retries should be limited to a maximum of 1 second, or confined to a range between 0 and 1 seconds.

@lesteve
Copy link
Member

lesteve commented Feb 23, 2024

Actually can you use Interval(Real, 0.0, None, closed="neither")?

Sorry I was not explicit enough in my previous message Interval(Real, 0.0, None, closed="neither") means between 0 and +infinity with both 0 and infinity excluded.

The current code [Interval(Real, 1.0, None, closed="left")] means between 1 and +infinity, 1 included. We do want to accept a delay between 0 and +infinity.

@fkdosilovic
Copy link
Contributor Author

Actually can you use Interval(Real, 0.0, None, closed="neither")?

Yes, I believe this is the most sensible solution.

@lesteve
Copy link
Member

lesteve commented Feb 24, 2024

LGTM, thanks!

I fixed a conflict in the changelog and set auto-merge: this will automatically be merged when CI is green.

@lesteve lesteve enabled auto-merge (squash) February 24, 2024 08:39
@lesteve
Copy link
Member

lesteve commented Feb 24, 2024

So the failures are in main too, see #27700 (comment)

@lesteve lesteve merged commit 74d1307 into scikit-learn:main Feb 26, 2024
@glemaitre
Copy link
Member

Apparently we broke something. Since we skip the test for the fetcher we did not catch it.
I'll make a PR to fix it.

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 27, 2024
adrinjalali pushed a commit that referenced this pull request Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:datasets Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants