-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MNT Make modules private in sklearn.datasets #15307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MNT Make modules private in sklearn.datasets #15307
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have two options with these functions:
A third option is to have a twenty_newsgroups
folder/module with proper __init__
and those functions as public ones only available under that module. Same for the other one.
@@ -24,7 +24,7 @@ | |||
from matplotlib import pyplot as plt | |||
|
|||
from sklearn.datasets import make_checkerboard | |||
from sklearn.datasets import samples_generator as sg | |||
from sklearn.datasets import _samples_generator as sg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's odd to import a private module in a public example, and then to use a private function (_shuffle
) from that module here. If those functions are useful, they should be public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were using a private function in a public example initially. I would move the shuffle method into the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, I didn't mean that you're doing something odd. I just mean "hah, that's odd, now that we've noticed it, we should fix it" :)
@@ -23,7 +23,7 @@ | |||
from matplotlib import pyplot as plt | |||
|
|||
from sklearn.datasets import make_biclusters | |||
from sklearn.datasets import samples_generator as sg | |||
from sklearn.datasets import _samples_generator as sg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
Thanks @thomasjpfan Regarding They are basic short helpers, it doesn't even make sense to publicly support that kind of utilities, especially since they are not documented at all (how useful is an example if it's using undocumented things?) |
def strip_newsgroup_quoting(text): | ||
""" | ||
Given text in "news" format, strip lines beginning with the quote | ||
characters > or |, plus lines that often introduce a quoted section | ||
(for example, because they contain the string 'writes:'.) | ||
|
||
Parameters | ||
---------- | ||
text : string | ||
The text from which to remove the signature block. | ||
""" | ||
good_lines = [line for line in text.split('\n') | ||
if not _QUOTE_RE.search(line)] | ||
return '\n'.join(good_lines) | ||
|
||
|
||
def strip_newsgroup_footer(text): | ||
""" | ||
Given text in "news" format, attempt to remove a signature block. | ||
|
||
As a rough heuristic, we assume that signatures are set apart by either | ||
a blank line or a line made of hyphens, and that it is the last such line | ||
in the file (disregarding blank lines at the end). | ||
|
||
Parameters | ||
---------- | ||
text : string | ||
The text from which to remove the signature block. | ||
""" | ||
lines = text.strip().split('\n') | ||
for line_num in range(len(lines) - 1, -1, -1): | ||
line = lines[line_num] | ||
if line.strip().strip('-') == '': | ||
break | ||
|
||
if line_num > 0: | ||
return '\n'.join(lines[:line_num]) | ||
else: | ||
return text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized instead of having these here, you can pass remove=('footer', 'quoting')
to fetch_20newsgroups
down in the bottom of the file, and remove these from the example, and simplify the SubjectBodyExtractor
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR updated to reflect this change.
from sklearn.cluster import SpectralCoclustering | ||
from sklearn.metrics import consensus_score | ||
|
||
|
||
def shuffle(data, random_state=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we use sklearn.utils.shuffle
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not think sklearn.utils.shuffle
can be used since this "shuffles" in both dimensions.
I removed the shuffle
function and did the shuffle in a few lines of code directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM now.
CC @NicolasHug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should sg._shuffle be removed now?
LGTM otherwise, Thanks @thomasjpfan
from sklearn.cluster import SpectralBiclustering | ||
from sklearn.metrics import consensus_score | ||
|
||
|
||
def shuffle(data, random_state=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why creating the helper here but not in the other example?
It is still used internally in |
Merging since CI is only failing because of python 3.8 |
Reference Issues/PRs
Partial Addresses #9250
What does this implement/fix? Explain your changes.
species_distributions
andtwenty_newsgroups
were not made private yet because:datasets.species_distributions
containsconstruct_grids
, which is very specific tofetch_species_distributions
. This is used inplot_species_distribution_modeling
andplot_species_kde
.datasets.twenty_newsgroups
containsstrip_newsgroup_quoting
andstrip_newsgroup_footer
, which is specific tofetch_20newsgroups
. They are used inplot_species_distribution_modeling
andplot_species_kde
.We have two options with these functions:
datasets.__init__