MNT Make modules private in sklearn.datasets #15307

Merged: 10 commits into scikit-learn:master on Oct 27, 2019

Member

@thomasjpfan thomasjpfan commented Oct 21, 2019

Reference Issues/PRs

Partially addresses #9250

What does this implement/fix? Explain your changes.

species_distributions and twenty_newsgroups were not made private yet because:

  1. datasets.species_distributions contains construct_grids, which is very specific to fetch_species_distributions. This is used in plot_species_distribution_modeling and plot_species_kde.
  2. datasets.twenty_newsgroups contains strip_newsgroup_quoting and strip_newsgroup_footer, which are specific to fetch_20newsgroups. They are used in plot_column_transformer.

We have two options with these functions:

  1. Make them public by adding them to datasets.__init__
  2. Make them private, and have the examples use private API. i.e:
from sklearn.datasets._twenty_newsgroups import strip_newsgroup_footer

@thomasjpfan
Member Author

CC @NicolasHug @adrinjalali

Member

@adrinjalali adrinjalali left a comment

> We have two options with these functions:

A third option is to have a twenty_newsgroups folder/module with proper __init__ and those functions as public ones only available under that module. Same for the other one.

@@ -24,7 +24,7 @@
from matplotlib import pyplot as plt

from sklearn.datasets import make_checkerboard
-from sklearn.datasets import samples_generator as sg
+from sklearn.datasets import _samples_generator as sg
Member

It's odd to import a private module in a public example, and then to use a private function (_shuffle) from that module here. If those functions are useful, they should be public.

Member Author

We were using a private function in a public example initially. I would move the shuffle method into the example.

Member

I know, I didn't mean that you're doing something odd. I just mean "hah, that's odd, now that we've noticed it, we should fix it" :)

@@ -23,7 +23,7 @@
from matplotlib import pyplot as plt

from sklearn.datasets import make_biclusters
-from sklearn.datasets import samples_generator as sg
+from sklearn.datasets import _samples_generator as sg
Member

same here

Member

NicolasHug commented Oct 21, 2019

Thanks @thomasjpfan

Regarding construct_grids, strip_newsgroup_quoting and strip_newsgroup_footer: I think we should just copy the code into the examples they're used in.

They are short, basic helpers. It doesn't make sense to publicly support that kind of utility, especially since they are not documented at all (how useful is an example if it relies on undocumented things?)
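As a rough illustration of what "copy the code into the examples" could look like for construct_grids, here is a minimal sketch. The function body is reconstructed from memory of the helper and may differ from the library source. The `batch` object is a hypothetical stand-in (a SimpleNamespace) for the Bunch returned by fetch_species_distributions, and its numeric values are made up for the demonstration.

```python
from types import SimpleNamespace

import numpy as np


def construct_grids(batch):
    """Build the x/y coordinate grids described by a species batch.

    Sketch of a helper copied into an example; `batch` only needs the
    attributes used below.
    """
    # Coordinates of the lower-left corner cell
    xmin = batch.x_left_lower_corner + batch.grid_size
    xmax = xmin + (batch.Nx * batch.grid_size)
    ymin = batch.y_left_lower_corner + batch.grid_size
    ymax = ymin + (batch.Ny * batch.grid_size)

    # One coordinate per grid cell along each axis
    xgrid = np.arange(xmin, xmax, batch.grid_size)
    ygrid = np.arange(ymin, ymax, batch.grid_size)
    return xgrid, ygrid


# Hypothetical stand-in for the fetched dataset (values are illustrative)
batch = SimpleNamespace(
    x_left_lower_corner=-10.0,
    y_left_lower_corner=0.0,
    Nx=4,
    Ny=3,
    grid_size=0.5,
)
xgrid, ygrid = construct_grids(batch)
```

Keeping the helper inside the example means the example depends only on the public fetch function and on documented attributes of its return value.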

Comment on lines 49 to 87
def strip_newsgroup_quoting(text):
    """
    Given text in "news" format, strip lines beginning with the quote
    characters > or |, plus lines that often introduce a quoted section
    (for example, because they contain the string 'writes:'.)

    Parameters
    ----------
    text : string
        The text from which to remove the signature block.
    """
    good_lines = [line for line in text.split('\n')
                  if not _QUOTE_RE.search(line)]
    return '\n'.join(good_lines)


def strip_newsgroup_footer(text):
    """
    Given text in "news" format, attempt to remove a signature block.

    As a rough heuristic, we assume that signatures are set apart by either
    a blank line or a line made of hyphens, and that it is the last such line
    in the file (disregarding blank lines at the end).

    Parameters
    ----------
    text : string
        The text from which to remove the signature block.
    """
    lines = text.strip().split('\n')
    for line_num in range(len(lines) - 1, -1, -1):
        line = lines[line_num]
        if line.strip().strip('-') == '':
            break

    if line_num > 0:
        return '\n'.join(lines[:line_num])
    else:
        return text
Member

I just realized that instead of having these here, you can pass remove=('footers', 'quotes') to fetch_20newsgroups at the bottom of the file, remove these from the example, and simplify the SubjectBodyExtractor as well.

Member Author

PR updated to reflect this change.
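To make the effect of the two helpers concrete, here is a self-contained sketch that applies the same quoting and footer heuristics to a made-up post. The `_QUOTE_RE` pattern is an assumption reconstructed from memory (the snippet under review does not show it), the function names are shortened to mark them as local copies, and the sample text is invented.

```python
import re

# Assumed pattern: lines that start a quote or introduce a quoted section.
_QUOTE_RE = re.compile(
    r"(writes in|writes:|wrote:|says:|said:|^In article|^Quoted from|^\||^>)"
)


def strip_quoting(text):
    """Drop quoted lines and the lines that typically introduce them."""
    good_lines = [line for line in text.split("\n")
                  if not _QUOTE_RE.search(line)]
    return "\n".join(good_lines)


def strip_footer(text):
    """Drop everything from the last blank or hyphen-only line onward."""
    lines = text.strip().split("\n")
    for line_num in range(len(lines) - 1, -1, -1):
        if lines[line_num].strip().strip("-") == "":
            break
    if line_num > 0:
        return "\n".join(lines[:line_num])
    return text


# Invented sample post: one attribution line, one quoted line,
# the actual reply, and a signature block.
post = "\n".join([
    "alice@example.com writes:",
    "> Is the dataset public?",
    "Yes, it is.",
    "--",
    "Bob",
])
cleaned = strip_footer(strip_quoting(post))
```

With the loader doing this work, the example only needs a call along the lines of fetch_20newsgroups(subset='train', remove=('footers', 'quotes')); the `remove` parameter accepts any subset of 'headers', 'footers' and 'quotes'.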

from sklearn.cluster import SpectralCoclustering
from sklearn.metrics import consensus_score


def shuffle(data, random_state=None):
Member

can't we use sklearn.utils.shuffle instead?

Member Author

@thomasjpfan thomasjpfan Oct 24, 2019

I do not think sklearn.utils.shuffle can be used here, since this "shuffles" in both dimensions (rows and columns).

I removed the shuffle function and did the shuffle in a few lines of code directly.
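For readers wondering what shuffling "in both dimensions" in a few lines can look like, here is a minimal NumPy sketch. It is an illustration under assumptions, not necessarily the exact code committed in the PR; the array contents and variable names are made up.

```python
import numpy as np

rng = np.random.RandomState(0)

# Small illustrative matrix; the examples use generated bicluster data.
data = np.arange(12).reshape(3, 4)

# Draw independent permutations for the rows and for the columns.
row_idx = rng.permutation(data.shape[0])
col_idx = rng.permutation(data.shape[1])

# Reorder rows first, then columns.
shuffled = data[row_idx][:, col_idx]
```

Keeping row_idx and col_idx around matters in the biclustering examples, because the true row and column labels have to be reordered with the same permutations before scoring.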

Member

@adrinjalali adrinjalali left a comment

Thanks, LGTM now.

@thomasjpfan
Member Author

CC @NicolasHug

@thomasjpfan thomasjpfan added this to the 0.22 milestone Oct 25, 2019
Member

@NicolasHug NicolasHug left a comment

Should sg._shuffle be removed now?

LGTM otherwise, Thanks @thomasjpfan

from sklearn.cluster import SpectralBiclustering
from sklearn.metrics import consensus_score


def shuffle(data, random_state=None):
Member

Why create the helper here but not in the other example?

@thomasjpfan
Member Author

> Should sg._shuffle be removed now?

It is still used internally in _samples_generator.

@NicolasHug NicolasHug changed the title [MRG] Make modules private in sklearn.datasets MNT Make modules private in sklearn.datasets Oct 27, 2019
@NicolasHug NicolasHug merged commit 4a95e33 into scikit-learn:master Oct 27, 2019
@NicolasHug
Member

Merging, since CI is only failing because of Python 3.8.
