MNT Make modules private in sklearn.datasets #15307

thomasjpfan · 2019-10-21T02:43:48Z

Reference Issues/PRs

Partially addresses #9250

What does this implement/fix? Explain your changes.

species_distributions and twenty_newsgroups were not made private yet because:

datasets.species_distributions contains construct_grids, which is very specific to fetch_species_distributions. This is used in plot_species_distribution_modeling and plot_species_kde.
datasets.twenty_newsgroups contains strip_newsgroup_quoting and strip_newsgroup_footer, which are specific to fetch_20newsgroups. They are used in plot_column_transformer.

We have two options with these functions:

Make them public by adding them to datasets.__init__
Make them private, and have the examples use the private API, e.g.:

from sklearn.datasets._twenty_newsgroups import strip_newsgroup_footer

thomasjpfan · 2019-10-21T02:44:09Z

CC @NicolasHug @adrinjalali

adrinjalali

We have two options with these functions:

A third option is to have a twenty_newsgroups folder/module with proper __init__ and those functions as public ones only available under that module. Same for the other one.

adrinjalali · 2019-10-21T13:15:44Z

examples/bicluster/plot_spectral_biclustering.py

@@ -24,7 +24,7 @@
 from matplotlib import pyplot as plt

 from sklearn.datasets import make_checkerboard
-from sklearn.datasets import samples_generator as sg
+from sklearn.datasets import _samples_generator as sg


It's odd to import a private module in a public example, and then to use a private function (_shuffle) from that module here. If those functions are useful, they should be public.

We were using a private function in a public example initially. I would move the shuffle method into the example.

I know, I didn't mean that you're doing something odd. I just mean "hah, that's odd, now that we've noticed it, we should fix it" :)

adrinjalali · 2019-10-21T13:17:56Z

examples/bicluster/plot_spectral_coclustering.py

@@ -23,7 +23,7 @@
 from matplotlib import pyplot as plt

 from sklearn.datasets import make_biclusters
-from sklearn.datasets import samples_generator as sg
+from sklearn.datasets import _samples_generator as sg


NicolasHug · 2019-10-21T13:29:10Z

Thanks @thomasjpfan

Regarding construct_grids, strip_newsgroup_quoting and strip_newsgroup_footer: I think we should just copy the code into the examples they're used in.

They are basic, short helpers. It doesn't make sense to publicly support this kind of utility, especially since they are not documented at all (how useful is an example if it relies on undocumented things?)

adrinjalali · 2019-10-23T14:53:05Z

examples/compose/plot_column_transformer.py

+def strip_newsgroup_quoting(text):
+    """
+    Given text in "news" format, strip lines beginning with the quote
+    characters > or |, plus lines that often introduce a quoted section
+    (for example, because they contain the string 'writes:').
+
+    Parameters
+    ----------
+    text : string
+        The text from which to remove the quoted lines.
+    """
+    good_lines = [line for line in text.split('\n')
+                  if not _QUOTE_RE.search(line)]
+    return '\n'.join(good_lines)
+
+
+def strip_newsgroup_footer(text):
+    """
+    Given text in "news" format, attempt to remove a signature block.
+
+    As a rough heuristic, we assume that signatures are set apart by either
+    a blank line or a line made of hyphens, and that it is the last such line
+    in the file (disregarding blank lines at the end).
+
+    Parameters
+    ----------
+    text : string
+        The text from which to remove the signature block.
+    """
+    lines = text.strip().split('\n')
+    for line_num in range(len(lines) - 1, -1, -1):
+        line = lines[line_num]
+        if line.strip().strip('-') == '':
+            break
+
+    if line_num > 0:
+        return '\n'.join(lines[:line_num])
+    else:
+        return text
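
The hunk above references _QUOTE_RE without showing its definition, so it cannot be run as quoted. The following is a minimal self-contained sketch of the same two helpers. The regular expression here is an assumption chosen to match the docstring (lines starting with > or |, or containing 'writes:' or 'wrote:'); it is not necessarily the pattern used in scikit-learn. The sample post is made up for illustration.

```python
import re

# Assumed pattern (not shown in the diff): quote markers at the start of a
# line, plus phrases that commonly introduce a quoted section.
_QUOTE_RE = re.compile(r"(writes:|wrote:|^\||^>)")


def strip_newsgroup_quoting(text):
    """Drop lines that look like quoted material."""
    good_lines = [line for line in text.split("\n")
                  if not _QUOTE_RE.search(line)]
    return "\n".join(good_lines)


def strip_newsgroup_footer(text):
    """Drop everything from the last blank or all-hyphen line onwards."""
    lines = text.strip().split("\n")
    # Walk backwards until a separator line (blank or only hyphens) is found.
    for line_num in range(len(lines) - 1, -1, -1):
        if lines[line_num].strip().strip("-") == "":
            break
    if line_num > 0:
        return "\n".join(lines[:line_num])
    return text


# Hypothetical post used only to exercise the helpers.
post = "\n".join([
    "Alice writes:",
    "> I think the release is ready.",
    "I disagree, CI is still red.",
    "",
    "--",
    "Bob",
])
body = strip_newsgroup_footer(strip_newsgroup_quoting(post))
```

Note that the footer heuristic cuts at the last separator line only, so the blank line before the hyphen line is kept and body ends with a trailing newline.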


I just realized that instead of having these here, you can pass remove=('footers', 'quotes') to fetch_20newsgroups at the bottom of the file, remove these functions from the example, and simplify the SubjectBodyExtractor as well.

PR updated to reflect this change.
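
For readers who want to see what the suggested call looks like, here is a hedged sketch. fetch_20newsgroups accepts a remove tuple whose documented values are 'headers', 'footers' and 'quotes'. The wrapper name load_cleaned_posts, the constant REMOVE, and the argument values are hypothetical and are not taken from the PR.

```python
# Documented values for `remove` are "headers", "footers" and "quotes".
REMOVE = ("footers", "quotes")


def load_cleaned_posts(categories, subset="train"):
    """Return (texts, targets) with signatures and quoted replies stripped."""
    # Imported here so that defining this helper requires neither
    # scikit-learn nor a network connection; the first call downloads
    # the dataset and caches it locally.
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(
        subset=subset,
        categories=categories,
        remove=REMOVE,
        shuffle=True,
        random_state=1,
    )
    return bunch.data, bunch.target
```

A call such as load_cleaned_posts(["sci.med", "sci.space"]) would then return the cleaned post texts and their integer labels, so the example no longer needs its own copies of the stripping helpers.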

adrinjalali · 2019-10-23T14:53:14Z

examples/bicluster/plot_spectral_coclustering.py

 from sklearn.cluster import SpectralCoclustering
 from sklearn.metrics import consensus_score

+
+def shuffle(data, random_state=None):


Can't we use sklearn.utils.shuffle instead?

I do not think sklearn.utils.shuffle can be used, since this "shuffles" in both dimensions.

I removed the shuffle function and did the shuffle in a few lines of code directly.
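
To illustrate what shuffling "in both dimensions" means, here is a minimal NumPy sketch, not the exact code from the PR. It draws one permutation for the rows and an independent one for the columns, whereas sklearn.utils.shuffle permutes samples along the first axis only. The function name and the toy array are assumptions made for illustration.

```python
import numpy as np


def shuffle_rows_and_columns(data, random_state=0):
    """Permute the rows and columns of a 2D array independently."""
    rng = np.random.RandomState(random_state)
    n_rows, n_cols = data.shape
    row_idx = rng.permutation(n_rows)
    col_idx = rng.permutation(n_cols)
    # Reorder rows first, then columns; the indices are returned so the
    # caller can map results back to the original ordering.
    return data[row_idx][:, col_idx], row_idx, col_idx


data = np.arange(12).reshape(3, 4)
shuffled, row_idx, col_idx = shuffle_rows_and_columns(data)

# Undo the shuffle by applying the inverse permutations.
restored = shuffled[np.argsort(row_idx)][:, np.argsort(col_idx)]
```

Keeping row_idx and col_idx around matters in the biclustering examples, because the known bicluster assignments have to be permuted the same way before they can be compared with the fitted model.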

adrinjalali

Thanks, LGTM now.

thomasjpfan · 2019-10-25T20:08:31Z

CC @NicolasHug

NicolasHug

Should sg._shuffle be removed now?

LGTM otherwise, Thanks @thomasjpfan

NicolasHug · 2019-10-25T20:24:40Z

examples/bicluster/plot_spectral_biclustering.py

 from sklearn.cluster import SpectralBiclustering
 from sklearn.metrics import consensus_score

+
+def shuffle(data, random_state=None):


Why create the helper here but not in the other example?

thomasjpfan · 2019-10-25T21:40:06Z

Should sg._shuffle be removed now?

It is still used internally in _samples_generator.

NicolasHug · 2019-10-27T21:17:44Z

Merging, since CI is only failing because of Python 3.8.

thomasjpfan added 2 commits October 20, 2019 19:48

API Deprecated paths in datasets

5ae5291

Merge remote-tracking branch 'upstream/master' into deprecated_datasets

8aca443

NicolasHug mentioned this pull request Oct 21, 2019

Make all non-canonical modules private? #9250

Closed

adrinjalali reviewed Oct 21, 2019

thomasjpfan added 2 commits October 22, 2019 15:25

Merge remote-tracking branch 'upstream/master' into deprecated_datasets

af9fe3a

DEP Deprecates species_dist and twenty_news

2fecb37

adrinjalali reviewed Oct 23, 2019

thomasjpfan added 3 commits October 24, 2019 13:48

Merge remote-tracking branch 'upstream/master' into deprecated_datasets

39ed242

CLN Address adrins comments

ee0aecd

CLN Address adrins comments

0377410

adrinjalali approved these changes Oct 25, 2019

thomasjpfan added this to the 0.22 milestone Oct 25, 2019

NicolasHug approved these changes Oct 25, 2019

thomasjpfan added 3 commits October 25, 2019 16:35

Merge remote-tracking branch 'upstream/master' into deprecated_datasets

8110d76

CLN Removes shuffle definition

59165d6

Merge remote-tracking branch 'upstream/master' into deprecated_datasets

67f09d3

NicolasHug changed the title ~~[MRG] Make modules private in sklearn.datasets~~ MNT Make modules private in sklearn.datasets Oct 27, 2019

NicolasHug merged commit 4a95e33 into scikit-learn:master Oct 27, 2019
