Skip to content

Add compatibility for tarfile filter in Python <3.12 #31523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

AHB30
Copy link

@AHB30 AHB30 commented Jun 11, 2025

Ensures compatibility across Python versions by conditionally applying filter="data" only when the Python version is >= 3.12.

Prevents TypeError on older versions when using sklearn.datasets.fetch_20newsgroups().

What does this implement/fix?

This update prevents a runtime error when using the dataset loader on Python versions before 3.12, by conditionally applying the filter argument based on sys.version_info. The filter="data" option improves security, and this change ensures that it's used only where supported.

Fixes #31521.

Ensures compatibility across Python versions by conditionally applying filter="data" only when the Python version is >= 3.12 or higher. Prevents TypeError on older versions when using sklearn.datasets.fetch_20newsgroups().
Related to issue scikit-learn#31521 in the original repository.
Copy link

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff check

ruff detected issues. Please run ruff check --fix --output-format=full locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


sklearn/datasets/_twenty_newsgroups.py:28:1: I001 [*] Import block is un-sorted or un-formatted
   |
26 |   # SPDX-License-Identifier: BSD-3-Clause
27 |
28 | / import codecs
29 | | import logging
30 | | import os
31 | | import pickle
32 | | import re
33 | | import shutil
34 | | import tarfile
35 | | from contextlib import suppress
36 | | from numbers import Integral, Real
37 | |
38 | | import sys
39 | | import joblib
40 | | import numpy as np
41 | | import scipy.sparse as sp
42 | |
43 | | from .. import preprocessing
44 | | from ..feature_extraction.text import CountVectorizer
45 | | from ..utils import Bunch, check_random_state
46 | | from ..utils._param_validation import Interval, StrOptions, validate_params
47 | | from . import get_data_home, load_files
48 | | from ._base import (
49 | |     RemoteFileMetadata,
50 | |     _convert_data_dataframe,
51 | |     _fetch_remote,
52 | |     _pkl_filepath,
53 | |     load_descr,
54 | | )
   | |_^ I001
55 |
56 |   logger = logging.getLogger(__name__)
   |
   = help: Organize imports

sklearn/datasets/_twenty_newsgroups.py:88:1: W293 [*] Blank line contains whitespace
   |
86 |         # For more details, see
87 |         # https://docs.python.org/3.9/library/tarfile.html#tarfile.TarFile.extractall
88 |         
   | ^^^^^^^^ W293
89 |         if sys.version_info >= (3, 12):
90 |             fp.extractall(path=target_dir, filter="data")
   |
   = help: Remove whitespace from blank line

Found 2 errors.
[*] 2 fixable with the `--fix` option.

ruff format

ruff detected issues. Please run ruff format locally and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


--- sklearn/datasets/_twenty_newsgroups.py
+++ sklearn/datasets/_twenty_newsgroups.py
@@ -85,7 +85,7 @@
         # Use filter="data" to prevent the most dangerous security issues.
         # For more details, see
         # https://docs.python.org/3.9/library/tarfile.html#tarfile.TarFile.extractall
-        
+
         if sys.version_info >= (3, 12):
             fp.extractall(path=target_dir, filter="data")
         else:

1 file would be reformatted, 922 files already formatted

Generated for commit: 0c882aa. Link to the linter CI: here

@lesteve
Copy link
Member

lesteve commented Jun 11, 2025

Thanks for the PR, can you tackle my comment in #31521 (comment)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TarFile.extractall() got an unexpected keyword argument 'filter'
2 participants