Conversation

Member

@lesteve lesteve commented Aug 27, 2025

  • Going for Python 3.14 seems like the best bet. I get occasional segmentation faults on Python 3.13 when running locally. Judging from the Python traceback, they seem related to warnings.

Edit: a free-threaded Python 3.14 scipy package was added to conda-forge, so the following 2 points are no longer a problem:

github-actions bot commented Aug 27, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 511e811. Link to the linter CI: here

@lesteve lesteve force-pushed the free-threaded-pytest-run-parallel branch from 5e0ee8f to 47b7d33 Compare August 27, 2025 10:43
@@ -1316,6 +1316,7 @@ def _check_stop_words_consistency(estimator):
return estimator._check_stop_words_consistency(stop_words, preprocess, tokenize)


@pytest.mark.thread_unsafe
Member

Do you know why this is needed? I thought that the warnings module was made thread-safe in Python 3.14.

Member Author

I got an error when running the test locally but I did not investigate this one more closely:

❯ PYTHON_GIL=1 pytest sklearn/feature_extraction/tests/test_text.py --parallel-threads 4 --iterations 100 -k inconsistent -vl
===================================================================================================================================================== test session starts ======================================================================================================================================================
platform linux -- Python 3.14.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /home/lesteve/micromamba/envs/py314t/bin/python
cachedir: .pytest_cache
rootdir: /home/lesteve/dev/alt-scikit-learn
configfile: pyproject.toml
plugins: run-parallel-0.6.1
collected 131 items / 130 deselected / 1 selected                                                                                                                                                                                                                                                                              
Collected 128 items to run in parallel

sklearn/feature_extraction/tests/test_text.py::test_vectorizer_stop_words_inconsistent PARALLEL FAILED                                                                                                                                                                                                                   [100%]

============================================================================================================================================================ ERRORS ============================================================================================================================================================
___________________________________________________________________________________________________________________________________ ERROR at call of test_vectorizer_stop_words_inconsistent ___________________________________________________________________________________________________________________________________

    def test_vectorizer_stop_words_inconsistent():
        lstr = r"\['and', 'll', 've'\]"
        message = (
            "Your stop_words may be inconsistent with your "
            "preprocessing. Tokenizing the stop words generated "
            "tokens %s not in stop_words." % lstr
        )
        for vec in [CountVectorizer(), TfidfVectorizer(), HashingVectorizer()]:
            vec.set_params(stop_words=["you've", "you", "you'll", "AND"])
>           with pytest.warns(UserWarning, match=message):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E           Failed: DID NOT WARN. No warnings of type (<class 'UserWarning'>,) were emitted.
E            Emitted warnings: [].

lstr       = "\\['and', 'll', 've'\\]"
message    = "Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens \\['and', 'll', 've'\\] not in stop_words."
vec        = HashingVectorizer(stop_words=["you've", 'you', "you'll", 'AND'])

sklearn/feature_extraction/tests/test_text.py:1329: Failed
---------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout setup -----------------------------------------------------------------------------------------------------------------------------------------------------
I: Seeding RNGs with 1478313413
************************************************************************************************************************************************** pytest-run-parallel report **************************************************************************************************************************************************
3 tests were not run in parallel because of use of thread-unsafe functionality, to list the tests that were not run in parallel, re-run while setting PYTEST_RUN_PARALLEL_VERBOSE=1 in your shell environment
=================================================================================================================================================== short test summary info ====================================================================================================================================================
PARALLEL FAILED sklearn/feature_extraction/tests/test_text.py::test_vectorizer_stop_words_inconsistent - Failed: DID NOT WARN. No warnings of type (<class 'UserWarning'>,) were emitted.
======================================================================================================================================== 130 deselected, 860 warnings, 1 error in 1.84s ========================================================================================================================================

Member Author

@lesteve lesteve Aug 29, 2025

Looking into it a bit, the test passes if I add a uuid.uuid1() or threading.get_ident() to the warning message. Maybe a bug in pytest or in the default warnings "once" strategy 🤔
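
One possible reading of why a unique token in the message changes the outcome: under Python's `"default"` filter action, a given message is reported only once per source location, so identical messages are de-duplicated while unique ones are not. The sketch below is a single-threaded illustration using only the standard library. `emit` and `count_recorded` are hypothetical helpers, the message text is shortened, and the sketch does not reproduce the thread-ordering problem itself.

```python
import uuid
import warnings


def emit(unique):
    # Hypothetical helper: always warns from the same source line.
    message = "stop_words may be inconsistent"
    if unique:
        # A per-call token makes every message text distinct.
        message += f" ({uuid.uuid1()})"
    warnings.warn(message, UserWarning)


def count_recorded(unique, n_calls=3):
    # Record warnings under the "default" action, which reports the
    # first occurrence of each (message, category, line) combination.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("default")
        for _ in range(n_calls):
            emit(unique)
    return len(caught)


identical = count_recorded(unique=False)
distinct = count_recorded(unique=True)
print(identical, distinct)
```

With identical messages only the first call is recorded, whereas each uniquely tagged message is recorded separately, which is consistent with the observation above.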

Member

Actually, the thread-safety semantics of the warnings module depend on flags whose default values differ between free-threaded and regular builds:

https://docs.python.org/3.14/whatsnew/3.14.html#free-threaded-mode

Member Author

@lesteve lesteve Aug 29, 2025

Yes I have seen this, but I don't have a good understanding of the implications yet ... it seems that's at least part of the reason behind the remaining failures but this needs more investigation.

@@ -1329,18 +1329,19 @@ def test_vectorizer_stop_words_inconsistent():
vec.fit_transform(["hello world"])
# reset stop word validation
del vec._stop_words_id
assert _check_stop_words_consistency(vec) is False
with pytest.warns(UserWarning, match=message):
Member Author

@lesteve lesteve Sep 1, 2025

This is the actual fix, see previous discussion in #32023 (comment).

My guess right now is that the first call to _check_stop_words_consistency was raising a warning outside of any pytest.warns block, and that the next pytest.warns was sometimes failing because of the default warnings "once" strategy combined with some unfavourable thread ordering. By putting this call inside pytest.warns, we avoid the side effect and the warning is always captured.
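
The pattern of the fix can be sketched with the standard library alone. This is a minimal illustration, not the actual test: `check_consistency` is a hypothetical stand-in for `_check_stop_words_consistency`, and `warnings.catch_warnings(record=True)` with the `"always"` action is used here as an assumed stand-in for `pytest.warns`.

```python
import warnings


def check_consistency():
    # Hypothetical stand-in: warns as a side effect and returns False.
    warnings.warn("Your stop_words may be inconsistent", UserWarning)
    return False


def run_check_and_capture():
    # Make the warning-emitting call inside the recording context, so
    # that no warning is issued outside of it as a side effect.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = check_consistency()
    return result, [str(w.message) for w in caught]


result, messages = run_check_and_capture()
print(result, messages)
```

Checking the return value and the warning in the same block means every call that can warn is covered by the recording context.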
