Fix Add _indic_tokenizer for Indic support #31130


Open
wants to merge 2 commits into main

Conversation

andrelrodriguess

Reference Issues/PRs

Fixes #30935.

What does this implement/fix? Explain your changes.

This implements a new _indic_tokenizer function to add support for tokenizing Indic languages in scikit-learn. Issue #30935 reported that the existing tokenization tools lack proper handling for the scripts used in languages such as Hindi, Tamil, and Bengali, which require specific segmentation rules because of their syllabic nature.

The changes include:

  • Added _indic_tokenizer in sklearn/feature_extraction/text.py, which uses syllable-based segmentation tailored for Indic scripts (a rough sketch of the idea is shown after this list).
  • Updated the unit tests in tests/test_text.py to include cases for Hindi and Tamil text, ensuring the tokenizer correctly splits words and handles edge cases (e.g., conjunct consonants).
  • Ensured compatibility with existing tokenization pipelines by making _indic_tokenizer an optional parameter in relevant classes.
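
For illustration only, here is a rough sketch of the problem and of one possible tokenizer of this kind. It is not the PR's actual code: the indic_tokenizer function and the regular expression below are hypothetical, and the PR's real implementation uses syllable-based segmentation.

import re

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sketch only, not the PR's actual _indic_tokenizer.
# The default token_pattern r"(?u)\b\w\w+\b" tends to split Hindi or Tamil
# text at combining vowel signs and viramas, because Python's `re` module
# does not count those marks as word characters.
print(CountVectorizer().build_tokenizer()("नमस्ते दुनिया"))
# prints fragmented tokens rather than the two words

# One possible direction: keep runs of word characters together with the
# Indic combining marks (U+0900 to U+0DFF spans the major Indic script blocks).
_indic_re = re.compile(r"[\w\u0900-\u0DFF]+")


def indic_tokenizer(text):
    """Split text on anything outside word characters and Indic marks."""
    return _indic_re.findall(text)


vectorizer = CountVectorizer(tokenizer=indic_tokenizer, token_pattern=None)
print(vectorizer.build_analyzer()("नमस्ते दुनिया"))
# ['नमस्ते', 'दुनिया']

Passing token_pattern=None simply silences the warning that the pattern is ignored once a custom tokenizer is supplied.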


github-actions bot commented Apr 2, 2025

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff format

ruff detected issues. Please run ruff format locally and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.0.


--- benchmarks/bench_hist_gradient_boosting_adult.py
+++ benchmarks/bench_hist_gradient_boosting_adult.py
@@ -46,7 +46,7 @@
     toc = time()
     roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
     acc = accuracy_score(target_test, predicted_test)
-    print(f"predicted in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")
+    print(f"predicted in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc:.4f}")
 
 
 data = fetch_openml(data_id=179, as_frame=True)  # adult dataset

--- benchmarks/bench_hist_gradient_boosting_higgsboson.py
+++ benchmarks/bench_hist_gradient_boosting_higgsboson.py
@@ -74,7 +74,7 @@
     toc = time()
     roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
     acc = accuracy_score(target_test, predicted_test)
-    print(f"predicted in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")
+    print(f"predicted in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc:.4f}")
 
 
 df = load_data()

--- build_tools/get_comment.py
+++ build_tools/get_comment.py
@@ -55,9 +55,7 @@
     if end not in log:
         return ""
     res = (
-        "-----------------------------------------------\n"
-        f"### {title}\n\n"
-        f"{message}\n\n"
+        f"-----------------------------------------------\n### {title}\n\n{message}\n\n"
     )
     if details:
         res += (

--- examples/applications/plot_species_distribution_modeling.py
+++ examples/applications/plot_species_distribution_modeling.py
@@ -109,7 +109,7 @@
 
 
 def plot_species_distribution(
-    species=("bradypus_variegatus_0", "microryzomys_minutus_0")
+    species=("bradypus_variegatus_0", "microryzomys_minutus_0"),
 ):
     """
     Plot the species distribution.

--- examples/applications/plot_time_series_lagged_features.py
+++ examples/applications/plot_time_series_lagged_features.py
@@ -265,7 +265,7 @@
     time = cv_results["fit_time"]
     scores["fit_time"].append(f"{time.mean():.2f} ± {time.std():.2f} s")
 
-    scores["loss"].append(f"quantile {int(quantile*100)}")
+    scores["loss"].append(f"quantile {int(quantile * 100)}")
     for key, value in cv_results.items():
         if key.startswith("test_"):
             metric = key.split("test_")[1]

--- examples/applications/plot_topics_extraction_with_nmf_lda.py
+++ examples/applications/plot_topics_extraction_with_nmf_lda.py
@@ -50,7 +50,7 @@
 
         ax = axes[topic_idx]
         ax.barh(top_features, weights, height=0.7)
-        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
+        ax.set_title(f"Topic {topic_idx + 1}", fontdict={"fontsize": 30})
         ax.tick_params(axis="both", which="major", labelsize=20)
         for i in "top right left".split():
             ax.spines[i].set_visible(False)

--- examples/ensemble/plot_bias_variance.py
+++ examples/ensemble/plot_bias_variance.py
@@ -177,8 +177,8 @@
 
     plt.subplot(2, n_estimators, n_estimators + n + 1)
     plt.plot(X_test, y_error, "r", label="$error(x)$")
-    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$"),
-    plt.plot(X_test, y_var, "g", label="$variance(x)$"),
+    (plt.plot(X_test, y_bias, "b", label="$bias^2(x)$"),)
+    (plt.plot(X_test, y_var, "g", label="$variance(x)$"),)
     plt.plot(X_test, y_noise, "c", label="$noise(x)$")
 
     plt.xlim([-5, 5])

--- examples/linear_model/plot_tweedie_regression_insurance_claims.py
+++ examples/linear_model/plot_tweedie_regression_insurance_claims.py
@@ -606,8 +606,9 @@
             "predicted, frequency*severity model": np.sum(
                 exposure * glm_freq.predict(X) * glm_sev.predict(X)
             ),
-            "predicted, tweedie, power=%.2f"
-            % glm_pure_premium.power: np.sum(exposure * glm_pure_premium.predict(X)),
+            "predicted, tweedie, power=%.2f" % glm_pure_premium.power: np.sum(
+                exposure * glm_pure_premium.predict(X)
+            ),
         }
     )
 

--- examples/manifold/plot_lle_digits.py
+++ examples/manifold/plot_lle_digits.py
@@ -10,7 +10,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 # %%
 # Load digits dataset
 # -------------------

--- examples/manifold/plot_manifold_sphere.py
+++ examples/manifold/plot_manifold_sphere.py
@@ -50,7 +50,7 @@
 t = random_state.rand(n_samples) * np.pi
 
 # Sever the poles from the sphere.
-indices = (t < (np.pi - (np.pi / 8))) & (t > ((np.pi / 8)))
+indices = (t < (np.pi - (np.pi / 8))) & (t > (np.pi / 8))
 colors = p[indices]
 x, y, z = (
     np.sin(t[indices]) * np.cos(p[indices]),

--- examples/model_selection/plot_likelihood_ratios.py
+++ examples/model_selection/plot_likelihood_ratios.py
@@ -40,7 +40,7 @@
 from sklearn.datasets import make_classification
 
 X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)
-print(f"Percentage of people carrying the disease: {100*y.mean():.2f}%")
+print(f"Percentage of people carrying the disease: {100 * y.mean():.2f}%")
 
 # %%
 # A machine learning model is built to diagnose if a person with some given

--- examples/model_selection/plot_roc.py
+++ examples/model_selection/plot_roc.py
@@ -152,9 +152,9 @@
 #
 # We can briefly demo the effect of :func:`numpy.ravel`:
 
-print(f"y_score:\n{y_score[0:2,:]}")
+print(f"y_score:\n{y_score[0:2, :]}")
 print()
-print(f"y_score.ravel():\n{y_score[0:2,:].ravel()}")
+print(f"y_score.ravel():\n{y_score[0:2, :].ravel()}")
 
 # %%
 # In a multi-class classification setup with highly imbalanced classes,
@@ -359,7 +359,7 @@
     plt.plot(
         fpr_grid,
         mean_tpr[ix],
-        label=f"Mean {label_a} vs {label_b} (AUC = {mean_score :.2f})",
+        label=f"Mean {label_a} vs {label_b} (AUC = {mean_score:.2f})",
         linestyle=":",
         linewidth=4,
     )

--- sklearn/_loss/tests/test_loss.py
+++ sklearn/_loss/tests/test_loss.py
@@ -204,7 +204,8 @@
 
 
 @pytest.mark.parametrize(
-    "loss, y_true_success, y_true_fail", Y_COMMON_PARAMS + Y_TRUE_PARAMS  # type: ignore[operator]
+    "loss, y_true_success, y_true_fail",
+    Y_COMMON_PARAMS + Y_TRUE_PARAMS,  # type: ignore[operator]
 )
 def test_loss_boundary_y_true(loss, y_true_success, y_true_fail):
     """Test boundaries of y_true for loss functions."""
@@ -215,7 +216,8 @@
 
 
 @pytest.mark.parametrize(
-    "loss, y_pred_success, y_pred_fail", Y_COMMON_PARAMS + Y_PRED_PARAMS  # type: ignore[operator]
+    "loss, y_pred_success, y_pred_fail",
+    Y_COMMON_PARAMS + Y_PRED_PARAMS,  # type: ignore[operator]
 )
 def test_loss_boundary_y_pred(loss, y_pred_success, y_pred_fail):
     """Test boundaries of y_pred for loss functions."""
@@ -501,12 +503,14 @@
         sample_weight=sample_weight,
         loss_out=out_l1,
     )
-    loss.closs.loss(
-        y_true=y_true,
-        raw_prediction=raw_prediction,
-        sample_weight=sample_weight,
-        loss_out=out_l2,
-    ),
+    (
+        loss.closs.loss(
+            y_true=y_true,
+            raw_prediction=raw_prediction,
+            sample_weight=sample_weight,
+            loss_out=out_l2,
+        ),
+    )
     assert_allclose(out_l1, out_l2)
     loss.gradient(
         y_true=y_true,

--- sklearn/cluster/_feature_agglomeration.py
+++ sklearn/cluster/_feature_agglomeration.py
@@ -6,7 +6,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import numpy as np
 from scipy.sparse import issparse
 

--- sklearn/cross_decomposition/tests/test_pls.py
+++ sklearn/cross_decomposition/tests/test_pls.py
@@ -404,12 +404,12 @@
 
     X_orig = X.copy()
     with pytest.raises(AssertionError):
-        pls.transform(X, Y, copy=False),
+        (pls.transform(X, Y, copy=False),)
         assert_array_almost_equal(X, X_orig)
 
     X_orig = X.copy()
     with pytest.raises(AssertionError):
-        pls.predict(X, copy=False),
+        (pls.predict(X, copy=False),)
         assert_array_almost_equal(X, X_orig)
 
     # Make sure copy=True gives same transform and predictions as predict=False

--- sklearn/datasets/tests/test_openml.py
+++ sklearn/datasets/tests/test_openml.py
@@ -105,9 +105,9 @@
         )
 
     def _mock_urlopen_shared(url, has_gzip_header, expected_prefix, suffix):
-        assert url.startswith(
-            expected_prefix
-        ), f"{expected_prefix!r} does not match {url!r}"
+        assert url.startswith(expected_prefix), (
+            f"{expected_prefix!r} does not match {url!r}"
+        )
 
         data_file_name = _file_name(url, suffix)
         data_file_path = resources.files(data_module) / data_file_name
@@ -156,9 +156,9 @@
         )
 
     def _mock_urlopen_data_list(url, has_gzip_header):
-        assert url.startswith(
-            url_prefix_data_list
-        ), f"{url_prefix_data_list!r} does not match {url!r}"
+        assert url.startswith(url_prefix_data_list), (
+            f"{url_prefix_data_list!r} does not match {url!r}"
+        )
 
         data_file_name = _file_name(url, ".json")
         data_file_path = resources.files(data_module) / data_file_name

--- sklearn/datasets/tests/test_samples_generator.py
+++ sklearn/datasets/tests/test_samples_generator.py
@@ -138,17 +138,17 @@
             signs = signs.view(dtype="|S{0}".format(signs.strides[0])).ravel()
             unique_signs, cluster_index = np.unique(signs, return_inverse=True)
 
-            assert (
-                len(unique_signs) == n_clusters
-            ), "Wrong number of clusters, or not in distinct quadrants"
+            assert len(unique_signs) == n_clusters, (
+                "Wrong number of clusters, or not in distinct quadrants"
+            )
 
             clusters_by_class = defaultdict(set)
             for cluster, cls in zip(cluster_index, y):
                 clusters_by_class[cls].add(cluster)
             for clusters in clusters_by_class.values():
-                assert (
-                    len(clusters) == n_clusters_per_class
-                ), "Wrong number of clusters per class"
+                assert len(clusters) == n_clusters_per_class, (
+                    "Wrong number of clusters per class"
+                )
             assert len(clusters_by_class) == n_classes, "Wrong number of classes"
 
             assert_array_almost_equal(
@@ -412,9 +412,9 @@
     X, y = make_blobs(n_samples=n_samples, n_features=2, random_state=0)
 
     assert X.shape == (sum(n_samples), 2), "X shape mismatch"
-    assert all(
-        np.bincount(y, minlength=len(n_samples)) == n_samples
-    ), "Incorrect number of samples per blob"
+    assert all(np.bincount(y, minlength=len(n_samples)) == n_samples), (
+        "Incorrect number of samples per blob"
+    )
 
 
 def test_make_blobs_n_samples_list_with_centers():
@@ -426,9 +426,9 @@
     )
 
     assert X.shape == (sum(n_samples), 2), "X shape mismatch"
-    assert all(
-        np.bincount(y, minlength=len(n_samples)) == n_samples
-    ), "Incorrect number of samples per blob"
+    assert all(np.bincount(y, minlength=len(n_samples)) == n_samples), (
+        "Incorrect number of samples per blob"
+    )
     for i, (ctr, std) in enumerate(zip(centers, cluster_stds)):
         assert_almost_equal((X[y == i] - ctr).std(), std, 1, "Unexpected std")
 
@@ -441,9 +441,9 @@
     X, y = make_blobs(n_samples=n_samples, centers=centers, random_state=0)
 
     assert X.shape == (sum(n_samples), 2), "X shape mismatch"
-    assert all(
-        np.bincount(y, minlength=len(n_samples)) == n_samples
-    ), "Incorrect number of samples per blob"
+    assert all(np.bincount(y, minlength=len(n_samples)) == n_samples), (
+        "Incorrect number of samples per blob"
+    )
 
 
 def test_make_blobs_return_centers():
@@ -681,9 +681,9 @@
 
 def test_make_moons_unbalanced():
     X, y = make_moons(n_samples=(7, 5))
-    assert (
-        np.sum(y == 0) == 7 and np.sum(y == 1) == 5
-    ), "Number of samples in a moon is wrong"
+    assert np.sum(y == 0) == 7 and np.sum(y == 1) == 5, (
+        "Number of samples in a moon is wrong"
+    )
     assert X.shape == (12, 2), "X shape mismatch"
     assert y.shape == (12,), "y shape mismatch"
 

--- sklearn/ensemble/_bagging.py
+++ sklearn/ensemble/_bagging.py
@@ -3,7 +3,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import itertools
 import numbers
 from abc import ABCMeta, abstractmethod

--- sklearn/ensemble/_forest.py
+++ sklearn/ensemble/_forest.py
@@ -35,7 +35,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import threading
 from abc import ABCMeta, abstractmethod
 from numbers import Integral, Real

--- sklearn/ensemble/tests/test_forest.py
+++ sklearn/ensemble/tests/test_forest.py
@@ -168,11 +168,12 @@
     reg = ForestRegressor(n_estimators=5, criterion=criterion, random_state=1)
     reg.fit(X_reg, y_reg)
     score = reg.score(X_reg, y_reg)
-    assert (
-        score > 0.93
-    ), "Failed with max_features=None, criterion %s and score = %f" % (
-        criterion,
-        score,
+    assert score > 0.93, (
+        "Failed with max_features=None, criterion %s and score = %f"
+        % (
+            criterion,
+            score,
+        )
     )
 
     reg = ForestRegressor(
@@ -1068,10 +1069,10 @@
         node_weights = np.bincount(out, weights=weights)
         # drop inner nodes
         leaf_weights = node_weights[node_weights != 0]
-        assert (
-            np.min(leaf_weights) >= total_weight * est.min_weight_fraction_leaf
-        ), "Failed with {0} min_weight_fraction_leaf={1}".format(
-            name, est.min_weight_fraction_leaf
+        assert np.min(leaf_weights) >= total_weight * est.min_weight_fraction_leaf, (
+            "Failed with {0} min_weight_fraction_leaf={1}".format(
+                name, est.min_weight_fraction_leaf
+            )
         )
 
 

--- sklearn/experimental/enable_hist_gradient_boosting.py
+++ sklearn/experimental/enable_hist_gradient_boosting.py
@@ -13,7 +13,6 @@
 # Don't remove this file, we don't want to break users code just because the
 # feature isn't experimental anymore.
 
-
 import warnings
 
 warnings.warn(

--- sklearn/feature_selection/_univariate_selection.py
+++ sklearn/feature_selection/_univariate_selection.py
@@ -3,7 +3,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from numbers import Integral, Real
 

--- sklearn/gaussian_process/tests/test_gpc.py
+++ sklearn/gaussian_process/tests/test_gpc.py
@@ -147,8 +147,9 @@
     # Define a dummy optimizer that simply tests 10 random hyperparameters
     def optimizer(obj_func, initial_theta, bounds):
         rng = np.random.RandomState(global_random_seed)
-        theta_opt, func_min = initial_theta, obj_func(
-            initial_theta, eval_gradient=False
+        theta_opt, func_min = (
+            initial_theta,
+            obj_func(initial_theta, eval_gradient=False),
         )
         for _ in range(10):
             theta = np.atleast_1d(

--- sklearn/gaussian_process/tests/test_gpr.py
+++ sklearn/gaussian_process/tests/test_gpr.py
@@ -394,8 +394,9 @@
     # Define a dummy optimizer that simply tests 50 random hyperparameters
     def optimizer(obj_func, initial_theta, bounds):
         rng = np.random.RandomState(0)
-        theta_opt, func_min = initial_theta, obj_func(
-            initial_theta, eval_gradient=False
+        theta_opt, func_min = (
+            initial_theta,
+            obj_func(initial_theta, eval_gradient=False),
         )
         for _ in range(50):
             theta = np.atleast_1d(

--- sklearn/inspection/_plot/tests/test_plot_partial_dependence.py
+++ sklearn/inspection/_plot/tests/test_plot_partial_dependence.py
@@ -1186,9 +1186,9 @@
     )
 
     line = disp.lines_[0, 0, -1]
-    assert (
-        line.get_color() == expected_colors[0]
-    ), f"{line.get_color()}!={expected_colors[0]}\n{line_kw} and {pd_line_kw}"
+    assert line.get_color() == expected_colors[0], (
+        f"{line.get_color()}!={expected_colors[0]}\n{line_kw} and {pd_line_kw}"
+    )
     if pd_line_kw is not None:
         if "linestyle" in pd_line_kw:
             assert line.get_linestyle() == pd_line_kw["linestyle"]
@@ -1198,9 +1198,9 @@
         assert line.get_linestyle() == "--"
 
     line = disp.lines_[0, 0, 0]
-    assert (
-        line.get_color() == expected_colors[1]
-    ), f"{line.get_color()}!={expected_colors[1]}"
+    assert line.get_color() == expected_colors[1], (
+        f"{line.get_color()}!={expected_colors[1]}"
+    )
     if ice_lines_kw is not None:
         if "linestyle" in ice_lines_kw:
             assert line.get_linestyle() == ice_lines_kw["linestyle"]

--- sklearn/linear_model/_glm/_newton_solver.py
+++ sklearn/linear_model/_glm/_newton_solver.py
@@ -254,7 +254,7 @@
             check = loss_improvement <= t * armijo_term
             if is_verbose:
                 print(
-                    f"    line search iteration={i+1}, step size={t}\n"
+                    f"    line search iteration={i + 1}, step size={t}\n"
                     f"      check loss improvement <= armijo term: {loss_improvement} "
                     f"<= {t * armijo_term} {check}"
                 )
@@ -300,7 +300,7 @@
         self.raw_prediction = raw
         if is_verbose:
             print(
-                f"    line search successful after {i+1} iterations with "
+                f"    line search successful after {i + 1} iterations with "
                 f"loss={self.loss_value}."
             )
 

--- sklearn/linear_model/_linear_loss.py
+++ sklearn/linear_model/_linear_loss.py
@@ -537,9 +537,9 @@
                 # The L2 penalty enters the Hessian on the diagonal only. To add those
                 # terms, we use a flattened view of the array.
                 order = "C" if hess.flags.c_contiguous else "F"
-                hess.reshape(-1, order=order)[
-                    : (n_features * n_dof) : (n_dof + 1)
-                ] += l2_reg_strength
+                hess.reshape(-1, order=order)[: (n_features * n_dof) : (n_dof + 1)] += (
+                    l2_reg_strength
+                )
 
             if self.fit_intercept:
                 # With intercept included as added column to X, the hessian becomes

--- sklearn/linear_model/_ridge.py
+++ sklearn/linear_model/_ridge.py
@@ -5,7 +5,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import numbers
 import warnings
 from abc import ABCMeta, abstractmethod

--- sklearn/linear_model/_theil_sen.py
+++ sklearn/linear_model/_theil_sen.py
@@ -5,7 +5,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from itertools import combinations
 from numbers import Integral, Real

--- sklearn/linear_model/tests/test_ridge.py
+++ sklearn/linear_model/tests/test_ridge.py
@@ -859,9 +859,9 @@
     loo_ridge.fit(X, y)
     gcv_ridge.fit(X, y)
 
-    assert gcv_ridge.alpha_ == pytest.approx(
-        loo_ridge.alpha_
-    ), f"{gcv_ridge.alpha_=}, {loo_ridge.alpha_=}"
+    assert gcv_ridge.alpha_ == pytest.approx(loo_ridge.alpha_), (
+        f"{gcv_ridge.alpha_=}, {loo_ridge.alpha_=}"
+    )
     assert_allclose(gcv_ridge.coef_, loo_ridge.coef_, rtol=1e-3)
     assert_allclose(gcv_ridge.intercept_, loo_ridge.intercept_, rtol=1e-3)
 
@@ -1519,9 +1519,9 @@
     X = rng.randn(n_samples, n_features)
 
     ridge_est = Estimator(alphas=alphas)
-    assert (
-        ridge_est.alphas is alphas
-    ), f"`alphas` was mutated in `{Estimator.__name__}.__init__`"
+    assert ridge_est.alphas is alphas, (
+        f"`alphas` was mutated in `{Estimator.__name__}.__init__`"
+    )
 
     ridge_est.fit(X, y)
     assert_array_equal(ridge_est.alphas, np.asarray(alphas))

--- sklearn/manifold/_spectral_embedding.py
+++ sklearn/manifold/_spectral_embedding.py
@@ -3,7 +3,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from numbers import Integral, Real
 

--- sklearn/manifold/_t_sne.py
+++ sklearn/manifold/_t_sne.py
@@ -964,9 +964,9 @@
             P = _joint_probabilities(distances, self.perplexity, self.verbose)
             assert np.all(np.isfinite(P)), "All probabilities should be finite"
             assert np.all(P >= 0), "All probabilities should be non-negative"
-            assert np.all(
-                P <= 1
-            ), "All probabilities should be less or then equal to one"
+            assert np.all(P <= 1), (
+                "All probabilities should be less or then equal to one"
+            )
 
         else:
             # Compute the number of nearest neighbors to find.

--- sklearn/metrics/_ranking.py
+++ sklearn/metrics/_ranking.py
@@ -10,7 +10,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from functools import partial
 from numbers import Integral, Real

--- sklearn/metrics/cluster/_supervised.py
+++ sklearn/metrics/cluster/_supervised.py
@@ -7,7 +7,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from math import log
 from numbers import Real

--- sklearn/metrics/cluster/_unsupervised.py
+++ sklearn/metrics/cluster/_unsupervised.py
@@ -3,7 +3,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import functools
 from numbers import Integral
 

--- sklearn/metrics/tests/test_common.py
+++ sklearn/metrics/tests/test_common.py
@@ -640,7 +640,6 @@
 
 @pytest.mark.parametrize("name", sorted(NOT_SYMMETRIC_METRICS))
 def test_not_symmetric_metric(name):
-
     # Test the symmetry of score and loss functions
     random_state = check_random_state(0)
     metric = ALL_METRICS[name]
@@ -1004,7 +1003,8 @@
 @pytest.mark.parametrize("metric", CLASSIFICATION_METRICS.values())
 @pytest.mark.parametrize(
     "y_true, y_score",
-    invalids_nan_inf +
+    invalids_nan_inf
+    +
     # Add an additional case for classification only
     # non-regression test for:
     # https://github.com/scikit-learn/scikit-learn/issues/6809
@@ -2103,7 +2103,6 @@
 
 
 def check_array_api_metric_pairwise(metric, array_namespace, device, dtype_name):
-
     X_np = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=dtype_name)
     Y_np = np.array([[0.2, 0.3, 0.4], [0.5, 0.6, 0.7]], dtype=dtype_name)
 

--- sklearn/metrics/tests/test_pairwise_distances_reduction.py
+++ sklearn/metrics/tests/test_pairwise_distances_reduction.py
@@ -228,9 +228,9 @@
     # on average. Yielding too many results would make the test slow (because
     # checking the results is expensive for large result sets), yielding 0 most
     # of the time would make the test useless.
-    assert (
-        precomputed_dists is not None or metric is not None
-    ), "Either metric or precomputed_dists must be provided."
+    assert precomputed_dists is not None or metric is not None, (
+        "Either metric or precomputed_dists must be provided."
+    )
 
     if precomputed_dists is None:
         assert X is not None

--- sklearn/mixture/tests/test_bayesian_mixture.py
+++ sklearn/mixture/tests/test_bayesian_mixture.py
@@ -118,7 +118,7 @@
     )
     msg = (
         "The parameter 'degrees_of_freedom_prior' should be greater than"
-        f" {n_features -1}, but got {bad_degrees_of_freedom_prior_:.3f}."
+        f" {n_features - 1}, but got {bad_degrees_of_freedom_prior_:.3f}."
     )
     with pytest.raises(ValueError, match=msg):
         bgmm.fit(X)

--- sklearn/model_selection/_validation.py
+++ sklearn/model_selection/_validation.py
@@ -6,7 +6,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import numbers
 import time
 import warnings
@@ -819,9 +818,9 @@
     progress_msg = ""
     if verbose > 2:
         if split_progress is not None:
-            progress_msg = f" {split_progress[0]+1}/{split_progress[1]}"
+            progress_msg = f" {split_progress[0] + 1}/{split_progress[1]}"
         if candidate_progress and verbose > 9:
-            progress_msg += f"; {candidate_progress[0]+1}/{candidate_progress[1]}"
+            progress_msg += f"; {candidate_progress[0] + 1}/{candidate_progress[1]}"
 
     if verbose > 1:
         if parameters is None:

--- sklearn/model_selection/tests/test_search.py
+++ sklearn/model_selection/tests/test_search.py
@@ -2419,9 +2419,9 @@
     for _pairwise_setting in [True, False]:
         est.set_params(pairwise=_pairwise_setting)
         cv = GridSearchCV(est, {"n_neighbors": [10]})
-        assert (
-            _pairwise_setting == cv.__sklearn_tags__().input_tags.pairwise
-        ), attr_message
+        assert _pairwise_setting == cv.__sklearn_tags__().input_tags.pairwise, (
+            attr_message
+        )
 
 
 def test_search_cv_pairwise_property_equivalence_of_precomputed():

--- sklearn/model_selection/tests/test_split.py
+++ sklearn/model_selection/tests/test_split.py
@@ -885,9 +885,9 @@
         bf = stats.binom(n_splits, p)
         for count in idx_counts:
             prob = bf.pmf(count)
-            assert (
-                prob > threshold
-            ), "An index is not drawn with chance corresponding to even draws"
+            assert prob > threshold, (
+                "An index is not drawn with chance corresponding to even draws"
+            )
 
     for n_samples in (6, 22):
         groups = np.array((n_samples // 2) * [0, 1])

--- sklearn/multioutput.py
+++ sklearn/multioutput.py
@@ -8,7 +8,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from abc import ABCMeta, abstractmethod
 from numbers import Integral
@@ -687,7 +686,6 @@
             )
 
         if self.base_estimator != "deprecated":
-
             warning_msg = (
                 "`base_estimator` as an argument was deprecated in 1.7 and will be"
                 " removed in 1.9. Use `estimator` instead."

--- sklearn/neighbors/tests/test_neighbors.py
+++ sklearn/neighbors/tests/test_neighbors.py
@@ -655,10 +655,12 @@
             assert_allclose(np.concatenate(list(ind)), np.concatenate(list(ind1)))
 
         for i in range(len(results) - 1):
-            assert_allclose(
-                np.concatenate(list(results[i][0])),
-                np.concatenate(list(results[i + 1][0])),
-            ),
+            (
+                assert_allclose(
+                    np.concatenate(list(results[i][0])),
+                    np.concatenate(list(results[i + 1][0])),
+                ),
+            )
             assert_allclose(
                 np.concatenate(list(results[i][1])),
                 np.concatenate(list(results[i + 1][1])),

--- sklearn/preprocessing/tests/test_function_transformer.py
+++ sklearn/preprocessing/tests/test_function_transformer.py
@@ -36,13 +36,13 @@
     )
 
     # The function should only have received X.
-    assert args_store == [
-        X
-    ], "Incorrect positional arguments passed to func: {args}".format(args=args_store)
+    assert args_store == [X], (
+        "Incorrect positional arguments passed to func: {args}".format(args=args_store)
+    )
 
-    assert (
-        not kwargs_store
-    ), "Unexpected keyword arguments passed to func: {args}".format(args=kwargs_store)
+    assert not kwargs_store, (
+        "Unexpected keyword arguments passed to func: {args}".format(args=kwargs_store)
+    )
 
     # reset the argument stores.
     args_store[:] = []
@@ -56,13 +56,13 @@
     )
 
     # The function should have received X
-    assert args_store == [
-        X
-    ], "Incorrect positional arguments passed to func: {args}".format(args=args_store)
+    assert args_store == [X], (
+        "Incorrect positional arguments passed to func: {args}".format(args=args_store)
+    )
 
-    assert (
-        not kwargs_store
-    ), "Unexpected keyword arguments passed to func: {args}".format(args=kwargs_store)
+    assert not kwargs_store, (
+        "Unexpected keyword arguments passed to func: {args}".format(args=kwargs_store)
+    )
 
 
 def test_np_log():

--- sklearn/semi_supervised/_self_training.py
+++ sklearn/semi_supervised/_self_training.py
@@ -217,8 +217,7 @@
         # TODO(1.8) remove
         elif self.estimator is None and self.base_estimator == "deprecated":
             raise ValueError(
-                "You must pass an estimator to SelfTrainingClassifier."
-                " Use `estimator`."
+                "You must pass an estimator to SelfTrainingClassifier. Use `estimator`."
             )
         elif self.estimator is not None and self.base_estimator != "deprecated":
             raise ValueError(

--- sklearn/tests/metadata_routing_common.py
+++ sklearn/tests/metadata_routing_common.py
@@ -74,9 +74,9 @@
     for record in all_records:
         # first check that the names of the metadata passed are the same as
         # expected. The names are stored as keys in `record`.
-        assert set(kwargs.keys()) == set(
-            record.keys()
-        ), f"Expected {kwargs.keys()} vs {record.keys()}"
+        assert set(kwargs.keys()) == set(record.keys()), (
+            f"Expected {kwargs.keys()} vs {record.keys()}"
+        )
         for key, value in kwargs.items():
             recorded_value = record[key]
             # The following condition is used to check for any specified parameters
@@ -87,9 +87,9 @@
                 if isinstance(recorded_value, np.ndarray):
                     assert_array_equal(recorded_value, value)
                 else:
-                    assert (
-                        recorded_value is value
-                    ), f"Expected {recorded_value} vs {value}. Method: {method}"
+                    assert recorded_value is value, (
+                        f"Expected {recorded_value} vs {value}. Method: {method}"
+                    )
 
 
 record_metadata_not_default = partial(record_metadata, record_default=False)

--- sklearn/tests/test_common.py
+++ sklearn/tests/test_common.py
@@ -298,7 +298,6 @@
     "transformer", GET_FEATURES_OUT_ESTIMATORS, ids=_get_check_estimator_ids
 )
 def test_transformers_get_feature_names_out(transformer):
-
     with ignore_warnings(category=(FutureWarning)):
         check_transformer_get_feature_names_out(
             transformer.__class__.__name__, transformer

--- sklearn/tests/test_discriminant_analysis.py
+++ sklearn/tests/test_discriminant_analysis.py
@@ -305,16 +305,16 @@
     clf_lda_eigen = LinearDiscriminantAnalysis(solver="eigen")
     clf_lda_eigen.fit(X, y)
     assert_almost_equal(clf_lda_eigen.explained_variance_ratio_.sum(), 1.0, 3)
-    assert clf_lda_eigen.explained_variance_ratio_.shape == (
-        2,
-    ), "Unexpected length for explained_variance_ratio_"
+    assert clf_lda_eigen.explained_variance_ratio_.shape == (2,), (
+        "Unexpected length for explained_variance_ratio_"
+    )
 
     clf_lda_svd = LinearDiscriminantAnalysis(solver="svd")
     clf_lda_svd.fit(X, y)
     assert_almost_equal(clf_lda_svd.explained_variance_ratio_.sum(), 1.0, 3)
-    assert clf_lda_svd.explained_variance_ratio_.shape == (
-        2,
-    ), "Unexpected length for explained_variance_ratio_"
+    assert clf_lda_svd.explained_variance_ratio_.shape == (2,), (
+        "Unexpected length for explained_variance_ratio_"
+    )
 
     assert_array_almost_equal(
         clf_lda_svd.explained_variance_ratio_, clf_lda_eigen.explained_variance_ratio_

--- sklearn/tests/test_metaestimators.py
+++ sklearn/tests/test_metaestimators.py
@@ -157,11 +157,12 @@
             if method in delegator_data.skip_methods:
                 continue
             assert hasattr(delegate, method)
-            assert hasattr(
-                delegator, method
-            ), "%s does not have method %r when its delegate does" % (
-                delegator_data.name,
-                method,
+            assert hasattr(delegator, method), (
+                "%s does not have method %r when its delegate does"
+                % (
+                    delegator_data.name,
+                    method,
+                )
             )
             # delegation before fit raises a NotFittedError
             if method == "score":
@@ -191,11 +192,12 @@
             delegate = SubEstimator(hidden_method=method)
             delegator = delegator_data.construct(delegate)
             assert not hasattr(delegate, method)
-            assert not hasattr(
-                delegator, method
-            ), "%s has method %r when its delegate does not" % (
-                delegator_data.name,
-                method,
+            assert not hasattr(delegator, method), (
+                "%s has method %r when its delegate does not"
+                % (
+                    delegator_data.name,
+                    method,
+                )
             )
 
 

--- sklearn/tree/tests/test_monotonic_tree.py
+++ sklearn/tree/tests/test_monotonic_tree.py
@@ -80,9 +80,9 @@
     est.fit(X_train, y_train)
     proba_test = est.predict_proba(X_test)
 
-    assert np.logical_and(
-        proba_test >= 0.0, proba_test <= 1.0
-    ).all(), "Probability should always be in [0, 1] range."
+    assert np.logical_and(proba_test >= 0.0, proba_test <= 1.0).all(), (
+        "Probability should always be in [0, 1] range."
+    )
     assert_allclose(proba_test.sum(axis=1), 1.0)
 
     # Monotonic increase constraint, it applies to the positive class

--- sklearn/tree/tests/test_tree.py
+++ sklearn/tree/tests/test_tree.py
@@ -198,10 +198,10 @@
 
 
 def assert_tree_equal(d, s, message):
-    assert (
-        s.node_count == d.node_count
-    ), "{0}: inequal number of node ({1} != {2})".format(
-        message, s.node_count, d.node_count
+    assert s.node_count == d.node_count, (
+        "{0}: inequal number of node ({1} != {2})".format(
+            message, s.node_count, d.node_count
+        )
     )
 
     assert_array_equal(
@@ -330,9 +330,9 @@
     reg = Tree(criterion=criterion, random_state=0)
     reg.fit(diabetes.data, diabetes.target)
     score = mean_squared_error(diabetes.target, reg.predict(diabetes.data))
-    assert score == pytest.approx(
-        0
-    ), f"Failed with {name}, criterion = {criterion} and score = {score}"
+    assert score == pytest.approx(0), (
+        f"Failed with {name}, criterion = {criterion} and score = {score}"
+    )
 
 
 @skip_if_32bit
@@ -697,10 +697,10 @@
         node_weights = np.bincount(out, weights=weights)
         # drop inner nodes
         leaf_weights = node_weights[node_weights != 0]
-        assert (
-            np.min(leaf_weights) >= total_weight * est.min_weight_fraction_leaf
-        ), "Failed with {0} min_weight_fraction_leaf={1}".format(
-            name, est.min_weight_fraction_leaf
+        assert np.min(leaf_weights) >= total_weight * est.min_weight_fraction_leaf, (
+            "Failed with {0} min_weight_fraction_leaf={1}".format(
+                name, est.min_weight_fraction_leaf
+            )
         )
 
     # test case with no weights passed in
@@ -720,10 +720,10 @@
         node_weights = np.bincount(out)
         # drop inner nodes
         leaf_weights = node_weights[node_weights != 0]
-        assert (
-            np.min(leaf_weights) >= total_weight * est.min_weight_fraction_leaf
-        ), "Failed with {0} min_weight_fraction_leaf={1}".format(
-            name, est.min_weight_fraction_leaf
+        assert np.min(leaf_weights) >= total_weight * est.min_weight_fraction_leaf, (
+            "Failed with {0} min_weight_fraction_leaf={1}".format(
+                name, est.min_weight_fraction_leaf
+            )
         )
 
 
@@ -845,10 +845,10 @@
             (est3, 0.0001),
             (est4, 0.1),
         ):
-            assert (
-                est.min_impurity_decrease <= expected_decrease
-            ), "Failed, min_impurity_decrease = {0} > {1}".format(
-                est.min_impurity_decrease, expected_decrease
+            assert est.min_impurity_decrease <= expected_decrease, (
+                "Failed, min_impurity_decrease = {0} > {1}".format(
+                    est.min_impurity_decrease, expected_decrease
+                )
             )
             est.fit(X, y)
             for node in range(est.tree_.node_count):
@@ -879,10 +879,10 @@
                         imp_parent - wtd_avg_left_right_imp
                     )
 
-                    assert (
-                        actual_decrease >= expected_decrease
-                    ), "Failed with {0} expected min_impurity_decrease={1}".format(
-                        actual_decrease, expected_decrease
+                    assert actual_decrease >= expected_decrease, (
+                        "Failed with {0} expected min_impurity_decrease={1}".format(
+                            actual_decrease, expected_decrease
+                        )
                     )
 
 
@@ -923,9 +923,9 @@
         assert type(est2) == est.__class__
 
         score2 = est2.score(X, y)
-        assert (
-            score == score2
-        ), "Failed to generate same score  after pickling with {0}".format(name)
+        assert score == score2, (
+            "Failed to generate same score  after pickling with {0}".format(name)
+        )
         for attribute in fitted_attribute:
             assert_array_equal(
                 getattr(est2.tree_, attribute),
@@ -2614,9 +2614,9 @@
     # Check that the tree can learn the predictive feature
     # over an average of cross-validation fits.
     tree_cv_score = cross_val_score(tree, X, y, cv=5).mean()
-    assert (
-        tree_cv_score >= expected_score
-    ), f"Expected CV score: {expected_score} but got {tree_cv_score}"
+    assert tree_cv_score >= expected_score, (
+        f"Expected CV score: {expected_score} but got {tree_cv_score}"
+    )
 
 
 @pytest.mark.parametrize(

--- sklearn/utils/_metadata_requests.py
+++ sklearn/utils/_metadata_requests.py
@@ -1098,8 +1098,9 @@
             method_mapping = MethodMapping()
             for method in METHODS:
                 method_mapping.add(caller=method, callee=method)
-            yield "$self_request", RouterMappingPair(
-                mapping=method_mapping, router=self._self_request
+            yield (
+                "$self_request",
+                RouterMappingPair(mapping=method_mapping, router=self._self_request),
             )
         for name, route_mapping in self._route_mappings.items():
             yield (name, route_mapping)

--- sklearn/utils/_test_common/instance_generator.py
+++ sklearn/utils/_test_common/instance_generator.py
@@ -961,8 +961,7 @@
     },
     HalvingGridSearchCV: {
         "check_fit2d_1sample": (
-            "Fail during parameter check since min/max resources requires"
-            " more samples"
+            "Fail during parameter check since min/max resources requires more samples"
         ),
         "check_estimators_nan_inf": "FIXME",
         "check_classifiers_one_label_sample_weights": "FIXME",
@@ -972,8 +971,7 @@
     },
     HalvingRandomSearchCV: {
         "check_fit2d_1sample": (
-            "Fail during parameter check since min/max resources requires"
-            " more samples"
+            "Fail during parameter check since min/max resources requires more samples"
         ),
         "check_estimators_nan_inf": "FIXME",
         "check_classifiers_one_label_sample_weights": "FIXME",

--- sklearn/utils/estimator_checks.py
+++ sklearn/utils/estimator_checks.py
@@ -4769,9 +4769,9 @@
     else:
         n_features_out = X_transform.shape[1]
 
-    assert (
-        len(feature_names_out) == n_features_out
-    ), f"Expected {n_features_out} feature names, got {len(feature_names_out)}"
+    assert len(feature_names_out) == n_features_out, (
+        f"Expected {n_features_out} feature names, got {len(feature_names_out)}"
+    )
 
 
 def check_transformer_get_feature_names_out_pandas(name, transformer_orig):
@@ -4826,9 +4826,9 @@
     else:
         n_features_out = X_transform.shape[1]
 
-    assert (
-        len(feature_names_out_default) == n_features_out
-    ), f"Expected {n_features_out} feature names, got {len(feature_names_out_default)}"
+    assert len(feature_names_out_default) == n_features_out, (
+        f"Expected {n_features_out} feature names, got {len(feature_names_out_default)}"
+    )
 
 
 def check_param_validation(name, estimator_orig):
@@ -5339,9 +5339,7 @@
                 'Only binary classification is supported. The type of the target '
                 f'is {{y_type}}.'
         )
-    """.format(
-        name=name
-    )
+    """.format(name=name)
     err_msg = textwrap.dedent(err_msg)
 
     with raises(

--- sklearn/utils/tests/test_indexing.py
+++ sklearn/utils/tests/test_indexing.py
@@ -588,7 +588,6 @@
 
 
 def test_notimplementederror():
-
     with pytest.raises(
         NotImplementedError,
         match="Resampling with sample_weight is only implemented for replace=True.",

--- sklearn/utils/tests/test_multiclass.py
+++ sklearn/utils/tests/test_multiclass.py
@@ -366,17 +366,17 @@
                     )
                 ]
                 for exmpl_sparse in examples_sparse:
-                    assert sparse_exp == is_multilabel(
-                        exmpl_sparse
-                    ), f"is_multilabel({exmpl_sparse!r}) should be {sparse_exp}"
+                    assert sparse_exp == is_multilabel(exmpl_sparse), (
+                        f"is_multilabel({exmpl_sparse!r}) should be {sparse_exp}"
+                    )
 
             # Densify sparse examples before testing
             if issparse(example):
                 example = example.toarray()
 
-            assert dense_exp == is_multilabel(
-                example
-            ), f"is_multilabel({example!r}) should be {dense_exp}"
+            assert dense_exp == is_multilabel(example), (
+                f"is_multilabel({example!r}) should be {dense_exp}"
+            )
 
 
 @pytest.mark.parametrize(
@@ -396,9 +396,9 @@
             example = xp.asarray(example, device=device)
 
             with config_context(array_api_dispatch=True):
-                assert dense_exp == is_multilabel(
-                    example
-                ), f"is_multilabel({example!r}) should be {dense_exp}"
+                assert dense_exp == is_multilabel(example), (
+                    f"is_multilabel({example!r}) should be {dense_exp}"
+                )
 
 
 def test_check_classification_targets():
@@ -416,12 +416,13 @@
 def test_type_of_target():
     for group, group_examples in EXAMPLES.items():
         for example in group_examples:
-            assert (
-                type_of_target(example) == group
-            ), "type_of_target(%r) should be %r, got %r" % (
-                example,
-                group,
-                type_of_target(example),
+            assert type_of_target(example) == group, (
+                "type_of_target(%r) should be %r, got %r"
+                % (
+                    example,
+                    group,
+                    type_of_target(example),
+                )
             )
 
     for example in NON_ARRAY_LIKE_EXAMPLES:

--- sklearn/utils/tests/test_seq_dataset.py
+++ sklearn/utils/tests/test_seq_dataset.py
@@ -154,30 +154,34 @@
 
 def test_buffer_dtype_mismatch_error():
     with pytest.raises(ValueError, match="Buffer dtype mismatch"):
-        ArrayDataset64(X32, y32, sample_weight32, seed=42),
+        (ArrayDataset64(X32, y32, sample_weight32, seed=42),)
 
     with pytest.raises(ValueError, match="Buffer dtype mismatch"):
-        ArrayDataset32(X64, y64, sample_weight64, seed=42),
+        (ArrayDataset32(X64, y64, sample_weight64, seed=42),)
 
     for csr_container in CSR_CONTAINERS:
         X_csr32 = csr_container(X32)
         X_csr64 = csr_container(X64)
         with pytest.raises(ValueError, match="Buffer dtype mismatch"):
-            CSRDataset64(
-                X_csr32.data,
-                X_csr32.indptr,
-                X_csr32.indices,
-                y32,
-                sample_weight32,
-                seed=42,
-            ),
+            (
+                CSRDataset64(
+                    X_csr32.data,
+                    X_csr32.indptr,
+                    X_csr32.indices,
+                    y32,
+                    sample_weight32,
+                    seed=42,
+                ),
+            )
 
         with pytest.raises(ValueError, match="Buffer dtype mismatch"):
-            CSRDataset32(
-                X_csr64.data,
-                X_csr64.indptr,
-                X_csr64.indices,
-                y64,
-                sample_weight64,
-                seed=42,
-            ),
+            (
+                CSRDataset32(
+                    X_csr64.data,
+                    X_csr64.indptr,
+                    X_csr64.indices,
+                    y64,
+                    sample_weight64,
+                    seed=42,
+                ),
+            )

--- sklearn/utils/tests/test_tags.py
+++ sklearn/utils/tests/test_tags.py
@@ -565,7 +565,6 @@
     assert _to_new_tags(_to_old_tags(new_tags), estimator=estimator) == new_tags
 
     class MyClass:
-
         def fit(self, X, y=None):
             return self  # pragma: no cover
 

--- sklearn/utils/tests/test_validation.py
+++ sklearn/utils/tests/test_validation.py
@@ -849,9 +849,9 @@
         def fit(self, X, y, sample_weight=None):
             pass
 
-    assert has_fit_parameter(
-        TestClassWithDeprecatedFitMethod, "sample_weight"
-    ), "has_fit_parameter fails for class with deprecated fit method."
+    assert has_fit_parameter(TestClassWithDeprecatedFitMethod, "sample_weight"), (
+        "has_fit_parameter fails for class with deprecated fit method."
+    )
 
 
 def test_check_symmetric():

--- sklearn/utils/validation.py
+++ sklearn/utils/validation.py
@@ -1547,8 +1547,7 @@
         # hasattr(estimator, "fit") makes it so that we don't fail for an estimator
         # that does not have a `fit` method during collection of checks. The right
         # checks will fail later.
-        hasattr(estimator, "fit")
-        and parameter in signature(estimator.fit).parameters
+        hasattr(estimator, "fit") and parameter in signature(estimator.fit).parameters
     )
 
 

60 files would be reformatted, 860 files already formatted

Generated for commit: 42d2da8. Link to the linter CI: here

adrinjalali (Member) left a comment


The same issue could exist in all other non-Latin languages then. We should test with some Arabic as well as Korean/Japanese/Chinese-like strings, and maybe call the function something like "unicode_tokenizer"?

The function should probably also be invisible to the user. Maybe we can accept "unicode" as a string for the constructor argument? I'm not sure.

In general, I'm not sure we want to go down this rabbit hole, since it's not really the core of the library.
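
For context only (this is an existing option, not part of this PR): character n-grams already offer a script-agnostic way to vectorize text without any word segmentation, which is relevant for scripts such as Chinese or Japanese where words are not whitespace-delimited. A minimal sketch:

from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams avoid word segmentation altogether, so they work for any
# script, including ones without whitespace-delimited words.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(["नमस्ते दुनिया", "هذا مثال", "機械学習"])
print(vectorizer.get_feature_names_out()[:10])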

Development

Successfully merging this pull request may close these issues.

The default token pattern in CountVectorizer breaks Indic sentences into non-sensical tokens
2 participants