NEREUScode

Commit Description:

Replaced the data_binary fixture that filtered classes from a multiclass dataset with a new fixture generating a synthetic binary classification dataset using make_classification. This ensures consistent data characteristics, introduces label noise, and better simulates real-world classification challenges.


PR Description:

Summary of Changes:

This PR refactors the data_binary fixture in test_roc_curve_display.py. The previous fixture filtered a multiclass dataset (Iris) down to two classes to create a binary classification task. With that data, AUC values consistently reached 1.0, which does not reflect real-world classification problems.

The new fixture utilizes make_classification from sklearn.datasets to generate a synthetic binary classification dataset with the following characteristics:

  • 200 samples and 20 features.
  • 5 informative features and 2 redundant features.
  • Label noise (flip_y=0.1): the class of roughly 10% of the samples is assigned at random, to simulate real-world imperfections in the data.
  • Class separation (class_sep=0.8) set to avoid perfect separation.
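
As a rough illustration of the parameters listed above, the data generation could look like the sketch below. The parameter values come from this PR; the helper name `make_binary_data` is hypothetical, and the pytest fixture decorator is left out so the function can be called directly.

```python
from sklearn.datasets import make_classification


def make_binary_data():
    """Build a noisy synthetic binary dataset (hypothetical helper name)."""
    X, y = make_classification(
        n_samples=200,
        n_features=20,
        n_informative=5,
        n_redundant=2,
        flip_y=0.1,  # Add some label noise
        class_sep=0.8,  # Reduce separation for more overlap
        random_state=42,  # Fixed seed keeps the test data reproducible
    )
    return X, y


X, y = make_binary_data()
print(X.shape, sorted(set(y.tolist())))
```

In the test module this body would sit inside the data_binary fixture, so the tests that request it receive the same (X, y) pair.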

These changes provide a more complex and representative dataset for testing RocCurveDisplay and related metrics, which makes the tests more robust.
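
The difference between the two datasets can be checked with a short comparison. This is only a sketch: the choice of LogisticRegression, the in-sample scoring, and the `y < 2` filter used to reproduce the old Iris-based data are assumptions for illustration, not taken from the test file.

```python
from sklearn.datasets import load_iris, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def in_sample_auc(X, y):
    """Fit a logistic regression and compute the AUC on the same data."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return roc_auc_score(y, clf.predict_proba(X)[:, 1])


# Old approach (assumed filter): keep only the first two Iris classes.
X_iris, y_iris = load_iris(return_X_y=True)
mask = y_iris < 2
auc_iris = in_sample_auc(X_iris[mask], y_iris[mask])

# New approach: synthetic data with the parameters listed in this PR.
X_syn, y_syn = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    flip_y=0.1,
    class_sep=0.8,
    random_state=42,
)
auc_synthetic = in_sample_auc(X_syn, y_syn)

print(f"Iris (two classes): AUC = {auc_iris:.3f}")
print(f"Synthetic:          AUC = {auc_synthetic:.3f}")
```

The two Iris classes kept by the filter are linearly separable, so the classifier ranks them perfectly, while the label noise in the synthetic data keeps the AUC below 1.0.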

Reference Issues/PRs:


For Reviewers:

  • This change ensures that the dataset used for testing is more reflective of real-world data, particularly in classification tasks that may involve noise and less clear separation between classes.

mohammed benyamna added 3 commits April 25, 2025 19:33
Replaced the `data_binary` fixture that filtered classes from a multiclass dataset 
with a new fixture generating a synthetic binary classification dataset using 
`make_classification`. This ensures consistent data characteristics, introduces 
label noise, and better simulates real-world classification challenges.

github-actions bot commented Apr 25, 2025

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff format

ruff detected issues. Please run ruff format locally and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.2.


--- sklearn/metrics/_plot/tests/test_roc_curve_display.py
+++ sklearn/metrics/_plot/tests/test_roc_curve_display.py
@@ -31,8 +31,8 @@
         n_features=20,
         n_informative=5,
         n_redundant=2,
-        flip_y=0.1,        # Add some label noise
-        class_sep=0.8,     # Reduce separation for more overlap
+        flip_y=0.1,  # Add some label noise
+        class_sep=0.8,  # Reduce separation for more overlap
         random_state=42,
     )
     return X, y

1 file would be reformatted, 918 files already formatted

Generated for commit: 4cfe688. Link to the linter CI: here

Successfully merging this pull request may close these issues.

Use more complex data in test_roc_curve_display.py