ENH improve ARFF parser using pandas #21938

Merged: 305 commits (May 12, 2022)

Commits
d94a53b
iter
glemaitre Mar 7, 2022
7de00ca
address comments by thomas
glemaitre Mar 8, 2022
40e9550
address thomas comment
glemaitre Mar 10, 2022
aae0d16
fix
glemaitre Mar 10, 2022
221437d
DOC: use notebook-style for wikipedia_principal_eigenvector.py (#22704)
AmarCodes-22 Mar 7, 2022
54f6e07
CLN Removes unused fused type (#22727)
thomasjpfan Mar 8, 2022
819dcb3
ENH Adds feature_names_out for most of kernel_approximation (#22694)
thomasjpfan Mar 8, 2022
842dc16
DOC Fixes VotingClassifier.transform docstring (#22698)
thomasjpfan Mar 8, 2022
f055862
ENH Adds get_feature_names_out for AdditiveChi2Sampler (#22137)
thomasjpfan Mar 8, 2022
43be0df
ENH Improve error message for top_k_accuracy_score (#22284)
thomasjpfan Mar 8, 2022
aa2190a
CI Fixes job check-manifest dependency (#22732)
thomasjpfan Mar 8, 2022
a12f746
CI don't run check_manifest on forks (#22729)
jeremiedbb Mar 8, 2022
a139b20
ENH Adds feature_names_out to stacking estimators (#22695)
thomasjpfan Mar 8, 2022
d17594c
ENH Add get_feature_names_out for RandomTreesEmbedding module (#21762)
MaxwellLZH Mar 8, 2022
03572e3
ENH Add feature_names_out to voting estimators (#22697)
thomasjpfan Mar 8, 2022
3def6a1
DOC Update notebook-style example plot_affinity_propagation (#22559)
AmarCodes-22 Mar 9, 2022
c6b2e98
MNT Refactor KMeans and MiniBatchKMeans to inherit from a common base…
jeremiedbb Mar 9, 2022
4094c09
FIX Remove validation from __init__ and set_params for ColumnTransfor…
iofall Mar 9, 2022
085e6c4
DOC: use notebook-style for plot_mean_shift.py (#22713)
gaborberei Mar 9, 2022
3e0acd8
DOC Fix typo in _kmeans.py (#22736)
w4bo Mar 9, 2022
836caae
DOC Ensures that function passes numpydoc validation: f1_score (#22358)
NumberPiOso Mar 9, 2022
8271a22
FIX Fixes KBinsDiscretizer for encode=ordinal (#22735)
thomasjpfan Mar 9, 2022
e2c3c0a
Add Blog to top menu bar (#22737)
francoisgoupil Mar 9, 2022
65245bf
DOC Ensures that sklearn.metrics._ranking.average_precision_score pas…
danifernandes-hub Mar 10, 2022
770f12c
DOC Increase execution speed of plot_cluster_comparison.py (#21624)
Iglesys347 Mar 10, 2022
205a610
FIX Correct fac multiplier in LDA (#22696)
Micky774 Mar 10, 2022
8f6464a
DOC Ensures ledoit_wolf_shrinkage passes numpydoc (#22748)
AmarCodes-22 Mar 10, 2022
90ed6dc
DOC make plot_permutation_importance example run on VS code (#22752)
francoisgoupil Mar 10, 2022
40851ce
MAINT Refactor the common logic for GEMM in wrapper (#22719)
jjerphan Mar 11, 2022
c519cc3
CLN clean _preprocess_data in linear_model (#22762)
lorentzenchr Mar 11, 2022
c802a53
TST check sparse dense equality for Lasso and ElasticNet (#22767)
lorentzenchr Mar 11, 2022
cf1b2a0
DOC Fix link to the minimal reproducible example (#22772)
jeremiedbb Mar 11, 2022
54432ce
Simplify conda installation. (#22771)
cmarmo Mar 11, 2022
4c01ad9
Fixed sklearn.model_selection._split.check_cv docstring in #21350 (#2…
chagaz Mar 12, 2022
256ccdf
DOC updated to notebook style for grid_search_text_feature_extraction…
brendo-k Mar 12, 2022
4e79672
fix docstring r-regression (#22785)
cdrig Mar 12, 2022
5dee1a6
DOC Ensures that paired_euclidean_distances passes numpydoc validatio…
verakye Mar 12, 2022
35061c9
DOC Fix docstring in sklearn.metrics._ranking.label_ranking_loss (#22…
fatimazahraegasmi Mar 12, 2022
fbb8832
DOC Fix docstring in quantile_transform function #21350 (#22780)
sakinaOuisrani Mar 12, 2022
f7f31c1
add documentation to haversine_distances (#22791)
marenwestermann Mar 12, 2022
d504561
fix(doc): fix sklearn.linear_model._ridge.ridge_regression (#22788)
MarieSacksick Mar 12, 2022
57c7063
DOC adding numpydoc to the development dependencies (#22792)
MarieSacksick Mar 12, 2022
049db46
DOC make_regression sample_generator (#22784)
mathieu-sabatier Mar 12, 2022
22300db
DOC Ensures that check_random_state passes numpydoc validation (#22787)
verakye Mar 12, 2022
b7cecf6
fix docstrings on preprocessing._data.normalize (#22795)
ducanne Mar 12, 2022
4c0ad08
chore(notebook_example): improve examples/cluster/plot_feature_agglom…
MarieSacksick Mar 12, 2022
af6b957
fix docstring of dict_learning.sparse_encode and multiclass.check_cl…
sakinaOuisrani Mar 12, 2022
a4f6b8a
Changing docstring for binzrize (#22801)
magalimorin18 Mar 12, 2022
902bb9c
Improved display of function docstring (#21247)
uditgt Mar 13, 2022
eb4be0c
DOC Use notebook style for plot_lasso_dense_vs_sparse_data (#22789)
chagaz Mar 13, 2022
5e5a6d9
DOC changed some typo of _shrunk_covariance.ledoit_wolf_shrinkage (#2…
victoirelouis Mar 13, 2022
260ed5d
fix sklearn.datasets._samples_generator.make_multilabel_classificatio…
mathieu-sabatier Mar 13, 2022
1b1e8f8
DOC fix numpydoc errors in classification_report (#22803)
WeijiaDu Mar 13, 2022
2c71eb6
DOC fix the docstring of sklearn.datasets._samples_generator.make_bic…
cdrig Mar 13, 2022
504e1de
FIX DBSCAN and TSNE are missing the pairwise estimator tag (#22814)
Mar 13, 2022
ed87b21
DOC fix typo in contributing guide (#22815)
ahmadjubair33 Mar 13, 2022
103e89d
MNT Replace if_delegate_has_method with available_if in feature_selec…
jackzyliu Mar 13, 2022
9a11c6d
TST Replaces pytest.warns in test_affinity_propagation (#22819)
ShanDeng123 Mar 13, 2022
3fb76a5
FIX Fixed `self.n_components` typo in `kernal_pca` (#22812)
Micky774 Mar 13, 2022
8e370c2
DOC: document classification example more readable (#22820)
GaelVaroquaux Mar 14, 2022
0926dfa
TST Convert warnings into errors in test_affinity_propgation (#22824)
thomasjpfan Mar 14, 2022
f9e036d
TST introducing the random_seed fixture (#22749)
ogrisel Mar 14, 2022
91ece8b
MNT Replace if_delegate_has_method with available_if in ensemble and …
jackzyliu Mar 14, 2022
86d9cbc
DOC corrected docstring on make_classification (#22797)
DeaMariaLeon Mar 14, 2022
346ddd9
ENH Add inverse_transform to random projection transformers (#21701)
ageron Mar 14, 2022
511d232
add num_threads in kmeans init_bounds (#22773)
jeremiedbb Mar 14, 2022
77ebe59
ENH Adds infrequent categories to OneHotEncoder (#16018)
thomasjpfan Mar 14, 2022
630cfb5
MNT Remove utf-8 encoding declarations (#21260)
DimitriPapadopoulos Mar 14, 2022
7362031
FIX Fixes visualization for nested meta-estimators (#21310)
thomasjpfan Mar 14, 2022
55eeb02
TST Replaces pytest.warns(None) in test_optics (#22831)
ShanDeng123 Mar 14, 2022
657c132
DOC Fix wrong link title for PCA in "Dim. red." (#22835)
jchazalon Mar 14, 2022
73b7795
TST Replaces pytest.warns(None) in test_pls (#22832)
ShanDeng123 Mar 15, 2022
1797108
DOC Fix the formatting for environment variables in docs (#22833)
ogrisel Mar 15, 2022
3771cbb
TST Replaces pytest.warns(None) in test_fastica (#22846)
ShanDeng123 Mar 15, 2022
be5c4d8
TST Replaces pytest.warns(None) in test_dict_learning (#22845)
ShanDeng123 Mar 15, 2022
c000059
TST Fixes global random seed with pytest-xdist (#22844)
thomasjpfan Mar 15, 2022
8fd4606
FIX Fix ColumnTransformer.get_feature_names_out with slices (#22775)
randomgeek78 Mar 15, 2022
cacd2a9
TST ensure that sklearn/_loss/tests/test_loss.py is seed insensitive …
ogrisel Mar 15, 2022
8abdbd7
DOC: Fix latex \max and not max in model selection (#22858)
agramfort Mar 16, 2022
7407d56
MNT Some clean-up in the random_projection module (#22761)
jeremiedbb Mar 16, 2022
f62a47a
ENH avoid unecessary memory copy in pdp (#21930)
glemaitre Mar 16, 2022
2a4f964
DOC Clarify the LS term in example (#22156)
JorgeFCS Mar 16, 2022
0001f65
TST Ensure that `sklearn/metrics/tests/test_pairwise_distances_reduct…
jjerphan Mar 17, 2022
eeab3d3
Removing pytest.warns(None) (#22873)
ShanDeng123 Mar 17, 2022
afe78e6
MAINT Fix plot_gallery warning when building docs (#22869)
thomasjpfan Mar 17, 2022
fb6b8ef
DOC update notebook-style example plot_cv_diabetes.py (#22740)
AmarCodes-22 Mar 17, 2022
5dd2372
MAINT Import from public SciPy in pubilc namespace (#22875)
thomasjpfan Mar 17, 2022
37633e0
DOC update notebook-style for plot_calibration.py (#22734)
AmarCodes-22 Mar 17, 2022
fb7135e
TST removing pytest.warns(None) in test_kernel_pca (#22872)
ShanDeng123 Mar 17, 2022
9ae357d
TST replace pytest.warns(None) in linear_model test_base.py (#22876)
glemaitre Apr 6, 2022
94a10ba
MAINT Open issue on tracker for pypy errors (#22870)
thomasjpfan Mar 17, 2022
787d23d
TST Replaces pytest.warns(None) in test_feature_agglomeration (#22871)
ShanDeng123 Mar 17, 2022
43b04b2
DOC Update notebook-style for example plot_image_denoising (#22739)
AmarCodes-22 Mar 17, 2022
25c307b
TST Add minimal setup to be able to run test suite on float32 (#22690)
jjerphan Mar 17, 2022
41cd366
API Deprecate if_delegate_has_method (#22830)
jeremiedbb Mar 17, 2022
80c2da1
MAINT `PairwiseDistancesReduction`: Do correctly warn on unused metri…
jjerphan Mar 17, 2022
0915b71
TST replace pytest.warns(None) in test_coordinate_descent.py (#22878)
richardt94 Mar 17, 2022
56b1874
Update plot_rbf_parameters.py (#22724)
zempleni Mar 17, 2022
5f1198f
DOC Use proper tags for get_feature_names_out in whats_new (#22883)
thomasjpfan Mar 17, 2022
abade87
MAINT Create a private extension for sorting utilities (#22760)
jjerphan Mar 17, 2022
aa3f2e8
TST Replace pytest.warns(None) in test_omp.py (#22886)
richardt94 Mar 18, 2022
6c5115b
MNT Removes externals._pilutil and uses Pillow directly (#22743)
thomasjpfan Mar 18, 2022
30fceba
DOC accelerate plot_kernel_ridge_regression.py (#21791)
melemo2 Mar 18, 2022
aaca3af
DOC use notebook-style for plot_bayesian_ridge.py (#22794)
fkaren27 Mar 18, 2022
6535407
format notebook plot_roc_crossval.py (#22799)
cdrig Mar 18, 2022
55b0cf3
DOC use notebook-style for plot_svm_anova.py (#22779)
JihaneBennis Mar 18, 2022
76abb9f
replace pytest.warns(None) in test_least_angle.py (#22889)
richardt94 Mar 18, 2022
4024241
TST remove pytest.warns(None) in test_logistic.py (#22877)
richardt94 Mar 18, 2022
26eadbf
TST Removes pytest.warns(None) in test_iforest (#22874)
ShanDeng123 Mar 18, 2022
fc914cb
DOC update notebook style for plot_lda_qda (#22528)
bijilsubhash Mar 18, 2022
30c0c2e
DOC use notebook-style for plot_sparse_cov.py (#22807)
fkaren27 Mar 18, 2022
905aa1a
FIX Fixes OneVsOneClassifier.predict for Estimators with only predict…
thomasjpfan Mar 18, 2022
a4704b4
FIX LinearRegression sparse + intercept + sample_weight (#22891)
jeremiedbb Mar 18, 2022
e29ea50
MNT remove sparse_lsqr from utils.fixes (#22894)
lorentzenchr Mar 18, 2022
94d7adc
DOC convert examples/cluster/plot_mini_batch_kmeans.py to notebook st…
jsilke Mar 19, 2022
28c99e6
API get_scorer returns a copy and introduce get_scorer_names (#22866)
adrinjalali Mar 19, 2022
fe6e5a7
ENH Adds better error message for GitHub in html repr (#22902)
thomasjpfan Mar 19, 2022
d5a32fa
DOC, MNT Typos found by codespell (#22906)
DimitriPapadopoulos Mar 20, 2022
5ea3eb1
MNT spelling fix (#22912)
DimitriPapadopoulos Mar 21, 2022
dc3a908
API Config: change default display to "diagram" (#22856)
jeremiedbb Mar 21, 2022
a5909c8
ENH Use simultaenous sort in tree splitter (#22868)
thomasjpfan Mar 21, 2022
eb7b0b1
DOC fetch_california_housing passes numpydoc validation (#22882)
DeaMariaLeon Mar 21, 2022
d2ebdf7
DOC Ensures that load_wine passes numpydoc (#22469)
sharmadharmpal Mar 21, 2022
b6e2ff9
DOC adds two dots from datasets._base.load_sample_image (#22805)
victoirelouis Mar 21, 2022
30aeff5
DOC Ensure that ledoit_wolf is passing numpydoc validation (#22496)
kungfudeuce Mar 21, 2022
8edf647
DOC Ensures that preprocessing._data.power_transform passes numpydoc …
ducanne Mar 21, 2022
7ba9a49
Doc make make_sparse_coded_signal pass numpydoc (#22817)
victoirelouis Mar 21, 2022
6f370f7
DOC Ensures that sklearn.metrics._plot.confusion_matrix.plot_confusio…
danifernandes-hub Mar 22, 2022
83b0641
DOC make min_max_axis pass numpydoc (#22839)
cdrig Mar 22, 2022
6b9b35f
ENH Allow `SelectFromModel`'s `max_features` to accept callables (#22…
Micky774 Mar 22, 2022
0ad4a7d
Update plot_label_propagation_digits.py (#22725)
zempleni Mar 22, 2022
01ce0ae
DOC Update plot_label_propagation_structure.py to notebook style (#22…
zempleni Mar 22, 2022
4b0f058
MNT Replace pytest.warns(None) in test_ridge.py (#22917)
richardt94 Mar 22, 2022
a99e4f9
API Add data_transposed argument and warning to make_sparse_coded_sig…
g4brielvs Mar 22, 2022
76ac5dd
MAINT use the default CPU_COUNT=2 for the macOS builds (#22919)
ogrisel Mar 22, 2022
d096e0b
MNT Clean deprecation of dtype='numeric' + array of strings in check_…
jeremiedbb Mar 22, 2022
54d57e1
Fix Ridge sparse + sample_weight + intercept (#22899)
jeremiedbb Mar 22, 2022
09d7c3f
DOC make fetch_covtype pass numpydoc (#22918)
DeaMariaLeon Mar 22, 2022
3eacbcc
DOC improve phrasing precision recall (#22924)
Kaminyou Mar 23, 2022
cd95bf9
ENH Adds encoded_missing_value to OrdinalEncoder (#21988)
thomasjpfan Mar 23, 2022
c041b15
API Deprecate max_feature=`auto` for tree classes (#22476)
MaxwellLZH Mar 23, 2022
8f8992e
TST Fix test failing scipy nightly (#22935)
jeremiedbb Mar 24, 2022
b5d7c62
TST replace pytest.warns(None) in metrics/test_classification.py (#22…
danifernandes-hub Mar 24, 2022
2bd4508
BLD Monkeypatch windows build to stablize build (#22693)
thomasjpfan Mar 24, 2022
200466f
FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gau…
thomasjpfan Mar 24, 2022
1a8c291
TST removed pytest.warns(None) in test_data.py (#22938)
danifernandes-hub Mar 24, 2022
3a053e6
TST replace pytest.warns(None) in test_function_transformer.py (#22937)
danifernandes-hub Mar 24, 2022
bc15a31
MNT fix typo in tree test name (#22943)
lorentzenchr Mar 24, 2022
bd6e198
[MAINT] Separate unit tests in `test_tree.py` for pickling and min_im…
adam2392 Mar 25, 2022
b38f955
FIX Removes warning in HGBT when fitting on dataframes (#22908)
thomasjpfan Mar 25, 2022
601a8df
ENH add sample_weight to sparse coordinade descent (#22808)
lorentzenchr Mar 25, 2022
3f9c7b5
FIX Fix recall in multilabel classification when true labels are all …
varunagrawal Mar 25, 2022
75d9f34
DOC Update comms team (#22942)
GaelVaroquaux Mar 25, 2022
db1a0e6
TST use global_dtype in feature_selection/tests/test_mutual_info.py …
jjerphan Mar 25, 2022
7f8c1dd
DOC Makes Sphinx reference to Bunch a class (#22948)
thomasjpfan Mar 25, 2022
8c31ddc
DOC fix typo (#22958)
JosephSM Mar 26, 2022
089608c
TST replace pytest.warns(None) in preprocessing/test_common.py (#22936)
danifernandes-hub Mar 26, 2022
d3d0dc1
TST replace pytest.warns(None) in metrics/tests/test_pairwise.py (#22…
danifernandes-hub Mar 26, 2022
e8a95fe
TST Replace pytest.warns(None) in test_gpr.py (#22959)
TheisFerre Mar 26, 2022
ae57788
TST replace pytest.warns(None) in metrics/cluster/test_unsupervised.p…
danifernandes-hub Mar 27, 2022
cdadbd7
TST replace pytest.warns(None) in metrics/cluster/test_supervised.py …
danifernandes-hub Mar 27, 2022
f5bdd24
TST Replace pytest.warns(None) in test_gpc.py (#22960)
TheisFerre Mar 27, 2022
ad53c73
TST Replace pytest.warns(None) in utils/tests/test_utils.py (#22961)
danifernandes-hub Mar 27, 2022
1a40697
TST Replace pytest.warns(None) in manifold/tests/test_t_sne.py (#22963)
yiyangq Mar 27, 2022
49b353b
TST Replace pytest.warns(None) in feature_extraction/tests/test_text.…
yiyangq Mar 27, 2022
b9d0c8c
FIX make coef_ in PLS estimator consistent with linear models (#22016)
glemaitre Mar 28, 2022
d4c03d5
ENH Preserving dtypes for ICA (#22806)
JihaneBennis Mar 28, 2022
b624e9f
ENH migrate GLMs / TweedieRegressor to linear loss (#22548)
lorentzenchr Mar 28, 2022
d5dcd4b
MNT Update to black 22.3.0 to resolve click error (#22983)
thomasjpfan Mar 29, 2022
fabe7c4
FEA Add DecisionBoundaryDisplay (#16061)
thomasjpfan Mar 29, 2022
cf6d47b
MNT accelerate examples/kernel_approximation/plot_scalable_poly_kerne…
jsilke Mar 29, 2022
f6932e7
MAINT Refactor vector sentinel into utils (#22728)
thomasjpfan Mar 29, 2022
f270c8b
DOC Fixes nav bar by dynamically changing searchbar size (#22954)
thomasjpfan Mar 29, 2022
b779407
MNT Uses memoryviews in tree criterion (#22921)
thomasjpfan Mar 29, 2022
8a829f0
DOC no longer funded by sydney university (#22980)
jnothman Mar 29, 2022
7a40462
ENH enable LSQR solver with intercept term in Ridge with sparse input…
lorentzenchr Mar 29, 2022
70a0785
DOC Link directly developer docs in the navbar (#22550)
thomasjpfan Mar 29, 2022
86e18a8
[MRG] Refactor MiniBatchDictionaryLearning and add stopping criterion…
jeremiedbb Mar 30, 2022
cf4296e
DOC: fix typo (#22994)
Mar 30, 2022
2429258
TST use global_dtype in sklearn/neighbors/tests/test_neighbors.py (#2…
jjerphan Mar 30, 2022
a4db3ac
DOC Update notebook style for plot_bayesian_ridge_curvefit (#22916)
2357juan Mar 30, 2022
f2b73c2
DOC fix docstring of EllipticEnvelope.fit (#22997)
Ben3940 Mar 30, 2022
34addd1
MNT Fix pytest random seed pluging with vscode test discovery (#22976)
adrinjalali Mar 30, 2022
e29a6ef
TST tight and clean tests for Ridge (#22910)
lorentzenchr Mar 31, 2022
f0ce884
DOC Switch to gender neutral terms for sister function (#23003)
inclusive-coding-bot Mar 31, 2022
37985a8
FIX ColumnTransformer.get_feature_names_out with string slices (#22913)
randomgeek78 Mar 31, 2022
0d90b45
Rename triage team to contributor experience team (#22970)
jeremiedbb Mar 31, 2022
9c17d0b
DOC Ensures that homogeneity_score passes numpydoc validation (#23006)
aj-white Mar 31, 2022
b5870a9
CI use circleci artifact redirector GH action (#22991)
jeremiedbb Apr 1, 2022
820bf63
MNT remove artifact_path file now unused (#23012)
jeremiedbb Apr 1, 2022
72e8d48
MAINT Convert OpenMP scheduling to 'static' in pairwise distances rad…
jjerphan Apr 1, 2022
eaacba6
DOC precise stopping criteria for coordinate descent
lorentzenchr Apr 1, 2022
9726d16
MNT Revert "DOC precise stopping criteria for coordinate descent"
lorentzenchr Apr 1, 2022
d228fd2
TST use global_dtype in sklearn/cluster/tests/test_optics.py (#22668)
jjerphan Apr 1, 2022
d88d886
TST use global_dtype in sklearn/manifold/tests/test_locally_linear.py…
jjerphan Apr 1, 2022
a4220d5
DOC Ensure completeness_score passes numpydoc validation (#23016)
aj-white Apr 1, 2022
78554d7
MNT ensure creation of dataset is deterministic in SGD (#19716)
PierreAttard Apr 2, 2022
988c811
TST replace pytest.warns(None) in test_label_propagation.py (#23010)
Ben3940 Apr 4, 2022
95a1808
TST Replace pytest.warns(None) in test_feature_select (#23041)
iasoon Apr 4, 2022
285db1b
TST remove pytest.warns(None) in test_svm.py (#23030)
iasoon Apr 4, 2022
0085147
TST remove pytest.warns(None) in utils/tests/test_validation.py (#23029)
MegaGonz Apr 4, 2022
e74b8a2
DOC Ensures that laplacian_kernel passes numpydoc validation (#23005)
gustavo-ren Apr 4, 2022
9bc0637
DOC Updates the MAxAbsScaler description in scalers example (#22951)
chalmerlowe Apr 4, 2022
66d659f
DOC Fixing a documentation issue on SVC parameter decision_function_s…
mehrdadmoradii Apr 5, 2022
5b561c9
FIX Feature Union: Checking if feautre union is fitted fails (#22953)
randomgeek78 Apr 5, 2022
c7809b2
CI Increases test time for pypy [pypy] (#23049)
thomasjpfan Apr 5, 2022
38c792d
DOC use notebook-style for plot_theilsen (#23002)
arthurmello Apr 6, 2022
d3fbe85
Merge remote-tracking branch 'origin/main' into pandas_arff_reader
glemaitre Apr 6, 2022
acc6b2a
iter
glemaitre Apr 6, 2022
257fc15
iter
glemaitre Apr 6, 2022
54daca0
change to warning
glemaitre Apr 6, 2022
23ced6b
update test_fetch_openml_consistency_parser
ogrisel Apr 6, 2022
b36ea90
Finer assertions in test_fetch_openml_consistency_parser
ogrisel Apr 7, 2022
94c9683
Trigger [doc build]
ogrisel Apr 7, 2022
d668bac
Merge remote-tracking branch 'origin/main' into pandas_arff_reader
glemaitre Apr 26, 2022
b743729
Merge remote-tracking branch 'origin/main' into pandas_arff_reader
glemaitre Apr 27, 2022
652d464
Apply suggestions from code review
glemaitre Apr 27, 2022
14a2890
Merge remote-tracking branch 'origin/main' into pandas_arff_reader
glemaitre Apr 29, 2022
781cc1c
rename variable for consistency
glemaitre Apr 29, 2022
f30fb32
iter
glemaitre Apr 29, 2022
0adfe89
iter
glemaitre Apr 29, 2022
0f3fbe1
Merge remote-tracking branch 'glemaitre/pandas_arff_reader' into pand…
glemaitre Apr 29, 2022
e9f4dfc
fix
glemaitre Apr 29, 2022
4638469
iter
glemaitre Apr 29, 2022
b50cc07
Apply suggestions from code review
glemaitre May 2, 2022
6c71a0f
Update sklearn/datasets/_openml.py
glemaitre May 2, 2022
046e4e3
Update sklearn/datasets/tests/test_openml.py
glemaitre May 2, 2022
61ed115
iter
glemaitre May 2, 2022
58f7c6b
iter
glemaitre May 2, 2022
c835c11
Merge branch 'main' into pandas_arff_reader
glemaitre May 2, 2022
ae94ef6
remove _cast_frame
glemaitre May 2, 2022
c753e3b
update asv benchmark
glemaitre May 2, 2022
9ac3ad7
update lof and isolation forest bench
glemaitre May 2, 2022
8a0cab3
update mnist bench
glemaitre May 2, 2022
8a5c14d
update bench randomized svd
glemaitre May 2, 2022
67c06a7
update tsne bench
glemaitre May 2, 2022
e4afde4
add more documentation
glemaitre May 2, 2022
8fa26df
update all examples
glemaitre May 2, 2022
18859a2
Update sklearn/datasets/_openml.py
glemaitre May 3, 2022
5493907
Update sklearn/datasets/_openml.py
glemaitre May 3, 2022
60c3881
fix encoding
glemaitre May 3, 2022
1a68491
better doc
glemaitre May 3, 2022
5a71cd1
add new test for parser difference
glemaitre May 3, 2022
9ab3c74
Merge branch 'main' into pandas_arff_reader
glemaitre May 4, 2022
1d94eb2
Apply suggestions from code review
glemaitre May 12, 2022
90898ca
Update sklearn/datasets/tests/test_openml.py
glemaitre May 12, 2022
961b581
Merge remote-tracking branch 'origin/main' into pandas_arff_reader
glemaitre May 12, 2022
a599a2f
change version
glemaitre May 12, 2022
e010572
Merge remote-tracking branch 'origin/main' into pandas_arff_reader
glemaitre May 12, 2022
a8add98
Update doc/datasets/loading_other_datasets.rst
glemaitre May 12, 2022
Changes from all commits
4 changes: 3 additions & 1 deletion asv_benchmarks/benchmarks/datasets.py
@@ -59,7 +59,9 @@ def _20newsgroups_lowdim_dataset(n_components=100, ngrams=(1, 1), dtype=np.float

@M.cache
def _mnist_dataset(dtype=np.float32):
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = fetch_openml(
"mnist_784", version=1, return_X_y=True, as_frame=False, parser="pandas"
)
X = X.astype(dtype, copy=False)
X = MaxAbsScaler().fit_transform(X)

23 changes: 18 additions & 5 deletions benchmarks/bench_hist_gradient_boosting_adult.py
@@ -2,12 +2,15 @@
from time import time

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble._hist_gradient_boosting.utils import get_equivalent_estimator
from sklearn.preprocessing import OrdinalEncoder


parser = argparse.ArgumentParser()
@@ -47,22 +50,32 @@ def predict(est, data_test, target_test):
print(f"predicted in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")


data = fetch_openml(data_id=179, as_frame=False) # adult dataset
data = fetch_openml(data_id=179, as_frame=True, parser="pandas") # adult dataset
X, y = data.data, data.target

# Ordinal encode the categories to use the native support available in HGBDT
cat_columns = make_column_selector(dtype_include="category")(X)
preprocessing = make_column_transformer(
(OrdinalEncoder(), cat_columns),
remainder="passthrough",
verbose_feature_names_out=False,
)
X = pd.DataFrame(
preprocessing.fit_transform(X),
columns=preprocessing.get_feature_names_out(),
)

n_classes = len(np.unique(y))
n_features = X.shape[1]
n_categorical_features = len(data.categories)
n_categorical_features = len(cat_columns)
n_numerical_features = n_features - n_categorical_features
print(f"Number of features: {n_features}")
print(f"Number of categorical features: {n_categorical_features}")
print(f"Number of numerical features: {n_numerical_features}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Note: no need to use an OrdinalEncoder because categorical features are
# already clean
is_categorical = [name in data.categories for name in data.feature_names]
is_categorical = [True] * n_categorical_features + [False] * n_numerical_features
est = HistGradientBoostingClassifier(
loss="log_loss",
learning_rate=lr,
4 changes: 2 additions & 2 deletions benchmarks/bench_isolation_forest.py
@@ -64,9 +64,9 @@ def print_outlier_ratio(y):
y = dataset.target

if dat == "shuttle":
dataset = fetch_openml("shuttle")
dataset = fetch_openml("shuttle", as_frame=False, parser="pandas")
X = dataset.data
y = dataset.target
y = dataset.target.astype(np.int64)
X, y = sh(X, y, random_state=random_state)
# we remove data with label 4
# normal data are then those of class 1
4 changes: 2 additions & 2 deletions benchmarks/bench_lof.py
@@ -44,9 +44,9 @@
y = dataset.target

if dataset_name == "shuttle":
dataset = fetch_openml("shuttle")
dataset = fetch_openml("shuttle", as_frame=False, parser="pandas")
X = dataset.data
y = dataset.target
y = dataset.target.astype(np.int64)
# we remove data with label 4
# normal data are then those of class 1
s = y != 4
2 changes: 1 addition & 1 deletion benchmarks/bench_mnist.py
@@ -62,7 +62,7 @@ def load_data(dtype=np.float32, order="F"):
######################################################################
# Load dataset
print("Loading dataset...")
data = fetch_openml("mnist_784")
data = fetch_openml("mnist_784", as_frame=True, parser="pandas")
X = check_array(data["data"], dtype=dtype, order=order)
y = data["target"]

8 changes: 4 additions & 4 deletions benchmarks/bench_plot_randomized_svd.py
@@ -191,7 +191,7 @@ def get_data(dataset_name):
del row
del col
else:
X = fetch_openml(dataset_name).data
X = fetch_openml(dataset_name, parser="auto").data
return X


@@ -281,9 +281,9 @@ def svd_timing(
U, mu, V = randomized_svd(
X,
n_comps,
n_oversamples,
n_iter,
power_iteration_normalizer,
n_oversamples=n_oversamples,
n_iter=n_iter,
power_iteration_normalizer=power_iteration_normalizer,
random_state=random_state,
transpose=False,
)
2 changes: 1 addition & 1 deletion benchmarks/bench_tsne_mnist.py
@@ -35,7 +35,7 @@
def load_data(dtype=np.float32, order="C", shuffle=True, seed=0):
"""Load the data, then cache and memmap the train/test split"""
print("Loading dataset...")
data = fetch_openml("mnist_784")
data = fetch_openml("mnist_784", as_frame=True, parser="pandas")

X = check_array(data["data"], dtype=dtype, order=order)
y = data["target"]
55 changes: 47 additions & 8 deletions doc/datasets/loading_other_datasets.rst
@@ -99,7 +99,7 @@ from the repository using the function
For example, to download a dataset of gene expressions in mice brains::

>>> from sklearn.datasets import fetch_openml
>>> mice = fetch_openml(name='miceprotein', version=4)
>>> mice = fetch_openml(name='miceprotein', version=4, parser="auto")

To fully specify a dataset, you need to provide a name and a version, though
the version is optional, see :ref:`openml_versions` below.
@@ -147,7 +147,7 @@ dataset on the openml website::

The ``data_id`` also uniquely identifies a dataset from OpenML::

>>> mice = fetch_openml(data_id=40966)
>>> mice = fetch_openml(data_id=40966, parser="auto")
>>> mice.details # doctest: +SKIP
{'id': '4550', 'name': 'MiceProtein', 'version': '1', 'format': 'ARFF',
'creator': ...,
@@ -171,8 +171,8 @@ which can contain entirely different datasets.
If a particular version of a dataset has been found to contain significant
issues, it might be deactivated. Using a name to specify a dataset will yield
the earliest version of a dataset that is still active. That means that
``fetch_openml(name="miceprotein")`` can yield different results at different
times if earlier versions become inactive.
``fetch_openml(name="miceprotein", parser="auto")`` can yield different results
at different times if earlier versions become inactive.
You can see that the dataset with ``data_id`` 40966 that we fetched above is
the first version of the "miceprotein" dataset::

@@ -182,19 +182,19 @@ the first version of the "miceprotein" dataset::
In fact, this dataset only has one version. The iris dataset on the other hand
has multiple versions::

>>> iris = fetch_openml(name="iris")
>>> iris = fetch_openml(name="iris", parser="auto")
>>> iris.details['version'] #doctest: +SKIP
'1'
>>> iris.details['id'] #doctest: +SKIP
'61'

>>> iris_61 = fetch_openml(data_id=61)
>>> iris_61 = fetch_openml(data_id=61, parser="auto")
>>> iris_61.details['version']
'1'
>>> iris_61.details['id']
'61'

>>> iris_969 = fetch_openml(data_id=969)
>>> iris_969 = fetch_openml(data_id=969, parser="auto")
>>> iris_969.details['version']
'3'
>>> iris_969.details['id']
@@ -212,7 +212,7 @@ binarized version of the data::
You can also specify both the name and the version, which also uniquely
identifies the dataset::

>>> iris_version_3 = fetch_openml(name="iris", version=3)
>>> iris_version_3 = fetch_openml(name="iris", version=3, parser="auto")
>>> iris_version_3.details['version']
'3'
>>> iris_version_3.details['id']
@@ -225,6 +225,45 @@ identifies the dataset::
machine learning" ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014.
<1407.7722>`

.. _openml_parser:

ARFF parser
~~~~~~~~~~~

From version 1.2, scikit-learn provides a new keyword argument `parser` that
selects how the ARFF files provided by OpenML are parsed. The legacy
parser (i.e. `parser="liac-arff"`) is based on the project
`LIAC-ARFF <https://github.com/renatopp/liac-arff>`_. This parser is, however,
slow and consumes more memory than required. A new parser based on pandas
(i.e. `parser="pandas"`) is both faster and more memory efficient.
However, this parser does not support sparse data.
Therefore, we recommend using `parser="auto"`, which uses the best parser
available for the requested dataset.
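
For example, a minimal usage sketch (an illustration only, assuming network
access to OpenML; `data_id=179` is the adult dataset used in the benchmarks
updated by this pull request)::

    from sklearn.datasets import fetch_openml

    # "auto" uses the pandas-based parser for dense datasets and falls back
    # to the LIAC-ARFF parser for sparse ones.
    adult = fetch_openml(data_id=179, as_frame=True, parser="auto")
    print(adult.frame.dtypes)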

The `"pandas"` and `"liac-arff"` parsers can lead to different data types in
the output. The notable differences are the following, and are illustrated in
the sketch after this list:

- The `"liac-arff"` parser always encodes categorical features as `str`
  objects. In contrast, the `"pandas"` parser infers the type while reading,
  and numerical categories are cast to integers whenever possible.
- The `"liac-arff"` parser uses float64 to encode numerical features tagged as
  'REAL' and 'NUMERICAL' in the metadata. The `"pandas"` parser instead infers
  whether these numerical features correspond to integers and, if so, uses
  pandas' Integer extension dtype.
- In particular, classification datasets with integer categories are typically
  loaded as such `(0, 1, ...)` with the `"pandas"` parser, while `"liac-arff"`
  forces the use of string-encoded class labels such as `"0"`, `"1"`, and so
  on.
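
A quick way to observe these differences is to load the same dataset with both
parsers and compare the resulting dtypes. The sketch below uses the iris
dataset (`data_id=61`, introduced above); the exact dtypes observed depend on
the dataset::

    from sklearn.datasets import fetch_openml

    iris_liac = fetch_openml(data_id=61, as_frame=True, parser="liac-arff")
    iris_pandas = fetch_openml(data_id=61, as_frame=True, parser="pandas")

    # The numerical columns and the target column may be typed differently.
    print(iris_liac.frame.dtypes)
    print(iris_pandas.frame.dtypes)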

In addition, when `as_frame=False` is used, the `"liac-arff"` parser returns
ordinally encoded data where the categories are provided in the attribute
`categories` of the `Bunch` instance. Instead, the `"pandas"` parser returns a
NumPy array in which the categories are not encoded. It is then up to the user
to design a feature engineering pipeline with an instance of `OneHotEncoder`
or `OrdinalEncoder`, typically wrapped in a `ColumnTransformer`, to preprocess
the categorical columns explicitly, as in the sketch below. See for instance:
:ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`.
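
A minimal sketch of such a pipeline, modeled on the adult benchmark updated in
this pull request (it uses `as_frame=True` so that categorical columns can be
selected by dtype; the dataset id and column handling are illustrative only)::

    import pandas as pd

    from sklearn.compose import make_column_selector, make_column_transformer
    from sklearn.datasets import fetch_openml
    from sklearn.preprocessing import OrdinalEncoder

    # data_id=179 is the adult dataset used in the benchmarks of this PR.
    X, y = fetch_openml(
        data_id=179, as_frame=True, return_X_y=True, parser="pandas"
    )

    # The pandas parser keeps categorical columns as pandas categories, so we
    # encode them explicitly before handing the data to an estimator.
    cat_columns = make_column_selector(dtype_include="category")(X)
    preprocessing = make_column_transformer(
        (OrdinalEncoder(), cat_columns),
        remainder="passthrough",
        verbose_feature_names_out=False,
    )
    X_encoded = pd.DataFrame(
        preprocessing.fit_transform(X),
        columns=preprocessing.get_feature_names_out(),
    )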

.. _external_datasets:

Loading from external datasets
4 changes: 2 additions & 2 deletions doc/whats_new/v1.1.rst
@@ -26,7 +26,7 @@ Changelog
classifier that always predicts the positive class: recall=100% and
precision=class balance.
:pr:`23214` by :user:`Stéphane Collot <stephanecollot>` and :user:`Max Baak <mbaak>`.

:mod:`sklearn.utils`
....................

@@ -208,7 +208,7 @@ Changelog
:pr:`23194` by `Thomas Fan`_.

- |Enhancement| Added an extension in doc/conf.py to automatically generate
the list of estimators that handle NaN values.
the list of estimators that handle NaN values.
:pr:`23198` by `Lise Kleiber <lisekleiber>`_, :user:`Zhehao Liu <MaxwellLZH>`
and :user:`Chiara Marmo <cmarmo>`.

Expand Down
14 changes: 14 additions & 0 deletions doc/whats_new/v1.2.rst
@@ -44,6 +44,20 @@ Changelog
- |Enhancement| :class:`cluster.Birch` now preserves dtype for `numpy.float32`
inputs. :pr:`22968` by `Meekail Zain <micky774>`.

:mod:`sklearn.datasets`
.......................

- |Enhancement| Introduce the new parameter `parser` in
  :func:`datasets.fetch_openml`. `parser="pandas"` makes it possible to use the
  very CPU- and memory-efficient `pandas.read_csv` parser to load dense
  ARFF-formatted dataset files. It is still possible to pass
  `parser="liac-arff"` to use the old LIAC parser.
  When `parser="auto"`, dense datasets are loaded with `"pandas"` and sparse
  datasets are loaded with `"liac-arff"`.
  Currently, `parser="liac-arff"` is the default and will change to
  `parser="auto"` in version 1.4.
  :pr:`21938` by :user:`Guillaume Lemaitre <glemaitre>`.

:mod:`sklearn.ensemble`
.......................

4 changes: 3 additions & 1 deletion examples/applications/plot_cyclical_feature_engineering.py
@@ -20,7 +20,9 @@
# We start by loading the data from the OpenML repository.
from sklearn.datasets import fetch_openml

bike_sharing = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True)
bike_sharing = fetch_openml(
"Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas"
)
df = bike_sharing.frame

# %%
2 changes: 1 addition & 1 deletion examples/applications/plot_digits_denoising.py
@@ -36,7 +36,7 @@
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

X, y = fetch_openml(data_id=41082, as_frame=False, return_X_y=True)
X, y = fetch_openml(data_id=41082, as_frame=False, return_X_y=True, parser="pandas")
X = MinMaxScaler().fit_transform(X)

# %%
4 changes: 3 additions & 1 deletion examples/compose/plot_column_transformer_mixed_types.py
@@ -43,7 +43,9 @@

# %%
# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X, y = fetch_openml(
"titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)

# Alternatively X and y can be obtained directly from the frame attribute:
# X = titanic.frame.drop('survived', axis=1)
2 changes: 1 addition & 1 deletion examples/compose/plot_transformed_target.py
@@ -128,7 +128,7 @@
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import QuantileTransformer, quantile_transform

ames = fetch_openml(name="house_prices", as_frame=True)
ames = fetch_openml(name="house_prices", as_frame=True, parser="pandas")
# Keep only numeric columns
X = ames.data.select_dtypes(np.number)
# Remove columns with NaN or Inf values
2 changes: 1 addition & 1 deletion examples/ensemble/plot_gradient_boosting_categorical.py
@@ -30,7 +30,7 @@
# are either categorical or numerical:
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True)
X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True, parser="pandas")

# Select only a subset of features of X to make the example faster to run
categorical_columns_subset = [
6 changes: 4 additions & 2 deletions examples/ensemble/plot_stack_predictors.py
@@ -45,7 +45,7 @@


def load_ames_housing():
df = fetch_openml(name="house_prices", as_frame=True)
df = fetch_openml(name="house_prices", as_frame=True, parser="pandas")
X = df.data
y = df.target

@@ -117,7 +117,9 @@ def load_ames_housing():
from sklearn.preprocessing import OrdinalEncoder

cat_tree_processor = OrdinalEncoder(
handle_unknown="use_encoded_value", unknown_value=-1
handle_unknown="use_encoded_value",
unknown_value=-1,
encoded_missing_value=-2,
)
num_tree_processor = SimpleImputer(strategy="mean", add_indicator=True)

2 changes: 1 addition & 1 deletion examples/gaussian_process/plot_gpr_co2.py
@@ -36,7 +36,7 @@
# in OpenML.
from sklearn.datasets import fetch_openml

co2 = fetch_openml(data_id=41187, as_frame=True)
co2 = fetch_openml(data_id=41187, as_frame=True, parser="pandas")
co2.frame.head()

# %%
@@ -46,7 +46,7 @@

from sklearn.datasets import fetch_openml

survey = fetch_openml(data_id=534, as_frame=True)
survey = fetch_openml(data_id=534, as_frame=True, parser="pandas")

# %%
# Then, we identify features `X` and targets `y`: the column WAGE is our
4 changes: 3 additions & 1 deletion examples/inspection/plot_permutation_importance.py
@@ -43,7 +43,9 @@
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X, y = fetch_openml(
"titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)
rng = np.random.RandomState(seed=42)
X["random_cat"] = rng.randint(3, size=X.shape[0])
X["random_num"] = rng.randn(X.shape[0])
@@ -56,7 +56,7 @@
from sklearn.datasets import fetch_openml


df = fetch_openml(data_id=41214, as_frame=True).frame
df = fetch_openml(data_id=41214, as_frame=True, parser="pandas").frame
df

# %%
2 changes: 1 addition & 1 deletion examples/linear_model/plot_sgd_early_stopping.py
@@ -59,7 +59,7 @@
def load_mnist(n_samples=None, class_0="0", class_1="8"):
"""Load MNIST, select two classes, shuffle and return only n_samples."""
# Load data from http://openml.org/d/554
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
mnist = fetch_openml("mnist_784", version=1, as_frame=False, parser="pandas")

# take only two classes for binary classification
mask = np.logical_or(mnist.target == class_0, mnist.target == class_1)