From 271f33935830a87dfcf9d724de53ef6ac15c8430 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 30 Jun 2017 12:38:33 +1000 Subject: [PATCH 01/19] DOC cleaning up what's new for 0.19 --- doc/whats_new.rst | 311 ++++++++++++++++++++++++++++------------------ 1 file changed, 189 insertions(+), 122 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index d367c627c27c4..ec8709a3c8e66 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -10,6 +10,28 @@ Version 0.19 **In Development** +Highlights +---------- + +TODO: + +This release includes a number of great new features including Local Outlier Factor for anomaly detection, QuantileTransformer for robust feature transformation, and ClassifierChain to simply account for dependencies between classes in multilabel problems. + +Pipeline caching makes grid search over pipelines including slow transformations much more efficient. + +Multinomial logistic regression with L1 loss. + +?Rewrite of TSNE + +Multi-metric grid search and cross validation + +Major deprecations +------------------ + +TODO + +We have deprecated RandomizedLasso and RandomizedLogisticRegression and LSHForest because they weren't appropriate or up to standards. We have deprecated a number of utilities no longer necessary now that we require Scipy 0.13.3 and Numpy 1.8.2 at a minimum. + Changed models -------------- @@ -19,6 +41,7 @@ occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures. * :class:`sklearn.ensemble.IsolationForest` (bug fix) + * TODO Details are listed in the changelog below. @@ -31,32 +54,17 @@ Changelog New features ............ - - Added :class:`multioutput.ClassifierChain` for multi-label - classification. By `Adam Kleczewski `_. +Configuration - Validation that input data contains no NaN or inf can now be suppressed using :func:`config_context`, at your own risk. This will save on runtime, and may be particularly useful for prediction time. :issue:`7548` by `Joel Nothman`_. - - Added the :class:`neighbors.LocalOutlierFactor` class for anomaly - detection based on nearest neighbors. - :issue:`5279` by `Nicolas Goix`_ and `Alexandre Gramfort`_. - - - The new solver ``'mu'`` implements a Multiplicate Update in - :class:`decomposition.NMF`, allowing the optimization of all - beta-divergences, including the Frobenius norm, the generalized - Kullback-Leibler divergence and the Itakura-Saito divergence. - :issue:`5295` by `Tom Dupre la Tour`_. - - - Added the :class:`model_selection.RepeatedKFold` and - :class:`model_selection.RepeatedStratifiedKFold`. - :issue:`8120` by `Neeraj Gangwar`_. +Classifiers and regressors - - Added :func:`metrics.mean_squared_log_error`, which computes - the mean square error of the logarithmic transformation of targets, - particularly useful for targets with an exponential trend. - :issue:`7655` by :user:`Karan Desai `. + - Added :class:`multioutput.ClassifierChain` for multi-label + classification. By `Adam Kleczewski `_. - Added solver ``'saga'`` that implements the improved version of Stochastic Average Gradient, in :class:`linear_model.LogisticRegression` and @@ -65,6 +73,12 @@ New features during the first epochs of ridge and logistic regression. :issue:`8446` by `Arthur Mensch`_. +Other estimators + + - Added the :class:`neighbors.LocalOutlierFactor` class for anomaly + detection based on nearest neighbors. + :issue:`5279` by `Nicolas Goix`_ and `Alexandre Gramfort`_. 
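A minimal usage sketch for the :class:`neighbors.LocalOutlierFactor` entry above; the toy data, ``n_neighbors=2`` and the attributes shown are illustrative assumptions rather than part of the changelog entry::

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    # Three nearby points plus one obvious outlier, purely for illustration.
    X = np.array([[-1.1], [0.2], [0.3], [101.0]])
    lof = LocalOutlierFactor(n_neighbors=2)
    labels = lof.fit_predict(X)            # -1 marks outliers, 1 marks inliers
    scores = lof.negative_outlier_factor_  # lower values mean more abnormal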
+ - Added :class:`preprocessing.QuantileTransformer` class and :func:`preprocessing.quantile_transform` function for features normalization based on quantiles. @@ -72,47 +86,33 @@ New features :user:`Guillaume Lemaitre `, `Olivier Grisel`_, `Raghav RV`_, :user:`Thierry Guillemot `, and `Gael Varoquaux`_. + - The new solver ``'mu'`` implements a Multiplicate Update in + :class:`decomposition.NMF`, allowing the optimization of all + beta-divergences, including the Frobenius norm, the generalized + Kullback-Leibler divergence and the Itakura-Saito divergence. + :issue:`5295` by `Tom Dupre la Tour`_. + +Model selection and evaluation + + - Added :func:`metrics.mean_squared_log_error`, which computes + the mean square error of the logarithmic transformation of targets, + particularly useful for targets with an exponential trend. + :issue:`7655` by :user:`Karan Desai `. + - Added :func:`metrics.dcg_score` and :func:`metrics.ndcg_score`, which compute Discounted cumulative gain (DCG) and Normalized discounted cumulative gain (NDCG). :issue:`7739` by :user:`David Gasquez `. -Enhancements -............ - - - :func:`metrics.matthews_corrcoef` now support multiclass classification. - :issue:`8094` by :user:`Jon Crall `. - - Update Sphinx-Gallery from 0.1.4 to 0.1.7 for resolving links in - documentation build with Sphinx>1.5 :issue:`8010`, :issue:`7986` by - :user:`Oscar Najera ` - - :class:`multioutput.MultiOutputRegressor` and :class:`multioutput.MultiOutputClassifier` - now support online learning using `partial_fit`. - issue: `8053` by :user:`Peng Yu `. - - :class:`pipeline.Pipeline` allows to cache transformers - within a pipeline by using the ``memory`` constructor parameter. - :issue:`7990` by :user:`Guillaume Lemaitre `. - - - :class:`decomposition.PCA`, :class:`decomposition.IncrementalPCA` and - :class:`decomposition.TruncatedSVD` now expose the singular values - from the underlying SVD. They are stored in the attribute - ``singular_values_``, like in :class:`decomposition.IncrementalPCA`. + - Added the :class:`model_selection.RepeatedKFold` and + :class:`model_selection.RepeatedStratifiedKFold`. + :issue:`8120` by `Neeraj Gangwar`_. - - :class:`cluster.MiniBatchKMeans` and :class:`cluster.KMeans` - now uses significantly less memory when assigning data points to their - nearest cluster center. :issue:`7721` by :user:`Jon Crall `. - - Added ``classes_`` attribute to :class:`model_selection.GridSearchCV`, - :class:`model_selection.RandomizedSearchCV`, :class:`grid_search.GridSearchCV`, - and :class:`grid_search.RandomizedSearchCV` that matches the ``classes_`` - attribute of ``best_estimator_``. :issue:`7661` and :issue:`8295` - by :user:`Alyssa Batula `, :user:`Dylan Werner-Meier `, - and :user:`Stephen Hoover `. +Enhancements +............ - - Relax assumption on the data for the - :class:`kernel_approximation.SkewedChi2Sampler`. Since the Skewed-Chi2 - kernel is defined on the open interval :math:`(-skewedness; +\infty)^d`, - the transform function should not check whether ``X < 0`` but whether ``X < - -self.skewedness``. :issue:`7573` by :user:`Romain Brault `. +Trees and ensembles - The ``min_weight_fraction_leaf`` constraint in tree construction is now more efficient, taking a fast path to declare a node a leaf if its weight @@ -120,33 +120,16 @@ Enhancements different from previous versions where ``min_weight_fraction_leaf`` is used. :issue:`7441` by :user:`Nelson Liu `. 
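The :class:`preprocessing.QuantileTransformer` entry earlier in this section can be sketched as follows; the lognormal toy feature and the ``output_distribution='normal'`` option are assumptions made for illustration::

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.RandomState(0)
    X = rng.lognormal(size=(1000, 1))   # a heavy-tailed synthetic feature
    qt = QuantileTransformer(output_distribution='normal', random_state=0)
    X_mapped = qt.fit_transform(X)      # roughly Gaussian after mapping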
- - Added ``average`` parameter to perform weights averaging in - :class:`linear_model.PassiveAggressiveClassifier`. :issue:`4939` - by :user:`Andrea Esuli `. - - - Custom metrics for the :mod:`sklearn.neighbors` binary trees now have - fewer constraints: they must take two 1d-arrays and return a float. - :issue:`6288` by `Jake Vanderplas`_. - - :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor` now support sparse input for prediction. :issue:`6101` by :user:`Ibraim Ganiev `. - - Added ``shuffle`` and ``random_state`` parameters to shuffle training - data before taking prefixes of it based on training sizes in - :func:`model_selection.learning_curve`. - :issue:`7506` by :user:`Narine Kokhlikyan `. - - - Added ``norm_order`` parameter to :class:`feature_selection.SelectFromModel` - to enable selection of the norm order when ``coef_`` is more than 1D. - :issue:`6181` by :user:`Antoine Wendlinger `. - - - Added ``sample_weight`` parameter to :meth:`pipeline.Pipeline.score`. - :issue:`7723` by :user:`Mikhail Korobov `. + - :class:`ensemble.VotingClassifier` now allow changing estimators by using + :meth:`ensemble.VotingClassifier.set_params`. Estimators can also be + removed by setting it to `None`. + :issue:`7674` by :user:`Yichuan Liu `. - - ``check_estimator`` now attempts to ensure that methods transform, predict, etc. - do not set attributes on the estimator. - :issue:`7533` by :user:`Ekaterina Krivich `. +Linear, kernelized and related models - :class:`linear_model.SGDClassifier`, :class:`linear_model.SGDRegressor`, :class:`linear_model.PassiveAggressiveClassifier`, @@ -157,10 +140,9 @@ Enhancements a ``n_iter_`` attribute, with actual number of iterations before convergence. By `Tom Dupre la Tour`_. - - For sparse matrices, :func:`preprocessing.normalize` with ``return_norm=True`` - will now raise a ``NotImplementedError`` with 'l1' or 'l2' norm and with - norm 'max' the norms returned will be the same as for dense matrices. - :issue:`7771` by `Ang Lu `_. + - Added ``average`` parameter to perform weight averaging in + :class:`linear_model.PassiveAggressiveClassifier`. :issue:`4939` + by :user:`Andrea Esuli `. - :class:`linear_model.RANSACRegressor` no longer throws an error when calling ``fit`` if no inliers are found in its first iteration. @@ -168,73 +150,118 @@ Enhancements attributes, ``n_skips_*``. :issue:`7914` by :user:`Michael Horrell `. - - :func:`model_selection.cross_val_predict` now returns output of the - correct shape for all values of the argument ``method``. - :issue:`7863` by :user:`Aman Dalmia `. - - - Fix a bug where :class:`feature_selection.SelectFdr` did not - exactly implement Benjamini-Hochberg procedure. It formerly may have - selected fewer features than it should. - :issue:`7490` by :user:`Peng Meng `. - - - Added ability to set ``n_jobs`` parameter to :func:`pipeline.make_union`. - A ``TypeError`` will be raised for any other kwargs. :issue:`8028` - by :user:`Alexander Booth `. - - - Added type checking to the ``accept_sparse`` parameter in - :mod:`sklearn.utils.validation` methods. This parameter now accepts only - boolean, string, or list/tuple of strings. ``accept_sparse=None`` is deprecated - and should be replaced by ``accept_sparse=False``. - :issue:`7880` by :user:`Josh Karnofsky `. - - - :class:`model_selection.GridSearchCV`, :class:`model_selection.RandomizedSearchCV` - and :func:`model_selection.cross_val_score` now allow estimators with callable - kernels which were previously prohibited. 
:issue:`8005` by `Andreas Müller`_ . - - - Added ability to use sparse matrices in :func:`feature_selection.f_regression` - with ``center=True``. :issue:`8065` by :user:`Daniel LeJeune `. + - Relax assumption on the data for the + :class:`kernel_approximation.SkewedChi2Sampler`. Since the Skewed-Chi2 + kernel is defined on the open interval :math:`(-skewedness; +\infty)^d`, + the transform function should not check whether ``X < 0`` but whether ``X < + -self.skewedness``. :issue:`7573` by :user:`Romain Brault `. - - Add ``sample_weight`` parameter to :func:`metrics.cohen_kappa_score`. - :issue:`8335` by :user:`Victor Poughon `. + - Custom metrics for the :mod:`neighbors` binary trees now have + fewer constraints: they must take two 1d-arrays and return a float. + :issue:`6288` by `Jake Vanderplas`_. - In :class:`gaussian_process.GaussianProcessRegressor`, method ``predict`` is a lot faster with ``return_std=True``. :issue:`8591` by :user:`Hadrien Bertrand `. - - Added ability to use sparse matrices in :func:`feature_selection.f_regression` - with ``center=True``. :issue:`8065` by :user:`Daniel LeJeune `. - - - :class:`ensemble.VotingClassifier` now allow changing estimators by using - :meth:`ensemble.VotingClassifier.set_params`. Estimators can also be - removed by setting it to `None`. - :issue:`7674` by :user:`Yichuan Liu `. - - - Prevent cast from float32 to float64 in + - Memory usage enhancement: Prevent cast from float32 to float64 in :class:`linear_model.LogisticRegression` when using newton-cg solver. :issue:`8835` by :user:`Joan Massich `. - - Prevent cast from float32 to float64 in + - Memory usage enhancement: Prevent cast from float32 to float64 in :class:`linear_model.Ridge` when using svd, sparse_cg, cholesky or lsqr solvers :class:`sklearn.linear_model.Ridge` when using svd, sparse_cg, cholesky or lsqr solvers by :user:`Joan Massich `, :user:`Nicolas Cordier ` - - Add ``max_train_size`` parameter to :class:`model_selection.TimeSeriesSplit` - :issue:`8282` by :user:`Aman Dalmia `. +Decomposition, manifold learning and clustering - - Make it possible to load a chunk of an svmlight formatted file by - passing a range of bytes to :func:`datasets.load_svmlight_file`. - :issue:`935` by :user:`Olivier Grisel `. + - :class:`cluster.MiniBatchKMeans` and :class:`cluster.KMeans` + now use significantly less memory when assigning data points to their + nearest cluster center. :issue:`7721` by :user:`Jon Crall `. + + - :class:`decomposition.PCA`, :class:`decomposition.IncrementalPCA` and + :class:`decomposition.TruncatedSVD` now expose the singular values + from the underlying SVD. They are stored in the attribute + ``singular_values_``, like in :class:`decomposition.IncrementalPCA`. + +Preprocessing and feature selection + + - Added ``norm_order`` parameter to :class:`feature_selection.SelectFromModel` + to enable selection of the norm order when ``coef_`` is more than 1D. + :issue:`6181` by :user:`Antoine Wendlinger `. + + - Added ability to use sparse matrices in :func:`feature_selection.f_regression` + with ``center=True``. :issue:`8065` by :user:`Daniel LeJeune `. - Small performance improvement to n-gram creation in :mod:`feature_extraction.text` by binding methods for loops and special-casing unigrams. :issue:`7567` by `Jaye Doepke ` +Model evaluation and meta-estimators + + - :class:`pipeline.Pipeline` allows to cache transformers + within a pipeline by using the ``memory`` constructor parameter. + :issue:`7990` by :user:`Guillaume Lemaitre `. 
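A hedged sketch of the transformer caching described in the :class:`pipeline.Pipeline` entry above; the temporary cache directory and the PCA/SVC steps are arbitrary choices for illustration::

    from tempfile import mkdtemp
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    cachedir = mkdtemp()
    # Fitted transformers are cached in ``cachedir``, so refitting with
    # unchanged PCA parameters (e.g. during a grid search over the SVC step)
    # can reuse the cached transformation instead of recomputing it.
    pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())], memory=cachedir)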
+ + - Added ``sample_weight`` parameter to :meth:`pipeline.Pipeline.score`. + :issue:`7723` by :user:`Mikhail Korobov `. + + - Added ability to set ``n_jobs`` parameter to :func:`pipeline.make_union`. + A ``TypeError`` will be raised for any other kwargs. :issue:`8028` + by :user:`Alexander Booth `. + + - :class:`model_selection.GridSearchCV`, :class:`model_selection.RandomizedSearchCV` + and :func:`model_selection.cross_val_score` now allow estimators with callable + kernels which were previously prohibited. :issue:`8005` by `Andreas Müller`_ . + + - :func:`model_selection.cross_val_predict` now returns output of the + correct shape for all values of the argument ``method``. + :issue:`7863` by :user:`Aman Dalmia `. + + - Added ``shuffle`` and ``random_state`` parameters to shuffle training + data before taking prefixes of it based on training sizes in + :func:`model_selection.learning_curve`. + :issue:`7506` by :user:`Narine Kokhlikyan `. + - Speed improvements to :class:`model_selection.StratifiedShuffleSplit`. :issue:`5991` by :user:`Arthur Mensch ` and `Joel Nothman`_. + - :class:`multioutput.MultiOutputRegressor` and :class:`multioutput.MultiOutputClassifier` + now support online learning using `partial_fit`. + issue: `8053` by :user:`Peng Yu `. + + - Add ``max_train_size`` parameter to :class:`model_selection.TimeSeriesSplit` + :issue:`8282` by :user:`Aman Dalmia `. + +Metrics + + - :func:`metrics.matthews_corrcoef` now support multiclass classification. + :issue:`8094` by :user:`Jon Crall `. + + - Add ``sample_weight`` parameter to :func:`metrics.cohen_kappa_score`. + :issue:`8335` by :user:`Victor Poughon `. + +Miscellaneous + + - :func:`utils.check_estimator` now attempts to ensure that methods transform, predict, etc. + do not set attributes on the estimator. + :issue:`7533` by :user:`Ekaterina Krivich `. + + - Added type checking to the ``accept_sparse`` parameter in + :mod:`sklearn.utils.validation` methods. This parameter now accepts only + boolean, string, or list/tuple of strings. ``accept_sparse=None`` is deprecated + and should be replaced by ``accept_sparse=False``. + :issue:`7880` by :user:`Josh Karnofsky `. + + - Make it possible to load a chunk of an svmlight formatted file by + passing a range of bytes to :func:`datasets.load_svmlight_file`. + :issue:`935` by :user:`Olivier Grisel `. + Bug fixes ......... +TODO + - :func:`metrics.average_precision_score` no longer linearly interpolates between operating points, and instead weighs precisions by the change in recall since the last operating point, as per the @@ -443,6 +470,38 @@ Bug fixes :class:`decomposition.IncrementalPCA`. :issue:`9105` by `Hanmin Qin `_. +Trees and ensembles +Linear, kernelized and related models +Decomposition, manifold learning and clustering +Preprocessing and feature selection + + - For sparse matrices, :func:`preprocessing.normalize` with ``return_norm=True`` + will now raise a ``NotImplementedError`` with 'l1' or 'l2' norm and with + norm 'max' the norms returned will be the same as for dense matrices. + :issue:`7771` by `Ang Lu `_. + + - Fix a bug where :class:`feature_selection.SelectFdr` did not + exactly implement Benjamini-Hochberg procedure. It formerly may have + selected fewer features than it should. + :issue:`7490` by :user:`Peng Meng `. 
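The ``max_train_size`` option added to :class:`model_selection.TimeSeriesSplit` above can be illustrated with a small sketch; the toy array and ``n_splits=3`` are illustrative assumptions::

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(10).reshape(10, 1)
    tscv = TimeSeriesSplit(n_splits=3, max_train_size=4)
    for train_index, test_index in tscv.split(X):
        # The training window never grows beyond four samples.
        print(train_index, test_index)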
+ + +Model evaluation and meta-estimators +Metrics +Miscellaneous + + - Added ``classes_`` attribute to :class:`model_selection.GridSearchCV`, + :class:`model_selection.RandomizedSearchCV`, :class:`grid_search.GridSearchCV`, + and :class:`grid_search.RandomizedSearchCV` that matches the ``classes_`` + attribute of ``best_estimator_``. :issue:`7661` and :issue:`8295` + by :user:`Alyssa Batula `, :user:`Dylan Werner-Meier `, + and :user:`Stephen Hoover `. + + - Update Sphinx-Gallery from 0.1.4 to 0.1.7 for resolving links in + documentation build with Sphinx>1.5 :issue:`8010`, :issue:`7986` by + :user:`Oscar Najera ` + + API changes summary ------------------- @@ -526,11 +585,11 @@ API changes summary - The ``n_topics`` parameter of :class:`decomposition.LatentDirichletAllocation` has been renamed to ``n_components`` and will be removed in version 0.21. - :issue:`8922` by :user:`Attractadore` + :issue:`8922` by :user:`Attractadore`. - :class:`cluster.bicluster.SpectralCoclustering` and :class:`cluster.bicluster.SpectralBiclustering` now accept ``y`` in fit. - :issue:`6126` by :user:ldirer + :issue:`6126` by :user:`Laurent Direr `. - :class:`neighbors.LSHForest` has been deprecated and will be removed in 0.21 due to poor performance. @@ -574,6 +633,14 @@ API changes summary :issue:`8174` by :user:`Tahar Zanouda `, `Alexandre Gramfort`_ and `Raghav RV`_. +Trees and ensembles +Linear, kernelized and related models +Decomposition, manifold learning and clustering +Preprocessing and feature selection +Model evaluation and meta-estimators +Metrics +Miscellaneous + .. _changes_0_18_1: From b504e9e4f6e22eee1d08bfbb87c21a6fa9c88154 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 30 Jun 2017 14:59:19 +1000 Subject: [PATCH 02/19] More cleaning up --- doc/whats_new.rst | 315 ++++++++++++++++++++++++---------------------- 1 file changed, 162 insertions(+), 153 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index ec8709a3c8e66..6d6d175154fce 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -23,6 +23,9 @@ Multinomial logistic regression with L1 loss. ?Rewrite of TSNE +Fix longstanding implementation erorr in average_precision_score + + Multi-metric grid search and cross validation Major deprecations @@ -226,6 +229,9 @@ Model evaluation and meta-estimators - Speed improvements to :class:`model_selection.StratifiedShuffleSplit`. :issue:`5991` by :user:`Arthur Mensch ` and `Joel Nothman`_. + - Add ``shuffle`` parameter to :func:`model_selection.train_test_split`. + :issue:`8845` by :user:`themrmax ` + - :class:`multioutput.MultiOutputRegressor` and :class:`multioutput.MultiOutputClassifier` now support online learning using `partial_fit`. issue: `8053` by :user:`Peng Yu `. @@ -260,110 +266,32 @@ Miscellaneous Bug fixes ......... -TODO - - - :func:`metrics.average_precision_score` no longer linearly - interpolates between operating points, and instead weighs precisions - by the change in recall since the last operating point, as per the - `Wikipedia entry `_. - (`#7356 `_). By - :user:`Nick Dingwall ` and `Gael Varoquaux`_. - - - Fixed a bug in :class:`covariance.MinCovDet` where inputting data - that produced a singular covariance matrix would cause the helper method - ``_c_step`` to throw an exception. - :issue:`3367` by :user:`Jeremy Steward ` +Trees and ensembles - Fixed a bug where :class:`ensemble.IsolationForest` uses an an incorrect formula for the average path length :issue:`8549` by `Peter Wang `_. 
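A minimal sketch of the new ``shuffle`` parameter of :func:`model_selection.train_test_split` mentioned above; the data and ``test_size`` are toy choices::

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(10).reshape(5, 2)
    y = np.arange(5)
    # shuffle=False keeps the original ordering, e.g. for time-ordered data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, shuffle=False)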
- - Fixed a bug where :class:`cluster.DBSCAN` gives incorrect - result when input is a precomputed sparse matrix with initial - rows all zero. :issue:`8306` by :user:`Akshay Gupta ` - - Fixed a bug where :class:`ensemble.AdaBoostClassifier` throws ``ZeroDivisionError`` while fitting data with single class labels. :issue:`7501` by :user:`Dominik Krzeminski `. - - Fixed a bug when :func:`datasets.make_classification` fails - when generating more than 30 features. :issue:`8159` by - :user:`Herilalaina Rakotoarison `. - - - Fixed a bug where :func:`model_selection.BaseSearchCV.inverse_transform` - returns ``self.best_estimator_.transform()`` instead of - ``self.best_estimator_.inverse_transform()``. - :issue:`8344` by :user:`Akshay Gupta `. - - - Fixed same issue in :func:`grid_search.BaseSearchCV.inverse_transform` - :issue:`8846` by :user:`Rasmus Eriksson ` - - - Fixed a bug where :class:`linear_model.RandomizedLasso` and - :class:`linear_model.RandomizedLogisticRegression` breaks for - sparse input. :issue:`8259` by :user:`Aman Dalmia `. - - - Fixed a bug where :func:`linear_model.RANSACRegressor.fit` may run until - ``max_iter`` if finds a large inlier group early. :issue:`8251` by :user:`aivision2020`. - - - Fixed a bug where :class:`sklearn.naive_bayes.MultinomialNB` and :class:`sklearn.naive_bayes.BernoulliNB` - failed when `alpha=0`. :issue:`5814` by :user:`Yichuan Liu ` and - :user:`Herilalaina Rakotoarison `. - - - Fixed a bug where :func:`datasets.make_moons` gives an - incorrect result when ``n_samples`` is odd. - :issue:`8198` by :user:`Josh Levy `. - - - Fixed a bug where :class:`linear_model.LassoLars` does not give - the same result as the LassoLars implementation available - in R (lars library). :issue:`7849` by :user:`Jair Montoya Martinez `. - - - Some ``fetch_`` functions in :mod:`sklearn.datasets` were ignoring the - ``download_if_missing`` keyword. :issue:`7944` by :user:`Ralf Gommers `. - - Fixed a bug in :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor` where a float being compared to ``0.0`` using ``==`` caused a divide by zero error. issue:`7970` by :user:`He Chen `. - - Fix a bug regarding fitting :class:`cluster.KMeans` with a sparse - array X and initial centroids, where X's means were unnecessarily being - subtracted from the centroids. :issue:`7872` by :user:`Josh Karnofsky `. - - - Fix estimators to accept a ``sample_weight`` parameter of type - ``pandas.Series`` in their ``fit`` function. :issue:`7825` by - `Kathleen Chen`_. - - - Fixed a bug where :class:`ensemble.IsolationForest` fails when - ``max_features`` is less than 1. - :issue:`5732` by :user:`Ishank Gulati `. - - - Fix a bug where :class:`ensemble.VotingClassifier` raises an error - when a numpy array is passed in for weights. :issue:`7983` by - :user:`Vincent Pham `. - - - Fix a bug in :class:`decomposition.LatentDirichletAllocation` - where the ``perplexity`` method was returning incorrect results because - the ``transform`` method returns normalized document topic distributions - as of version 0.18. :issue:`7954` by :user:`Gary Foreman `. - - Fix a bug where :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor` ignored the ``min_impurity_split`` parameter. :issue:`8006` by :user:`Sebastian Pölsterl `. - - Fixes to the input validation in :class:`covariance.EllipticEnvelope`. - :issue:`8086` by `Andreas Müller`_. 
- - - Fix output shape and bugs with n_jobs > 1 in - :class:`decomposition.SparseCoder` transform and - :func:`decomposition.sparse_encode` - for one-dimensional data and one component. - This also impacts the output shape of :class:`decomposition.DictionaryLearning`. - :issue:`8086` by `Andreas Müller`_. + - Fixed oob_score in :class:`ensemble.BaggingClassifier`. + :issue:`8936` by :user:`mlewis1729 ` - - Several fixes to input validation in - :class:`multiclass.OutputCodeClassifier` - :issue:`8086` by `Andreas Müller`_. + - Fixed a bug where :class:`ensemble.IsolationForest` fails when + ``max_features`` is less than 1. + :issue:`5732` by :user:`Ishank Gulati `. - Fix a bug where :class:`ensemble.gradient_boosting.QuantileLossFunction` computed @@ -371,108 +299,107 @@ TODO wrong values when calling ``__call__``. :issue:`8087` by :user:`Alexis Mignon ` - - Fix :func:`multioutput.MultiOutputClassifier.predict_proba` to - return a list of 2d arrays, rather than a 3d array. In the case where - different target columns had different numbers of classes, a `ValueError` - would be raised on trying to stack matrices with different dimensions. - :issue:`8093` by :user:`Peter Bull `. + - Fix a bug where :class:`ensemble.VotingClassifier` raises an error + when a numpy array is passed in for weights. :issue:`7983` by + :user:`Vincent Pham `. - - Fix a bug where :func:`linear_model.LassoLars.fit` sometimes - left `coef_` as a list, rather than an ndarray. - :issue:`8160` by :user:`CJ Carey `. + - Fixed a bug where :func:`tree.export_graphviz` raised an error + when the length of features_names does not match n_features in the decision + tree. :issue:`8512` by :user:`Li Li `. - - Fix a bug where :class:`feature_extraction.FeatureHasher` - mandatorily applied a sparse random projection to the hashed features, - preventing the use of - :class:`feature_extraction.text.HashingVectorizer` in a - pipeline with :class:`feature_extraction.text.TfidfTransformer`. - :issue:`7513` by :user:`Roman Yurchak `. +Linear, kernelized and related models - - Fix a bug in cases where ``numpy.cumsum`` may be numerically unstable, - raising an exception if instability is identified. :issue:`7376` and - :issue:`7331` by `Joel Nothman`_ and :user:`yangarbiter`. + - Fixed a bug where :func:`linear_model.RANSACRegressor.fit` may run until + ``max_iter`` if it finds a large inlier group early. :issue:`8251` by :user:`aivision2020`. - - Fix a bug where :meth:`base.BaseEstimator.__getstate__` - obstructed pickling customizations of child-classes, when used in a - multiple inheritance context. - :issue:`8316` by :user:`Holger Peters `. + - Fixed a bug where :class:`sklearn.naive_bayes.MultinomialNB` and :class:`sklearn.naive_bayes.BernoulliNB` + failed when `alpha=0`. :issue:`5814` by :user:`Yichuan Liu ` and + :user:`Herilalaina Rakotoarison `. - - Fix a bug in :func:`metrics.classification._check_targets` - which would return ``'binary'`` if ``y_true`` and ``y_pred`` were - both ``'binary'`` but the union of ``y_true`` and ``y_pred`` was - ``'multiclass'``. :issue:`8377` by `Loic Esteve`_. + - Fixed a bug where :class:`linear_model.LassoLars` does not give + the same result as the LassoLars implementation available + in R (lars library). :issue:`7849` by :user:`Jair Montoya Martinez `. 
+ + - Fixed a bug in :class:`linear_model.RandomizedLasso`, + :class:`linear_model.Lars`, :class:`linear_model.LassoLars`, + :class:`linear_model.LarsCV` and :class:`linear_model.LassoLarsCV`, + where the parameter ``precompute`` were not used consistently across + classes, and some values proposed in the docstring could raise errors. + :issue:`5359` by `Tom Dupre la Tour`_. + - Fix a bug where :func:`linear_model.LassoLars.fit` sometimes + left `coef_` as a list, rather than an ndarray. + :issue:`8160` by :user:`CJ Carey `. - Fix :func:`linear_model.BayesianRidge.fit` to return ridge parameter `alpha_` and `lambda_` consistent with calculated coefficients `coef_` and `intercept_`. :issue:`8224` by :user:`Peter Gedeck `. - - Fixed a bug in :class:`manifold.TSNE` where it stored the incorrect - ``kl_divergence_``. :issue:`6507` by :user:`Sebastian Saeger `. - - Fixed a bug in :class:`svm.OneClassSVM` where it returned floats instead of integer classes. :issue:`8676` by :user:`Vathsala Achar `. - - Fixed a bug where :func:`tree.export_graphviz` raised an error - when the length of features_names does not match n_features in the decision - tree. :issue:`8512` by :user:`Li Li `. - - - Fixed a bug in :class:`manifold.TSNE` affecting convergence of the - gradient descent. :issue:`8768` by :user:`David DeTomaso `. + - Fix AIC/BIC criterion computation in :class:`linear_model.LassoLarsIC`. + :issue:`9022` by `Alexandre Gramfort`_ and :user:`Mehmet Basbug `. - Fixed a memory leak in our LibLinear implementation. :issue:`9024` by :user:`Sergei Lebedev ` - - Fixed improper scaling in :class:`cross_decomposition.PLSRegression` - with ``scale=True``. :issue:`7819` by :user:`jayzed82 `. - - - Fixed oob_score in :class:`ensemble.BaggingClassifier`. - :issue:`8936` by :user:`mlewis1729 ` - - - Add ``shuffle`` parameter to :func:`model_selection.train_test_split`. - :issue:`8845` by :user:`themrmax ` - - - Fix AIC/BIC criterion computation in :class:`linear_model.LassoLarsIC`. - :issue:`9022` by `Alexandre Gramfort`_ and :user:`Mehmet Basbug `. - Fix bug where stratified CV splitters did not work with :class:`linear_model.LassoCV`. :issue:`8973` by :user:`Paulo Haddad `. - - Fixed a bug in :class:`linear_model.RandomizedLasso`, - :class:`linear_model.Lars`, :class:`linear_model.LassoLars`, - :class:`linear_model.LarsCV` and :class:`linear_model.LassoLarsCV`, - where the parameter ``precompute`` were not used consistently across - classes, and some values proposed in the docstring could raise errors. - :issue:`5359` by `Tom Dupre la Tour`_. - - - Fixed a bug where :func:`model_selection.validation_curve` - reused the same estimator for each parameter value. - :issue:`7365` by :user:`Aleksandr Sandrovskii `. - - - :class:`multiclass.OneVsOneClassifier`'s ``partial_fit`` now ensures all - classes are provided up-front. :issue:`6250` by - :user:`Asish Panda `. - - - Fixed an integer overflow bug in :func:`metrics.confusion_matrix` and - hence :func:`metrics.cohen_kappa_score`. :issue:`8354`, :issue:`7929` - by `Joel Nothman`_ and :user:`Jon Crall `. - - Fixed a bug in :class:`gaussian_process.GaussianProcessRegressor` when the standard deviation and covariance predicted without fit would fail with a unmeaningful error by default. :issue:`6573` by :user:`Quazi Marufur Rahman ` and `Manoj Kumar`_. 
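The :class:`gaussian_process.GaussianProcessRegressor` fix above (predicting before ``fit`` now raises a meaningful error) and the faster ``return_std=True`` path noted among the enhancements can be sketched as below; the default kernel and the toy data are assumptions for illustration::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    X = np.linspace(0, 10, 20).reshape(-1, 1)
    y = np.sin(X).ravel()
    gpr = GaussianProcessRegressor().fit(X, y)
    # return_std=True also returns the predictive standard deviation.
    mean, std = gpr.predict(X, return_std=True)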
+Decomposition, manifold learning and clustering + + - Fix a bug in :class:`decomposition.LatentDirichletAllocation` + where the ``perplexity`` method was returning incorrect results because + the ``transform`` method returns normalized document topic distributions + as of version 0.18. :issue:`7954` by :user:`Gary Foreman `. + + - Fix output shape and bugs with n_jobs > 1 in + :class:`decomposition.SparseCoder` transform and + :func:`decomposition.sparse_encode` + for one-dimensional data and one component. + This also impacts the output shape of :class:`decomposition.DictionaryLearning`. + :issue:`8086` by `Andreas Müller`_. + - Fixed the implementation of `explained_variance_` in :class:`decomposition.PCA`, :class:`decomposition.RandomizedPCA` and :class:`decomposition.IncrementalPCA`. :issue:`9105` by `Hanmin Qin `_. -Trees and ensembles -Linear, kernelized and related models -Decomposition, manifold learning and clustering + - Fixed a bug where :class:`cluster.DBSCAN` gives incorrect + result when input is a precomputed sparse matrix with initial + rows all zero. :issue:`8306` by :user:`Akshay Gupta ` + + - Fix a bug regarding fitting :class:`cluster.KMeans` with a sparse + array X and initial centroids, where X's means were unnecessarily being + subtracted from the centroids. :issue:`7872` by :user:`Josh Karnofsky `. + + - Fixes to the input validation in :class:`covariance.EllipticEnvelope`. + :issue:`8086` by `Andreas Müller`_. + + - Fixed a bug in :class:`covariance.MinCovDet` where inputting data + that produced a singular covariance matrix would cause the helper method + ``_c_step`` to throw an exception. + :issue:`3367` by :user:`Jeremy Steward ` + + - Fixed a bug in :class:`manifold.TSNE` affecting convergence of the + gradient descent. :issue:`8768` by :user:`David DeTomaso `. + + - Fixed a bug in :class:`manifold.TSNE` where it stored the incorrect + ``kl_divergence_``. :issue:`6507` by :user:`Sebastian Saeger `. + + - Fixed improper scaling in :class:`cross_decomposition.PLSRegression` + with ``scale=True``. :issue:`7819` by :user:`jayzed82 `. + Preprocessing and feature selection - For sparse matrices, :func:`preprocessing.normalize` with ``return_norm=True`` @@ -485,10 +412,24 @@ Preprocessing and feature selection selected fewer features than it should. :issue:`7490` by :user:`Peng Meng `. + - Fixed a bug where :class:`linear_model.RandomizedLasso` and + :class:`linear_model.RandomizedLogisticRegression` breaks for + sparse input. :issue:`8259` by :user:`Aman Dalmia `. + + - Fix a bug where :class:`feature_extraction.FeatureHasher` + mandatorily applied a sparse random projection to the hashed features, + preventing the use of + :class:`feature_extraction.text.HashingVectorizer` in a + pipeline with :class:`feature_extraction.text.TfidfTransformer`. + :issue:`7513` by :user:`Roman Yurchak `. + Model evaluation and meta-estimators -Metrics -Miscellaneous + + - Fixed a bug where :func:`model_selection.BaseSearchCV.inverse_transform` + returns ``self.best_estimator_.transform()`` instead of + ``self.best_estimator_.inverse_transform()``. + :issue:`8344` by :user:`Akshay Gupta ` and :user:`Rasmus Eriksson `. - Added ``classes_`` attribute to :class:`model_selection.GridSearchCV`, :class:`model_selection.RandomizedSearchCV`, :class:`grid_search.GridSearchCV`, @@ -497,6 +438,69 @@ Miscellaneous by :user:`Alyssa Batula `, :user:`Dylan Werner-Meier `, and :user:`Stephen Hoover `. 
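The ``classes_`` attribute added to the search estimators above can be used as sketched here; the iris data, the base estimator and the parameter grid are arbitrary illustrative choices::

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)
    search = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0]})
    search.fit(X, y)
    # Mirrors search.best_estimator_.classes_ once the search is fitted.
    print(search.classes_)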
+ - Fixed a bug where :func:`model_selection.validation_curve` + reused the same estimator for each parameter value. + :issue:`7365` by :user:`Aleksandr Sandrovskii `. + + - Several fixes to input validation in + :class:`multiclass.OutputCodeClassifier` + :issue:`8086` by `Andreas Müller`_. + + - :class:`multiclass.OneVsOneClassifier`'s ``partial_fit`` now ensures all + classes are provided up-front. :issue:`6250` by + :user:`Asish Panda `. + + - Fix :func:`multioutput.MultiOutputClassifier.predict_proba` to + return a list of 2d arrays, rather than a 3d array. In the case where + different target columns had different numbers of classes, a `ValueError` + would be raised on trying to stack matrices with different dimensions. + :issue:`8093` by :user:`Peter Bull `. + + +Metrics + + - :func:`metrics.average_precision_score` no longer linearly + interpolates between operating points, and instead weighs precisions + by the change in recall since the last operating point, as per the + `Wikipedia entry `_. + (`#7356 `_). By + :user:`Nick Dingwall ` and `Gael Varoquaux`_. + + - Fix a bug in :func:`metrics.classification._check_targets` + which would return ``'binary'`` if ``y_true`` and ``y_pred`` were + both ``'binary'`` but the union of ``y_true`` and ``y_pred`` was + ``'multiclass'``. :issue:`8377` by `Loic Esteve`_. + + - Fixed an integer overflow bug in :func:`metrics.confusion_matrix` and + hence :func:`metrics.cohen_kappa_score`. :issue:`8354`, :issue:`7929` + by `Joel Nothman`_ and :user:`Jon Crall `. + +Miscellaneous + + - Fixed a bug when :func:`datasets.make_classification` fails + when generating more than 30 features. :issue:`8159` by + :user:`Herilalaina Rakotoarison `. + + - Fixed a bug where :func:`datasets.make_moons` gives an + incorrect result when ``n_samples`` is odd. + :issue:`8198` by :user:`Josh Levy `. + + - Some ``fetch_`` functions in :mod:`sklearn.datasets` were ignoring the + ``download_if_missing`` keyword. :issue:`7944` by :user:`Ralf Gommers `. + + - Fix estimators to accept a ``sample_weight`` parameter of type + ``pandas.Series`` in their ``fit`` function. :issue:`7825` by + `Kathleen Chen`_. + + - Fix a bug in cases where ``numpy.cumsum`` may be numerically unstable, + raising an exception if instability is identified. :issue:`7376` and + :issue:`7331` by `Joel Nothman`_ and :user:`yangarbiter`. + + - Fix a bug where :meth:`base.BaseEstimator.__getstate__` + obstructed pickling customizations of child-classes, when used in a + multiple inheritance context. + :issue:`8316` by :user:`Holger Peters `. + - Update Sphinx-Gallery from 0.1.4 to 0.1.7 for resolving links in documentation build with Sphinx>1.5 :issue:`8010`, :issue:`7986` by :user:`Oscar Najera ` @@ -505,6 +509,11 @@ Miscellaneous API changes summary ------------------- + - The ``non_negative`` parameter in :class:`feature_extraction.FeatureHasher` + has been deprecated, and replaced with a more principled alternative, + ``alternate_sign``. + :issue:`7565` by :user:`Roman Yurchak `. + - Ensure that estimators' attributes ending with ``_`` are not set in the constructor but only in the ``fit`` method. 
Most notably, ensemble estimators (deriving from :class:`ensemble.BaseEnsemble`) From 24c742bde0a5752628c09f57eb8b363b2852e220 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 30 Jun 2017 15:10:07 +1000 Subject: [PATCH 03/19] More cleaning up --- doc/whats_new.rst | 156 ++++++++++++++++++++++++---------------------- 1 file changed, 80 insertions(+), 76 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 6d6d175154fce..c73ee6e4f78dd 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -423,7 +423,6 @@ Preprocessing and feature selection pipeline with :class:`feature_extraction.text.TfidfTransformer`. :issue:`7513` by :user:`Roman Yurchak `. - Model evaluation and meta-estimators - Fixed a bug where :func:`model_selection.BaseSearchCV.inverse_transform` @@ -456,7 +455,6 @@ Model evaluation and meta-estimators would be raised on trying to stack matrices with different dimensions. :issue:`8093` by :user:`Peter Bull `. - Metrics - :func:`metrics.average_precision_score` no longer linearly @@ -509,21 +507,23 @@ Miscellaneous API changes summary ------------------- - - The ``non_negative`` parameter in :class:`feature_extraction.FeatureHasher` - has been deprecated, and replaced with a more principled alternative, - ``alternate_sign``. - :issue:`7565` by :user:`Roman Yurchak `. +Trees and ensembles - - Ensure that estimators' attributes ending with ``_`` are not set - in the constructor but only in the ``fit`` method. Most notably, - ensemble estimators (deriving from :class:`ensemble.BaseEnsemble`) - now only have ``self.estimators_`` available after ``fit``. - :issue:`7464` by `Lars Buitinck`_ and `Loic Esteve`_. + - Gradient boosting base models are no longer estimators. By `Andreas Müller`_. - - All checks in ``utils.estimator_checks``, in particular - :func:`utils.estimator_checks.check_estimator` now accept estimator - instances. Most other checks do not accept - estimator classes any more. :issue:`9019` by `Andreas Müller`_. + - All tree based estimators now accept a ``min_impurity_decrease`` + parameter in lieu of the ``min_impurity_split``, which is now deprecated. + The ``min_impurity_decrease`` helps stop splitting the nodes in which + the weighted impurity decrease from splitting is no longer alteast + ``min_impurity_decrease``. :issue:`8449` by `Raghav RV`_. + +Linear, kernelized and related models + + - :class:`neighbors.LSHForest` has been deprecated and will be + removed in 0.21 due to poor performance. + :issue:`8996` by `Andreas Müller`_. + +Decomposition, manifold learning and clustering - Deprecate the ``doc_topic_distr`` argument of the ``perplexity`` method in :class:`decomposition.LatentDirichletAllocation` because the @@ -531,20 +531,30 @@ API changes summary needed for the perplexity calculation. :issue:`7954` by :user:`Gary Foreman `. - - Replace attribute ``named_steps`` ``dict`` to :class:`utils.Bunch` - in :class:`pipeline.Pipeline` to enable tab completion in interactive - environment. In the case conflict value on ``named_steps`` and ``dict`` - attribute, ``dict`` behavior will be prioritized. - :issue:`8481` by :user:`Herilalaina Rakotoarison `. + - The ``n_topics`` parameter of :class:`decomposition.LatentDirichletAllocation` + has been renamed to ``n_components`` and will be removed in version 0.21. + :issue:`8922` by :user:`Attractadore`. - - The :func:`multioutput.MultiOutputClassifier.predict_proba` - function used to return a 3d array (``n_samples``, ``n_classes``, - ``n_outputs``). 
In the case where different target columns had different - numbers of classes, a `ValueError` would be raised on trying to stack - matrices with different dimensions. This function now returns a list of - arrays where the length of the list is ``n_outputs``, and each array is - (``n_samples``, ``n_classes``) for that particular output. - :issue:`8093` by :user:`Peter Bull `. + - :class:`cluster.bicluster.SpectralCoclustering` and + :class:`cluster.bicluster.SpectralBiclustering` now accept ``y`` in fit. + :issue:`6126` by :user:`Laurent Direr `. + +Preprocessing and feature selection + + - :class:`feature_selection.SelectFromModel` now has a ``partial_fit`` + method only if the underlying estimator does. By `Andreas Müller`_. + + - :class:`feature_selection.SelectFromModel` now validates the ``threshold`` + parameter and sets the ``threshold_`` attribute during the call to + ``fit``, and no longer during the call to ``transform```, by `Andreas + Müller`_. + + - The ``non_negative`` parameter in :class:`feature_extraction.FeatureHasher` + has been deprecated, and replaced with a more principled alternative, + ``alternate_sign``. + :issue:`7565` by :user:`Roman Yurchak `. + +Model evaluation and meta-estimators - Deprecate the ``fit_params`` constructor input to the :class:`model_selection.GridSearchCV` and @@ -557,52 +567,42 @@ API changes summary :func:`model_selection.cross_val_predict`. :issue:`2879` by :user:`Stephen Hoover `. - - The ``decision_function`` output shape for binary classification in - :class:`multiclass.OneVsRestClassifier` and - :class:`multiclass.OneVsOneClassifier` is now ``(n_samples,)`` to conform - to scikit-learn conventions. :issue:`9100` by `Andreas Müller`_. - - - Gradient boosting base models are no longer estimators. By `Andreas Müller`_. - - - :class:`feature_selection.SelectFromModel` now validates the ``threshold`` - parameter and sets the ``threshold_`` attribute during the call to - ``fit``, and no longer during the call to ``transform```, by `Andreas - Müller`_. - - - :class:`feature_selection.SelectFromModel` now has a ``partial_fit`` - method only if the underlying estimator does. By `Andreas Müller`_. + - In version 0.21, the default behavior of splitters that use the + ``test_size`` and ``train_size`` parameter will change, such that + specifying ``train_size`` alone will cause ``test_size`` to be the + remainder. :issue:`7459` by :user:`Nelson Liu `. - :class:`multiclass.OneVsRestClassifier` now has a ``partial_fit`` method only if the underlying estimator does. By `Andreas Müller`_. - - Estimators with both methods ``decision_function`` and ``predict_proba`` - are now required to have a monotonic relation between them. The - method ``check_decision_proba_consistency`` has been added in - **sklearn.utils.estimator_checks** to check their consistency. - :issue:`7578` by :user:`Shubham Bhardwaj ` + - The ``decision_function`` output shape for binary classification in + :class:`multiclass.OneVsRestClassifier` and + :class:`multiclass.OneVsOneClassifier` is now ``(n_samples,)`` to conform + to scikit-learn conventions. :issue:`9100` by `Andreas Müller`_. - - In version 0.21, the default behavior of splitters that use the - ``test_size`` and ``train_size`` parameter will change, such that - specifying ``train_size`` alone will cause ``test_size`` to be the - remainder. :issue:`7459` by :user:`Nelson Liu `. + - The :func:`multioutput.MultiOutputClassifier.predict_proba` + function used to return a 3d array (``n_samples``, ``n_classes``, + ``n_outputs``). 
In the case where different target columns had different + numbers of classes, a `ValueError` would be raised on trying to stack + matrices with different dimensions. This function now returns a list of + arrays where the length of the list is ``n_outputs``, and each array is + (``n_samples``, ``n_classes``) for that particular output. + :issue:`8093` by :user:`Peter Bull `. - - All tree based estimators now accept a ``min_impurity_decrease`` - parameter in lieu of the ``min_impurity_split``, which is now deprecated. - The ``min_impurity_decrease`` helps stop splitting the nodes in which - the weighted impurity decrease from splitting is no longer alteast - ``min_impurity_decrease``. :issue:`8449` by `Raghav RV`_. + - Replace attribute ``named_steps`` ``dict`` to :class:`utils.Bunch` + in :class:`pipeline.Pipeline` to enable tab completion in interactive + environment. In the case conflict value on ``named_steps`` and ``dict`` + attribute, ``dict`` behavior will be prioritized. + :issue:`8481` by :user:`Herilalaina Rakotoarison `. - - The ``n_topics`` parameter of :class:`decomposition.LatentDirichletAllocation` - has been renamed to ``n_components`` and will be removed in version 0.21. - :issue:`8922` by :user:`Attractadore`. +Metrics - - :class:`cluster.bicluster.SpectralCoclustering` and - :class:`cluster.bicluster.SpectralBiclustering` now accept ``y`` in fit. - :issue:`6126` by :user:`Laurent Direr `. +Miscellaneous - - :class:`neighbors.LSHForest` has been deprecated and will be - removed in 0.21 due to poor performance. - :issue:`8996` by `Andreas Müller`_. + - Deprecate the ``y`` parameter in `transform` and `inverse_transform`. + The method should not accept ``y`` parameter, as it's used at the prediction time. + :issue:`8174` by :user:`Tahar Zanouda `, `Alexandre Gramfort`_ + and `Raghav RV`_. - SciPy >= 0.13.3 and NumPy >= 1.8.2 are now the minimum supported versions for scikit-learn. The following backported functions in @@ -637,18 +637,22 @@ API changes summary - ``utils.stats.rankdata`` - ``neighbors.approximate.LSHForest`` - - Deprecate the ``y`` parameter in `transform` and `inverse_transform`. - The method should not accept ``y`` parameter, as it's used at the prediction time. - :issue:`8174` by :user:`Tahar Zanouda `, `Alexandre Gramfort`_ - and `Raghav RV`_. + - Estimators with both methods ``decision_function`` and ``predict_proba`` + are now required to have a monotonic relation between them. The + method ``check_decision_proba_consistency`` has been added in + **sklearn.utils.estimator_checks** to check their consistency. + :issue:`7578` by :user:`Shubham Bhardwaj ` -Trees and ensembles -Linear, kernelized and related models -Decomposition, manifold learning and clustering -Preprocessing and feature selection -Model evaluation and meta-estimators -Metrics -Miscellaneous + - All checks in ``utils.estimator_checks``, in particular + :func:`utils.estimator_checks.check_estimator` now accept estimator + instances. Most other checks do not accept + estimator classes any more. :issue:`9019` by `Andreas Müller`_. + + - Ensure that estimators' attributes ending with ``_`` are not set + in the constructor but only in the ``fit`` method. Most notably, + ensemble estimators (deriving from :class:`ensemble.BaseEnsemble`) + now only have ``self.estimators_`` available after ``fit``. + :issue:`7464` by `Lars Buitinck`_ and `Loic Esteve`_. .. 
_changes_0_18_1: From 187ee22942c1f0168d08e0236e1d45ba4ca598dd Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Sat, 1 Jul 2017 20:37:14 +1000 Subject: [PATCH 04/19] Deprecations --- doc/modules/classes.rst | 14 ++++++++++++-- doc/whats_new.rst | 13 ++++++------- 2 files changed, 18 insertions(+), 9 deletions(-) diff --git a/doc/modules/classes.rst b/doc/modules/classes.rst index 5399e27ef4d08..09dd288a85dc0 100644 --- a/doc/modules/classes.rst +++ b/doc/modules/classes.rst @@ -723,8 +723,6 @@ Kernels: linear_model.PassiveAggressiveClassifier linear_model.PassiveAggressiveRegressor linear_model.Perceptron - linear_model.RandomizedLasso - linear_model.RandomizedLogisticRegression linear_model.RANSACRegressor linear_model.Ridge linear_model.RidgeClassifier @@ -1391,6 +1389,18 @@ Recently deprecated =================== +To be removed in 0.21 +--------------------- + +.. autosummary:: + :toctree: generated/ + :template: deprecated_class.rst + + linear_model.RandomizedLasso + linear_model.RandomizedLogisticRegression + neighbors.LSHForest + + To be removed in 0.20 --------------------- diff --git a/doc/whats_new.rst b/doc/whats_new.rst index c73ee6e4f78dd..04a70f07c7247 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -1,4 +1,4 @@ -.. currentmodule:: sklearn + ..currentmodule:: sklearn =============== @@ -28,12 +28,11 @@ Fix longstanding implementation erorr in average_precision_score Multi-metric grid search and cross validation -Major deprecations ------------------- - -TODO - -We have deprecated RandomizedLasso and RandomizedLogisticRegression and LSHForest because they weren't appropriate or up to standards. We have deprecated a number of utilities no longer necessary now that we require Scipy 0.13.3 and Numpy 1.8.2 at a minimum. +Note also that we have deprecated RandomizedLasso, +RandomizedLogisticRegression and LSHForest because they weren't +appropriate or up to standards. We have deprecated a number of +utilities no longer necessary now that we require Scipy 0.13.3 and +Numpy 1.8.2 at a minimum. Changed models -------------- From 37df822a79cca2427e1f141b6d1945540a204dd9 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Sat, 1 Jul 2017 20:45:40 +1000 Subject: [PATCH 05/19] Clean up merge --- doc/whats_new.rst | 42 +++++++++++++++++++----------------------- 1 file changed, 19 insertions(+), 23 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 9ba019e59e7d6..5a8b9f2395d40 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -152,12 +152,6 @@ Linear, kernelized and related models attributes, ``n_skips_*``. :issue:`7914` by :user:`Michael Horrell `. - - Relax assumption on the data for the - :class:`kernel_approximation.SkewedChi2Sampler`. Since the Skewed-Chi2 - kernel is defined on the open interval :math:`(-skewedness; +\infty)^d`, - the transform function should not check whether ``X < 0`` but whether ``X < - -self.skewedness``. :issue:`7573` by :user:`Romain Brault `. - - Custom metrics for the :mod:`neighbors` binary trees now have fewer constraints: they must take two 1d-arrays and return a float. :issue:`6288` by `Jake Vanderplas`_. @@ -199,6 +193,16 @@ Preprocessing and feature selection :mod:`feature_extraction.text` by binding methods for loops and special-casing unigrams. :issue:`7567` by `Jaye Doepke ` + - Relax assumption on the data for the + :class:`kernel_approximation.SkewedChi2Sampler`. 
Since the Skewed-Chi2 + kernel is defined on the open interval :math:`(-skewedness; +\infty)^d`, + the transform function should not check whether ``X < 0`` but whether ``X < + -self.skewedness``. :issue:`7573` by :user:`Romain Brault `. + + - Made default kernel parameters kernel-dependent in + :class:`kernel_approximation.Nystroem`. + :issue:`5229` by :user:`mth4saurabh` and `Andreas Müller`_. + Model evaluation and meta-estimators - :class:`pipeline.Pipeline` allows to cache transformers @@ -472,6 +476,10 @@ Metrics hence :func:`metrics.cohen_kappa_score`. :issue:`8354`, :issue:`7929` by `Joel Nothman`_ and :user:`Jon Crall `. + - Fixed passing of ``gamma`` parameter to the ``chi2`` kernel in + :func:`metrics.pairwise_kernels` :issue:`5211` by :user:`nrhine1`, + :user:`mth4saurabh` and `Andreas Müller`_. + Miscellaneous - Fixed a bug when :func:`datasets.make_classification` fails @@ -501,18 +509,6 @@ Miscellaneous - Update Sphinx-Gallery from 0.1.4 to 0.1.7 for resolving links in documentation build with Sphinx>1.5 :issue:`8010`, :issue:`7986` by :user:`Oscar Najera ` - - Made default kernel parameters kernel-dependent in :class:`kernel_approximation.Nystroem` - :issue:`5229` by :user:`mth4saurabh` and `Andreas Müller`_. - - - Fixed passing of ``gamma`` parameter to the ``chi2`` kernel in - :func:`metrics.pairwise_kernels` :issue:`5211` by :user:`nrhine1`, - :user:`mth4saurabh` and `Andreas Müller`_. - - - Fixed a bug in :class:`gaussian_process.GaussianProcessRegressor` - when the standard deviation and covariance predicted without fit - would fail with a unmeaningful error by default. - :issue:`6573` by :user:`Quazi Marufur Rahman ` and - `Manoj Kumar`_. API changes summary @@ -565,6 +561,11 @@ Preprocessing and feature selection ``alternate_sign``. :issue:`7565` by :user:`Roman Yurchak `. + - :class:`linear_model.RandomizedLogisticRegression`, + and :class:`linear_model.RandomizedLasso` have been deprecated and will + be removed in version 0.21. + :issue: `8995` by :user:`Ramana.S `. + Model evaluation and meta-estimators - Deprecate the ``fit_params`` constructor input to the @@ -646,8 +647,6 @@ Miscellaneous - ``utils.random.choice`` - ``utils.sparsetools.connected_components`` - ``utils.stats.rankdata`` - - ``neighbors.approximate.LSHForest`` - - ``linear_model.randomized_l1`` - Estimators with both methods ``decision_function`` and ``predict_proba`` are now required to have a monotonic relation between them. The @@ -1391,9 +1390,6 @@ Model evaluation and meta-estimators the parameter ``n_labels`` is renamed to ``n_groups``. :issue:`6660` by `Raghav RV`_. - - The :mod:`sklearn.linear_model.randomized_l1` is deprecated. - :issue: `8995` by :user:`Ramana.S `. - Code Contributors ----------------- Aditya Joshi, Alejandro, Alexander Fabisch, Alexander Loginov, Alexander From ce535801af9a11c253180f1ce0908aa655963a2c Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 4 Jul 2017 13:53:07 +1000 Subject: [PATCH 06/19] Update --- doc/whats_new.rst | 36 ++++++++++++++++++++---------------- 1 file changed, 20 insertions(+), 16 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index c2819595c83bc..ac6297c3be2d9 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -1,4 +1,4 @@ - ..currentmodule:: sklearn +.. currentmodule:: sklearn =============== @@ -23,6 +23,11 @@ Multinomial logistic regression with L1 loss. ?Rewrite of TSNE +:class:`semi_supervised.LabelSpreading` and +:class:`semi_supervised.LabelPropagation` have had substantial fixes. 
+Propagation was previously broekn. Spreading should now function better +with respect to parameters. + Fix longstanding implementation erorr in average_precision_score @@ -42,7 +47,9 @@ parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures. - * :class:`sklearn.ensemble.IsolationForest` (bug fix) + * :class:`ensemble.IsolationForest` (bug fix) + * :class:`semi_supervised.LabelSpreading` (bug fix) + * :class:`semi_supervised.LabelPropagation` (bug fix) * TODO Details are listed in the changelog below. @@ -312,6 +319,12 @@ Trees and ensembles Linear, kernelized and related models + - Fix :class:`semi_supervised.BaseLabelPropagation` to correctly implement + ``LabelPropagation`` and ``LabelSpreading`` as done in the referenced + papers. :issue:`9239` + by :user:`Andre Ambrosio Boechat `, :user:`Utkarsh Upadhyay + `, and `Joel Nothman`_. + - Fixed a bug where :func:`linear_model.RANSACRegressor.fit` may run until ``max_iter`` if it finds a large inlier group early. :issue:`8251` by :user:`aivision2020`. @@ -523,20 +536,6 @@ Trees and ensembles The ``min_impurity_decrease`` helps stop splitting the nodes in which the weighted impurity decrease from splitting is no longer alteast ``min_impurity_decrease``. :issue:`8449` by `Raghav RV`_. - - Fixed the implementation of `explained_variance_` - in :class:`decomposition.PCA`, - :class:`decomposition.RandomizedPCA` and - :class:`decomposition.IncrementalPCA`. - :issue:`9105` by `Hanmin Qin `_. - - - Fix :class:`semi_supervised.BaseLabelPropagation` to correctly implement - ``LabelPropagation`` and ``LabelSpreading`` as done in the referenced - papers. :class:`semi_supervised.LabelPropagation` now always does hard - clamping. Its ``alpha`` parameter has no effect and is - deprecated to be removed in 0.21. :issue:`6727` :issue:`3550` issue:`5770` - by :user:`Andre Ambrosio Boechat `, :user:`Utkarsh Upadhyay - `, and `Joel Nothman`_. - Linear, kernelized and related models @@ -544,6 +543,11 @@ Linear, kernelized and related models removed in 0.21 due to poor performance. :issue:`8996` by `Andreas Müller`_. + - The ``alpha`` parameter of :class:`semi_supervised.LabelPropagation` now + has no effect and is deprecated to be removed in 0.21. :issue:`9239` + by :user:`Andre Ambrosio Boechat `, :user:`Utkarsh Upadhyay + `, and `Joel Nothman`_. + Decomposition, manifold learning and clustering - Deprecate the ``doc_topic_distr`` argument of the ``perplexity`` method From 89d5ea9111ea34ce965788d2bce7948769164c86 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 5 Jul 2017 11:12:33 +1000 Subject: [PATCH 07/19] TODOs to prose and minor changes --- doc/whats_new.rst | 73 ++++++++++++++++++++++++++--------------------- 1 file changed, 40 insertions(+), 33 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index ac6297c3be2d9..b16a91d02aefe 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -15,29 +15,32 @@ Highlights TODO: -This release includes a number of great new features including Local Outlier Factor for anomaly detection, QuantileTransformer for robust feature transformation, and ClassifierChain to simply account for dependencies between classes in multilabel problems. 
+This release includes a number of great new features including +:class:`neighbors.LocalOutlierFactor` for anomaly detection, +:class:`preprocessing.QuantileTransformer` for robust feature +transformation, and :class:`multioutput.ClassifierChain` to simply +account for dependencies between classes in multilabel problems. We +have some new algorithms in existing estimators, such as +multiplicative update in :class:`decomposition.NMF` and multinomial +:class:`linear_model.LogisticRegression` with L1 loss. + +You can learn faster. The new option to cache transformations in +:class:`pipeline.Pipeline` makes grid search over pipelines including +slow transformations much more efficient. + +And you can predict faster. If you're sure you know what you're doing, +you can turn off some validation using :func:`config_context`. -Pipeline caching makes grid search over pipelines including slow transformations much more efficient. - -Multinomial logistic regression with L1 loss. - -?Rewrite of TSNE +Multi-metric grid search and cross validation +We've made some important fixes too. +TODO: ?Rewrite of TSNE +We've fixed a longstanding implementation erorr in :func:`metrics.average_precision_score`. :class:`semi_supervised.LabelSpreading` and :class:`semi_supervised.LabelPropagation` have had substantial fixes. Propagation was previously broekn. Spreading should now function better with respect to parameters. -Fix longstanding implementation erorr in average_precision_score - - -Multi-metric grid search and cross validation - -Note also that we have deprecated RandomizedLasso, -RandomizedLogisticRegression and LSHForest because they weren't -appropriate or up to standards. We have deprecated a number of -utilities no longer necessary now that we require Scipy 0.13.3 and -Numpy 1.8.2 at a minimum. Changed models -------------- @@ -159,10 +162,6 @@ Linear, kernelized and related models attributes, ``n_skips_*``. :issue:`7914` by :user:`Michael Horrell `. - - Custom metrics for the :mod:`neighbors` binary trees now have - fewer constraints: they must take two 1d-arrays and return a float. - :issue:`6288` by `Jake Vanderplas`_. - - In :class:`gaussian_process.GaussianProcessRegressor`, method ``predict`` is a lot faster with ``return_std=True``. :issue:`8591` by :user:`Hadrien Bertrand `. @@ -173,9 +172,15 @@ Linear, kernelized and related models - Memory usage enhancement: Prevent cast from float32 to float64 in :class:`linear_model.Ridge` when using svd, sparse_cg, cholesky or lsqr solvers - :class:`sklearn.linear_model.Ridge` when using svd, sparse_cg, cholesky or lsqr solvers + :class:`linear_model.Ridge` when using svd, sparse_cg, cholesky or lsqr solvers by :user:`Joan Massich `, :user:`Nicolas Cordier ` +Other predictors + + - Custom metrics for the :mod:`neighbors` binary trees now have + fewer constraints: they must take two 1d-arrays and return a float. + :issue:`6288` by `Jake Vanderplas`_. + Decomposition, manifold learning and clustering - :class:`cluster.MiniBatchKMeans` and :class:`cluster.KMeans` @@ -264,7 +269,7 @@ Miscellaneous :issue:`7533` by :user:`Ekaterina Krivich `. - Added type checking to the ``accept_sparse`` parameter in - :mod:`sklearn.utils.validation` methods. This parameter now accepts only + :mod:`utils.validation` methods. This parameter now accepts only boolean, string, or list/tuple of strings. ``accept_sparse=None`` is deprecated and should be replaced by ``accept_sparse=False``. :issue:`7880` by :user:`Josh Karnofsky `. 
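As a rough illustration of the ``accept_sparse`` behaviour described in the entry
above (the toy matrix and the list of formats are only an example, not part of the
changelog)::

    from scipy import sparse
    from sklearn.utils import check_array

    X = sparse.random(10, 5, density=0.3, format='csc', random_state=0)

    # A list/tuple of strings whitelists sparse formats; the input is only
    # converted when it is not already in one of the listed formats.
    X_checked = check_array(X, accept_sparse=['csr', 'csc'])

    # accept_sparse=False (the replacement for the deprecated
    # accept_sparse=None) rejects sparse input with a TypeError.
    try:
        check_array(X, accept_sparse=False)
    except TypeError as exc:
        print(exc)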
@@ -319,16 +324,10 @@ Trees and ensembles Linear, kernelized and related models - - Fix :class:`semi_supervised.BaseLabelPropagation` to correctly implement - ``LabelPropagation`` and ``LabelSpreading`` as done in the referenced - papers. :issue:`9239` - by :user:`Andre Ambrosio Boechat `, :user:`Utkarsh Upadhyay - `, and `Joel Nothman`_. - - Fixed a bug where :func:`linear_model.RANSACRegressor.fit` may run until ``max_iter`` if it finds a large inlier group early. :issue:`8251` by :user:`aivision2020`. - - Fixed a bug where :class:`sklearn.naive_bayes.MultinomialNB` and :class:`sklearn.naive_bayes.BernoulliNB` + - Fixed a bug where :class:`naive_bayes.MultinomialNB` and :class:`naive_bayes.BernoulliNB` failed when `alpha=0`. :issue:`5814` by :user:`Yichuan Liu ` and :user:`Herilalaina Rakotoarison `. @@ -371,6 +370,14 @@ Linear, kernelized and related models :issue:`6573` by :user:`Quazi Marufur Rahman ` and `Manoj Kumar`_. +Other predictors + + - Fix :class:`semi_supervised.BaseLabelPropagation` to correctly implement + ``LabelPropagation`` and ``LabelSpreading`` as done in the referenced + papers. :issue:`9239` + by :user:`Andre Ambrosio Boechat `, :user:`Utkarsh Upadhyay + `, and `Joel Nothman`_. + Decomposition, manifold learning and clustering - Fix a bug in :class:`decomposition.LatentDirichletAllocation` @@ -503,7 +510,7 @@ Miscellaneous incorrect result when ``n_samples`` is odd. :issue:`8198` by :user:`Josh Levy `. - - Some ``fetch_`` functions in :mod:`sklearn.datasets` were ignoring the + - Some ``fetch_`` functions in :mod:`datasets` were ignoring the ``download_if_missing`` keyword. :issue:`7944` by :user:`Ralf Gommers `. - Fix estimators to accept a ``sample_weight`` parameter of type @@ -537,7 +544,7 @@ Trees and ensembles the weighted impurity decrease from splitting is no longer alteast ``min_impurity_decrease``. :issue:`8449` by `Raghav RV`_. -Linear, kernelized and related models +Other predictors - :class:`neighbors.LSHForest` has been deprecated and will be removed in 0.21 due to poor performance. @@ -636,7 +643,7 @@ Miscellaneous - SciPy >= 0.13.3 and NumPy >= 1.8.2 are now the minimum supported versions for scikit-learn. The following backported functions in - :mod:`sklearn.utils` have been removed or deprecated accordingly. + :mod:`utils` have been removed or deprecated accordingly. :issue:`8854` and :issue:`8874` by :user:`Naoya Kanai ` Removed in 0.19: @@ -669,7 +676,7 @@ Miscellaneous - Estimators with both methods ``decision_function`` and ``predict_proba`` are now required to have a monotonic relation between them. The method ``check_decision_proba_consistency`` has been added in - **sklearn.utils.estimator_checks** to check their consistency. + **utils.estimator_checks** to check their consistency. :issue:`7578` by :user:`Shubham Bhardwaj ` - All checks in ``utils.estimator_checks``, in particular From 736e93feda98f3ea77fa56c9c1012d92cccf8c18 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 5 Jul 2017 21:14:27 +1000 Subject: [PATCH 08/19] Changed models and minor fixes --- doc/modules/pipeline.rst | 2 ++ doc/whats_new.rst | 49 ++++++++++++++++++++++++---------------- 2 files changed, 31 insertions(+), 20 deletions(-) diff --git a/doc/modules/pipeline.rst b/doc/modules/pipeline.rst index b098ec04a999a..4356b3fe8d640 100644 --- a/doc/modules/pipeline.rst +++ b/doc/modules/pipeline.rst @@ -124,6 +124,8 @@ i.e. if the last estimator is a classifier, the :class:`Pipeline` can be used as a classifier. 
If the last estimator is a transformer, again, so is the pipeline. +.. _pipeline_cache: + Caching transformers: avoid repeated computation ------------------------------------------------- diff --git a/doc/whats_new.rst b/doc/whats_new.rst index b16a91d02aefe..1eb642b01d55d 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -13,8 +13,6 @@ Version 0.19 Highlights ---------- -TODO: - This release includes a number of great new features including :class:`neighbors.LocalOutlierFactor` for anomaly detection, :class:`preprocessing.QuantileTransformer` for robust feature @@ -24,12 +22,12 @@ have some new algorithms in existing estimators, such as multiplicative update in :class:`decomposition.NMF` and multinomial :class:`linear_model.LogisticRegression` with L1 loss. -You can learn faster. The new option to cache transformations in -:class:`pipeline.Pipeline` makes grid search over pipelines including -slow transformations much more efficient. - +You can learn faster. The :ref:`new option to cache transformations +` in :class:`pipeline.Pipeline` makes grid search over +pipelines including slow transformations much more efficient. And you can predict faster. If you're sure you know what you're doing, -you can turn off some validation using :func:`config_context`. +you can turn off validating that the input is finite using +:func:`config_context`. Multi-metric grid search and cross validation @@ -38,9 +36,8 @@ TODO: ?Rewrite of TSNE We've fixed a longstanding implementation erorr in :func:`metrics.average_precision_score`. :class:`semi_supervised.LabelSpreading` and :class:`semi_supervised.LabelPropagation` have had substantial fixes. -Propagation was previously broekn. Spreading should now function better -with respect to parameters. - +Propagation was previously broken. Spreading should now correctly +respect its alpha parameter. Changed models -------------- @@ -53,7 +50,18 @@ random sampling procedures. * :class:`ensemble.IsolationForest` (bug fix) * :class:`semi_supervised.LabelSpreading` (bug fix) * :class:`semi_supervised.LabelPropagation` (bug fix) - * TODO + * tree based models where ``min_weight_fraction_leaf`` is used (enhancement) + * :class:`ensemble.GradientBoostingClassifier` and + :class:`ensemble.GradientBoostingRegressor` where ``min_impurity_split`` is used (bug fix) + * gradient boosting with :class:`ensemble.gradient_boosting.QuantileLossFunction` (bug fix) + * :class:`linear_model.RANSACRegressor` (bug fix) + * :class:`linear_model.LassoLars` (bug fix) + * :class:`linear_model.LassoLarsIC` (bug fix) + * :class:`cluster.KMeans` with sparse X and initial centroids given (bug fix) + * :class:`manifold.TSNE` (bug fix) + * :class:`cross_decomposition.PLSRegression` + with ``scale=True`` (bug fix) + * :class:`feature_selection.SelectFdr` (bug fix) Details are listed in the changelog below. @@ -136,8 +144,8 @@ Trees and ensembles now support sparse input for prediction. :issue:`6101` by :user:`Ibraim Ganiev `. - - :class:`ensemble.VotingClassifier` now allow changing estimators by using - :meth:`ensemble.VotingClassifier.set_params`. Estimators can also be + - :class:`ensemble.VotingClassifier` now allows changing estimators by using + :meth:`ensemble.VotingClassifier.set_params`. An estimator can also be removed by setting it to `None`. :issue:`7674` by :user:`Yichuan Liu `. 
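A rough sketch of replacing or dropping a sub-estimator as described in the entry
above (the dataset and estimator names are only an example)::

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    eclf = VotingClassifier(estimators=[('lr', LogisticRegression()),
                                        ('nb', GaussianNB())])

    # Drop the naive Bayes sub-estimator by name; only the remaining
    # estimators take part in the vote.
    eclf.set_params(nb=None)
    eclf.fit(X, y)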
@@ -146,7 +154,7 @@ Linear, kernelized and related models - :class:`linear_model.SGDClassifier`, :class:`linear_model.SGDRegressor`, :class:`linear_model.PassiveAggressiveClassifier`, :class:`linear_model.PassiveAggressiveRegressor` and - :class:`linear_model.Perceptron` now expose a ``max_iter`` and + :class:`linear_model.Perceptron` now expose ``max_iter`` and ``tol`` parameters, to handle convergence more precisely. ``n_iter`` parameter is deprecated, and the fitted estimator exposes a ``n_iter_`` attribute, with actual number of iterations before @@ -213,11 +221,11 @@ Preprocessing and feature selection - Made default kernel parameters kernel-dependent in :class:`kernel_approximation.Nystroem`. - :issue:`5229` by :user:`mth4saurabh` and `Andreas Müller`_. + :issue:`5229` by :user:`Saurabh Bansod ` and `Andreas Müller`_. Model evaluation and meta-estimators - - :class:`pipeline.Pipeline` allows to cache transformers + - :class:`pipeline.Pipeline` is now able to cache transformers within a pipeline by using the ``memory`` constructor parameter. :issue:`7990` by :user:`Guillaume Lemaitre `. @@ -301,7 +309,7 @@ Trees and ensembles ``min_impurity_split`` parameter. :issue:`8006` by :user:`Sebastian Pölsterl `. - - Fixed oob_score in :class:`ensemble.BaggingClassifier`. + - Fixed ``oob_score`` in :class:`ensemble.BaggingClassifier`. :issue:`8936` by :user:`mlewis1729 ` - Fixed a bug where :class:`ensemble.IsolationForest` fails when @@ -444,7 +452,7 @@ Preprocessing and feature selection preventing the use of :class:`feature_extraction.text.HashingVectorizer` in a pipeline with :class:`feature_extraction.text.TfidfTransformer`. - :issue:`7513` by :user:`Roman Yurchak `. + :issue:`7565` by :user:`Roman Yurchak `. Model evaluation and meta-estimators @@ -497,8 +505,9 @@ Metrics by `Joel Nothman`_ and :user:`Jon Crall `. - Fixed passing of ``gamma`` parameter to the ``chi2`` kernel in - :func:`metrics.pairwise_kernels` :issue:`5211` by :user:`nrhine1`, - :user:`mth4saurabh` and `Andreas Müller`_. + :func:`metrics.pairwise_kernels` :issue:`5211` by + :user:`Nick Rhinehart `, + :user:`Saurabh Bansod ` and `Andreas Müller`_. Miscellaneous From 8d1fff83dfde3898bf0e60e0a0fba0a7bf7ce5bc Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 5 Jul 2017 23:28:37 +1000 Subject: [PATCH 09/19] sort --- doc/whats_new.rst | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 1eb642b01d55d..664ba8c585782 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -47,21 +47,21 @@ parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures. 
- * :class:`ensemble.IsolationForest` (bug fix) - * :class:`semi_supervised.LabelSpreading` (bug fix) - * :class:`semi_supervised.LabelPropagation` (bug fix) - * tree based models where ``min_weight_fraction_leaf`` is used (enhancement) + * :class:`cluster.KMeans` with sparse X and initial centroids given (bug fix) + * :class:`cross_decomposition.PLSRegression` + with ``scale=True`` (bug fix) * :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor` where ``min_impurity_split`` is used (bug fix) * gradient boosting with :class:`ensemble.gradient_boosting.QuantileLossFunction` (bug fix) + * :class:`ensemble.IsolationForest` (bug fix) + * :class:`feature_selection.SelectFdr` (bug fix) * :class:`linear_model.RANSACRegressor` (bug fix) * :class:`linear_model.LassoLars` (bug fix) * :class:`linear_model.LassoLarsIC` (bug fix) - * :class:`cluster.KMeans` with sparse X and initial centroids given (bug fix) * :class:`manifold.TSNE` (bug fix) - * :class:`cross_decomposition.PLSRegression` - with ``scale=True`` (bug fix) - * :class:`feature_selection.SelectFdr` (bug fix) + * :class:`semi_supervised.LabelSpreading` (bug fix) + * :class:`semi_supervised.LabelPropagation` (bug fix) + * tree based models where ``min_weight_fraction_leaf`` is used (enhancement) Details are listed in the changelog below. From acc4e311d30a7f7bfdb478ff4d306a91a1c4ccc7 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 6 Jul 2017 07:43:54 +1000 Subject: [PATCH 10/19] Merge in 0.18.2 docs --- doc/whats_new.rst | 27 ++++++++++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 664ba8c585782..c8759d15fd493 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -700,12 +700,12 @@ Miscellaneous :issue:`7464` by `Lars Buitinck`_ and `Loic Esteve`_. -.. _changes_0_18_1: +.. _changes_0_18_2: -Version 0.18.1 +Version 0.18.2 ============== -**November 11, 2016** +**June 20, 2017** .. topic:: Last release with Python 2.6 support @@ -713,6 +713,27 @@ Version 0.18.1 Later versions of scikit-learn will require Python 2.7 or above. +Changelog +--------- + + - Fixes for compatibility with NumPy 1.13.0: :issue:`7946` :issue:`8355` by + `Loic Esteve`_. + + - Minor compatibility changes in the examples :issue:`9010` :issue:`8040` + :issue:`9149`. + +Code Contributors +----------------- +Aman Dalmia, Loic Esteve, Nate Guerin, Sergei Lebedev + + +.. _changes_0_18_1: + +Version 0.18.1 +============== + +**November 11, 2016** + Changelog --------- From 8c275997ec2e5b18fccabc85bf534f79bdfdb388 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 6 Jul 2017 14:41:49 +1000 Subject: [PATCH 11/19] Missing entry from 0.18 logs --- doc/whats_new.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index c8759d15fd493..53679a601d2a6 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -1445,6 +1445,11 @@ Model evaluation and meta-estimators the parameter ``n_labels`` is renamed to ``n_groups``. :issue:`6660` by `Raghav RV`_. + - Error and loss names for ``scoring`` parameters are now prefixed by + ``'neg_'``, such as ``neg_mean_squared_error``. The unprefixed versions + are deprecated and will be removed in version 0.20. + :issue:`7261` by :user:`Tim Head `. 
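A small sketch of the prefixed loss names mentioned above (the estimator and the
synthetic data are illustrative only)::

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=5, noise=1.0,
                           random_state=0)

    # 'neg_mean_squared_error' follows the convention that higher scores are
    # better, so the returned values are negative (or zero).
    scores = cross_val_score(Ridge(), X, y,
                             scoring='neg_mean_squared_error', cv=5)
    print(scores)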
+ Code Contributors ----------------- Aditya Joshi, Alejandro, Alexander Fabisch, Alexander Loginov, Alexander From 0b8b79f215d2de44899683088cb7dbfa1da69a98 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 6 Jul 2017 18:21:54 +1000 Subject: [PATCH 12/19] Optimistically add some features to highlights --- doc/whats_new.rst | 36 ++++++++++++++++++++---------------- 1 file changed, 20 insertions(+), 16 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 53679a601d2a6..c2c7ea1edf232 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -13,7 +13,7 @@ Version 0.19 Highlights ---------- -This release includes a number of great new features including +We are excited to release a number of great new features including :class:`neighbors.LocalOutlierFactor` for anomaly detection, :class:`preprocessing.QuantileTransformer` for robust feature transformation, and :class:`multioutput.ClassifierChain` to simply @@ -22,22 +22,26 @@ have some new algorithms in existing estimators, such as multiplicative update in :class:`decomposition.NMF` and multinomial :class:`linear_model.LogisticRegression` with L1 loss. -You can learn faster. The :ref:`new option to cache transformations -` in :class:`pipeline.Pipeline` makes grid search over -pipelines including slow transformations much more efficient. -And you can predict faster. If you're sure you know what you're doing, -you can turn off validating that the input is finite using -:func:`config_context`. - -Multi-metric grid search and cross validation - -We've made some important fixes too. -TODO: ?Rewrite of TSNE -We've fixed a longstanding implementation erorr in :func:`metrics.average_precision_score`. -:class:`semi_supervised.LabelSpreading` and +You can also learn faster. For instance, the :ref:`new option to cache +transformations ` in :class:`pipeline.Pipeline` makes grid +search over pipelines including slow transformations much more efficient. And +you can predict faster: if you're sure you know what you're doing, you can turn +off validating that the input is finite using :func:`config_context`. + +Cross validation is now able to return the results from multiple metric +evaluations. The new :func:`model_selection.cross_validate` can return many +scores on the test data as well as training set performance and timings, and we +have extended the ``scoring`` and ``refit`` parameters for grid/randomized +search :ref:`to handle multiple metrics `. + +We've made some important fixes too. We've fixed a longstanding implementation +erorr in :func:`metrics.average_precision_score`, so please be cautious with +prior results reported from that function. A number of errors in the +:class:`manifold.TSNE` implementation have been fixed, particularly in the +default Barnes-Hut approximation. :class:`semi_supervised.LabelSpreading` and :class:`semi_supervised.LabelPropagation` have had substantial fixes. -Propagation was previously broken. Spreading should now correctly -respect its alpha parameter. +Propagation was previously broken. Spreading should now correctly respect its +alpha parameter. 
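A hedged sketch of the multiple-metric evaluation mentioned in the highlights above
(the estimator, data and metric names are only an example)::

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)

    # Several scores are computed from the same CV splits; the result is a
    # dict of arrays with keys such as 'test_accuracy' and 'train_f1_macro',
    # plus fit and score times.
    results = cross_validate(LinearSVC(random_state=0), X, y,
                             scoring=['accuracy', 'f1_macro'],
                             return_train_score=True)
    print(sorted(results))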
Changed models -------------- From eb05651456bef04da4bdaafc9bbe86c9f3651d0b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lo=C3=AFc=20Est=C3=A8ve?= Date: Thu, 6 Jul 2017 11:39:23 +0200 Subject: [PATCH 13/19] Forgotten user directive --- doc/whats_new.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index c2c7ea1edf232..20484bea497a1 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -215,7 +215,7 @@ Preprocessing and feature selection - Small performance improvement to n-gram creation in :mod:`feature_extraction.text` by binding methods for loops and - special-casing unigrams. :issue:`7567` by `Jaye Doepke ` + special-casing unigrams. :issue:`7567` by :user:`Jaye Doepke ` - Relax assumption on the data for the :class:`kernel_approximation.SkewedChi2Sampler`. Since the Skewed-Chi2 From 5cc2b2837686f35d7e6f4830f1db40f1ede60272 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lo=C3=AFc=20Est=C3=A8ve?= Date: Thu, 6 Jul 2017 11:44:40 +0200 Subject: [PATCH 14/19] Fix alignment --- doc/whats_new.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 20484bea497a1..9f0a72233fb0d 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -376,7 +376,7 @@ Linear, kernelized and related models :class:`linear_model.LassoCV`. :issue:`8973` by :user:`Paulo Haddad `. - - Fixed a bug in :class:`gaussian_process.GaussianProcessRegressor` + - Fixed a bug in :class:`gaussian_process.GaussianProcessRegressor` when the standard deviation and covariance predicted without fit would fail with a unmeaningful error by default. :issue:`6573` by :user:`Quazi Marufur Rahman ` and From 9643fc468dcb546bd49fc3c27b57b29d813fdfe5 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 7 Jul 2017 14:54:28 +1000 Subject: [PATCH 15/19] Cleaning up for Andy's comments --- doc/whats_new.rst | 50 +++--- sklearn/manifold/tests/test_t_sne.py | 235 +++++++++++++++++---------- 2 files changed, 173 insertions(+), 112 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index c2c7ea1edf232..717ab8d894ce6 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -15,12 +15,12 @@ Highlights We are excited to release a number of great new features including :class:`neighbors.LocalOutlierFactor` for anomaly detection, -:class:`preprocessing.QuantileTransformer` for robust feature -transformation, and :class:`multioutput.ClassifierChain` to simply -account for dependencies between classes in multilabel problems. We -have some new algorithms in existing estimators, such as -multiplicative update in :class:`decomposition.NMF` and multinomial -:class:`linear_model.LogisticRegression` with L1 loss. +:class:`preprocessing.QuantileTransformer` for robust feature transformation, +and the :class:`multioutput.ClassifierChain` meta-estimator to simply account +for dependencies between classes in multilabel problems. We have some new +algorithms in existing estimators, such as multiplicative update in +:class:`decomposition.NMF` and multinomial +:class:`linear_model.LogisticRegression` with L1 loss (use ``solver='saga'``). You can also learn faster. For instance, the :ref:`new option to cache transformations ` in :class:`pipeline.Pipeline` makes grid @@ -40,8 +40,8 @@ prior results reported from that function. A number of errors in the :class:`manifold.TSNE` implementation have been fixed, particularly in the default Barnes-Hut approximation. 
:class:`semi_supervised.LabelSpreading` and :class:`semi_supervised.LabelPropagation` have had substantial fixes. -Propagation was previously broken. Spreading should now correctly respect its -alpha parameter. +LabelPropagation was previously broken. LabelSpreading should now correctly +respect its alpha parameter. Changed models -------------- @@ -150,7 +150,7 @@ Trees and ensembles - :class:`ensemble.VotingClassifier` now allows changing estimators by using :meth:`ensemble.VotingClassifier.set_params`. An estimator can also be - removed by setting it to `None`. + removed by setting it to ``None``. :issue:`7674` by :user:`Yichuan Liu `. Linear, kernelized and related models @@ -260,7 +260,7 @@ Model evaluation and meta-estimators :issue:`8845` by :user:`themrmax ` - :class:`multioutput.MultiOutputRegressor` and :class:`multioutput.MultiOutputClassifier` - now support online learning using `partial_fit`. + now support online learning using ``partial_fit``. issue: `8053` by :user:`Peng Yu `. - Add ``max_train_size`` parameter to :class:`model_selection.TimeSeriesSplit` @@ -340,7 +340,7 @@ Linear, kernelized and related models ``max_iter`` if it finds a large inlier group early. :issue:`8251` by :user:`aivision2020`. - Fixed a bug where :class:`naive_bayes.MultinomialNB` and :class:`naive_bayes.BernoulliNB` - failed when `alpha=0`. :issue:`5814` by :user:`Yichuan Liu ` and + failed when ``alpha=0``. :issue:`5814` by :user:`Yichuan Liu ` and :user:`Herilalaina Rakotoarison `. - Fixed a bug where :class:`linear_model.LassoLars` does not give @@ -350,17 +350,17 @@ Linear, kernelized and related models - Fixed a bug in :class:`linear_model.RandomizedLasso`, :class:`linear_model.Lars`, :class:`linear_model.LassoLars`, :class:`linear_model.LarsCV` and :class:`linear_model.LassoLarsCV`, - where the parameter ``precompute`` were not used consistently across + where the parameter ``precompute`` was not used consistently across classes, and some values proposed in the docstring could raise errors. :issue:`5359` by `Tom Dupre la Tour`_. - Fix a bug where :func:`linear_model.LassoLars.fit` sometimes - left `coef_` as a list, rather than an ndarray. + left ``coef_`` as a list, rather than an ndarray. :issue:`8160` by :user:`CJ Carey `. - Fix :func:`linear_model.BayesianRidge.fit` to return - ridge parameter `alpha_` and `lambda_` consistent with calculated - coefficients `coef_` and `intercept_`. + ridge parameter ``alpha_`` and ``lambda_`` consistent with calculated + coefficients ``coef_`` and ``intercept_``. :issue:`8224` by :user:`Peter Gedeck `. - Fixed a bug in :class:`svm.OneClassSVM` where it returned floats instead of @@ -404,7 +404,7 @@ Decomposition, manifold learning and clustering This also impacts the output shape of :class:`decomposition.DictionaryLearning`. :issue:`8086` by `Andreas Müller`_. - - Fixed the implementation of `explained_variance_` + - Fixed the implementation of ``explained_variance_`` in :class:`decomposition.PCA`, :class:`decomposition.RandomizedPCA` and :class:`decomposition.IncrementalPCA`. @@ -484,10 +484,10 @@ Model evaluation and meta-estimators classes are provided up-front. :issue:`6250` by :user:`Asish Panda `. - - Fix :func:`multioutput.MultiOutputClassifier.predict_proba` to - return a list of 2d arrays, rather than a 3d array. In the case where - different target columns had different numbers of classes, a `ValueError` - would be raised on trying to stack matrices with different dimensions. 
+ - Fix :func:`multioutput.MultiOutputClassifier.predict_proba` to return a + list of 2d arrays, rather than a 3d array. In the case where different + target columns had different numbers of classes, a ``ValueError`` would be + raised on trying to stack matrices with different dimensions. :issue:`8093` by :user:`Peter Bull `. Metrics @@ -561,7 +561,7 @@ Other predictors - :class:`neighbors.LSHForest` has been deprecated and will be removed in 0.21 due to poor performance. - :issue:`8996` by `Andreas Müller`_. + :issue:`9078` by :user:`Laurent Direr `. - The ``alpha`` parameter of :class:`semi_supervised.LabelPropagation` now has no effect and is deprecated to be removed in 0.21. :issue:`9239` @@ -602,7 +602,7 @@ Preprocessing and feature selection - :class:`linear_model.RandomizedLogisticRegression`, and :class:`linear_model.RandomizedLasso` have been deprecated and will be removed in version 0.21. - :issue: `8995` by :user:`Ramana.S `. + :issue:`8995` by :user:`Ramana.S `. Model evaluation and meta-estimators @@ -633,7 +633,7 @@ Model evaluation and meta-estimators - The :func:`multioutput.MultiOutputClassifier.predict_proba` function used to return a 3d array (``n_samples``, ``n_classes``, ``n_outputs``). In the case where different target columns had different - numbers of classes, a `ValueError` would be raised on trying to stack + numbers of classes, a ``ValueError`` would be raised on trying to stack matrices with different dimensions. This function now returns a list of arrays where the length of the list is ``n_outputs``, and each array is (``n_samples``, ``n_classes``) for that particular output. @@ -645,11 +645,9 @@ Model evaluation and meta-estimators attribute, ``dict`` behavior will be prioritized. :issue:`8481` by :user:`Herilalaina Rakotoarison `. -Metrics - Miscellaneous - - Deprecate the ``y`` parameter in `transform` and `inverse_transform`. + - Deprecate the ``y`` parameter in ``transform`` and ``inverse_transform``. The method should not accept ``y`` parameter, as it's used at the prediction time. :issue:`8174` by :user:`Tahar Zanouda `, `Alexandre Gramfort`_ and `Raghav RV`_. 
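To illustrate the shape change described in the ``predict_proba`` entry above (the
random data and the base estimator are illustrative only)::

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.multioutput import MultiOutputClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(20, 4)
    # Two target columns with different numbers of classes.
    Y = np.column_stack([rng.randint(2, size=20), rng.randint(3, size=20)])

    clf = MultiOutputClassifier(RandomForestClassifier(random_state=0))
    clf.fit(X, Y)

    # One (n_samples, n_classes_i) array per output, instead of a single
    # 3d array that could not accommodate differing class counts.
    proba = clf.predict_proba(X)
    print([p.shape for p in proba])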
diff --git a/sklearn/manifold/tests/test_t_sne.py b/sklearn/manifold/tests/test_t_sne.py index 52c056a5adadf..8b9c9d6a76862 100644 --- a/sklearn/manifold/tests/test_t_sne.py +++ b/sklearn/manifold/tests/test_t_sne.py @@ -10,6 +10,7 @@ from sklearn.utils.testing import assert_array_equal from sklearn.utils.testing import assert_array_almost_equal from sklearn.utils.testing import assert_less +from sklearn.utils.testing import assert_greater from sklearn.utils.testing import assert_raises_regexp from sklearn.utils.testing import assert_in from sklearn.utils.testing import skip_if_32bit @@ -140,20 +141,26 @@ def test_binary_search_neighbors(): # Test that when we use all the neighbors the results are identical k = n_samples - neighbors_nn = np.argsort(distances, axis=1)[:, :k].astype(np.int64) - P2 = _binary_search_perplexity(distances, neighbors_nn, + neighbors_nn = np.argsort(distances, axis=1)[:, 1:k].astype(np.int64) + distances_nn = np.array([distances[k, neighbors_nn[k]] + for k in range(n_samples)]) + P2 = _binary_search_perplexity(distances_nn, neighbors_nn, desired_perplexity, verbose=0) - assert_array_almost_equal(P1, P2, decimal=4) + P_nn = np.array([P1[k, neighbors_nn[k]] for k in range(n_samples)]) + assert_array_almost_equal(P_nn, P2, decimal=4) # Test that the highest P_ij are the same when few neighbors are used - for k in np.linspace(80, n_samples, 10): + for k in np.linspace(80, n_samples, 5): k = int(k) topn = k * 10 # check the top 10 *k entries out of k * k entries neighbors_nn = np.argsort(distances, axis=1)[:, :k].astype(np.int64) - P2k = _binary_search_perplexity(distances, neighbors_nn, + distances_nn = np.array([distances[k, neighbors_nn[k]] + for k in range(n_samples)]) + P2k = _binary_search_perplexity(distances_nn, neighbors_nn, desired_perplexity, verbose=0) idx = np.argsort(P1.ravel())[::-1] P1top = P1.ravel()[idx][:topn] + idx = np.argsort(P2k.ravel())[::-1] P2top = P2k.ravel()[idx][:topn] assert_array_almost_equal(P1top, P2top, decimal=2) @@ -175,6 +182,8 @@ def test_binary_perplexity_stability(): P = _binary_search_perplexity(distances.copy(), neighbors_nn.copy(), 3, verbose=0) P1 = _joint_probabilities_nn(distances, neighbors_nn, 3, verbose=0) + # Convert the sparse matrix to a dense one for testing + P1 = P1.toarray() if last_P is None: last_P = P last_P1 = P1 @@ -193,9 +202,9 @@ def test_gradient(): alpha = 1.0 distances = random_state.randn(n_samples, n_features).astype(np.float32) - distances = distances.dot(distances.T) + distances = np.abs(distances.dot(distances.T)) np.fill_diagonal(distances, 0.0) - X_embedded = random_state.randn(n_samples, n_components) + X_embedded = random_state.randn(n_samples, n_components).astype(np.float32) P = _joint_probabilities(distances, desired_perplexity=25.0, verbose=0) @@ -233,21 +242,17 @@ def test_trustworthiness(): def test_preserve_trustworthiness_approximately(): # Nearest neighbors should be preserved approximately. random_state = check_random_state(0) - # The Barnes-Hut approximation uses a different method to estimate - # P_ij using only a number of nearest neighbors instead of all - # points (so that k = 3 * perplexity). As a result we set the - # perplexity=5, so that the number of neighbors is 5%. 
n_components = 2 methods = ['exact', 'barnes_hut'] - X = random_state.randn(100, n_components).astype(np.float32) + X = random_state.randn(50, n_components).astype(np.float32) for init in ('random', 'pca'): for method in methods: - tsne = TSNE(n_components=n_components, perplexity=50, + tsne = TSNE(n_components=n_components, perplexity=25, learning_rate=100.0, init=init, random_state=0, method=method) X_embedded = tsne.fit_transform(X) - T = trustworthiness(X, X_embedded, n_neighbors=1) - assert_almost_equal(T, 1.0, decimal=1) + t = trustworthiness(X, X_embedded, n_neighbors=1) + assert_greater(t, 0.9) def test_optimization_minimizes_kl_divergence(): @@ -255,7 +260,7 @@ def test_optimization_minimizes_kl_divergence(): random_state = check_random_state(0) X, _ = make_blobs(n_features=3, random_state=random_state) kl_divergences = [] - for n_iter in [200, 250, 300]: + for n_iter in [250, 300, 350]: tsne = TSNE(n_components=2, perplexity=10, learning_rate=100.0, n_iter=n_iter, random_state=0) tsne.fit_transform(X) @@ -280,13 +285,16 @@ def test_fit_csr_matrix(): def test_preserve_trustworthiness_approximately_with_precomputed_distances(): # Nearest neighbors should be preserved approximately. random_state = check_random_state(0) - X = random_state.randn(100, 2) - D = squareform(pdist(X), "sqeuclidean") - tsne = TSNE(n_components=2, perplexity=2, learning_rate=100.0, - metric="precomputed", random_state=0, verbose=0) - X_embedded = tsne.fit_transform(D) - assert_almost_equal(trustworthiness(D, X_embedded, n_neighbors=1, - precomputed=True), 1.0, decimal=1) + for i in range(3): + X = random_state.randn(100, 2) + D = squareform(pdist(X), "sqeuclidean") + tsne = TSNE(n_components=2, perplexity=2, learning_rate=100.0, + early_exaggeration=2.0, metric="precomputed", + random_state=i, verbose=0) + X_embedded = tsne.fit_transform(D) + t = trustworthiness(D, X_embedded, n_neighbors=1, + precomputed=True) + assert t > .95 def test_early_exaggeration_too_small(): @@ -310,10 +318,32 @@ def test_non_square_precomputed_distances(): tsne.fit_transform, np.array([[0.0], [1.0]])) +def test_non_positive_precomputed_distances(): + # Precomputed distance matrices must be positive. + bad_dist = np.array([[0., -1.], [1., 0.]]) + for method in ['barnes_hut', 'exact']: + tsne = TSNE(metric="precomputed", method=method) + assert_raises_regexp(ValueError, "All distances .*precomputed.*", + tsne.fit_transform, bad_dist) + + +def test_non_positive_computed_distances(): + # Computed distance matrices must be positive. + def metric(x, y): + return -1 + + tsne = TSNE(metric=metric, method='exact') + X = np.array([[0.0, 0.0], [1.0, 1.0]]) + assert_raises_regexp(ValueError, "All distances .*metric given.*", + tsne.fit_transform, X) + + def test_init_not_available(): # 'init' must be 'pca', 'random', or numpy array. + tsne = TSNE(init="not available") m = "'init' must be 'pca', 'random', or a numpy array" - assert_raises_regexp(ValueError, m, TSNE, init="not available") + assert_raises_regexp(ValueError, m, tsne.fit_transform, + np.array([[0.0], [1.0]])) def test_init_ndarray(): @@ -332,10 +362,29 @@ def test_init_ndarray_precomputed(): def test_distance_not_available(): # 'metric' must be valid. 
- tsne = TSNE(metric="not available") + tsne = TSNE(metric="not available", method='exact') assert_raises_regexp(ValueError, "Unknown metric not available.*", tsne.fit_transform, np.array([[0.0], [1.0]])) + tsne = TSNE(metric="not available", method='barnes_hut') + assert_raises_regexp(ValueError, "Metric 'not available' not valid.*", + tsne.fit_transform, np.array([[0.0], [1.0]])) + + +def test_method_not_available(): + # 'nethod' must be 'barnes_hut' or 'exact' + tsne = TSNE(method='not available') + assert_raises_regexp(ValueError, "'method' must be 'barnes_hut' or ", + tsne.fit_transform, np.array([[0.0], [1.0]])) + + +def test_angle_out_of_range_checks(): + # check the angle parameter range + for angle in [-1, -1e-6, 1 + 1e-6, 2]: + tsne = TSNE(angle=angle) + assert_raises_regexp(ValueError, "'angle' must be between 0.0 - 1.0", + tsne.fit_transform, np.array([[0.0], [1.0]])) + def test_pca_initialization_not_compatible_with_precomputed_kernel(): # Precomputed distance matrices must be square matrices. @@ -345,6 +394,48 @@ def test_pca_initialization_not_compatible_with_precomputed_kernel(): tsne.fit_transform, np.array([[0.0], [1.0]])) +def test_n_components_range(): + # barnes_hut method should only be used with n_components <= 3 + tsne = TSNE(n_components=4, method="barnes_hut") + assert_raises_regexp(ValueError, "'n_components' should be .*", + tsne.fit_transform, np.array([[0.0], [1.0]])) + + +def test_early_exaggeration_used(): + # check that the ``early_exaggeration`` parameter has an effect + random_state = check_random_state(0) + n_components = 2 + methods = ['exact', 'barnes_hut'] + X = random_state.randn(25, n_components).astype(np.float32) + for method in methods: + tsne = TSNE(n_components=n_components, perplexity=1, + learning_rate=100.0, init="pca", random_state=0, + method=method, early_exaggeration=1.0) + X_embedded1 = tsne.fit_transform(X) + tsne = TSNE(n_components=n_components, perplexity=1, + learning_rate=100.0, init="pca", random_state=0, + method=method, early_exaggeration=10.0) + X_embedded2 = tsne.fit_transform(X) + + assert not np.allclose(X_embedded1, X_embedded2) + + +def test_n_iter_used(): + # check that the ``n_iter`` parameter has an effect + random_state = check_random_state(0) + n_components = 2 + methods = ['exact', 'barnes_hut'] + X = random_state.randn(25, n_components).astype(np.float32) + for method in methods: + for n_iter in [251, 500]: + tsne = TSNE(n_components=n_components, perplexity=1, + learning_rate=0.5, init="random", random_state=0, + method=method, early_exaggeration=1.0, n_iter=n_iter) + tsne.fit_transform(X) + + assert tsne.n_iter_final == n_iter - 1 + + def test_answer_gradient_two_points(): # Test the tree with only a single set of children. # @@ -418,7 +509,13 @@ def _run_answer_test(pos_input, pos_output, neighbors, grad_output, pij_input = squareform(pij_input).astype(np.float32) grad_bh = np.zeros(pos_output.shape, dtype=np.float32) - _barnes_hut_tsne.gradient(pij_input, pos_output, neighbors, + from scipy.sparse import csr_matrix + P = csr_matrix(pij_input) + + neighbors = P.indices.astype(np.int64) + indptr = P.indptr.astype(np.int64) + + _barnes_hut_tsne.gradient(P.data, pos_output, neighbors, indptr, grad_bh, 0.5, 2, 1, skip_num_points=0) assert_array_almost_equal(grad_bh, grad_output, decimal=4) @@ -439,7 +536,7 @@ def test_verbose(): sys.stdout = old_stdout assert("[t-SNE]" in out) - assert("Computing pairwise distances" in out) + assert("nearest neighbors..." 
in out) assert("Computed conditional probabilities" in out) assert("Mean sigma" in out) assert("Finished" in out) @@ -481,10 +578,15 @@ def test_64bit(): methods = ['barnes_hut', 'exact'] for method in methods: for dt in [np.float32, np.float64]: - X = random_state.randn(100, 2).astype(dt) + X = random_state.randn(50, 2).astype(dt) tsne = TSNE(n_components=2, perplexity=2, learning_rate=100.0, - random_state=0, method=method) - tsne.fit_transform(X) + random_state=0, method=method, verbose=0) + X_embedded = tsne.fit_transform(X) + effective_type = X_embedded.dtype + + # tsne cython code is only single precision, so the output will + # always be single precision, irrespectively of the input dtype + assert effective_type == np.float32 def test_barnes_hut_angle(): @@ -499,10 +601,10 @@ def test_barnes_hut_angle(): random_state = check_random_state(0) distances = random_state.randn(n_samples, n_features) distances = distances.astype(np.float32) - distances = distances.dot(distances.T) + distances = abs(distances.dot(distances.T)) np.fill_diagonal(distances, 0.0) params = random_state.randn(n_samples, n_components) - P = _joint_probabilities(distances, perplexity, False) + P = _joint_probabilities(distances, perplexity, verbose=0) kl, gradex = _kl_divergence(params, P, degrees_of_freedom, n_samples, n_components) @@ -510,58 +612,19 @@ def test_barnes_hut_angle(): bt = BallTree(distances) distances_nn, neighbors_nn = bt.query(distances, k=k + 1) neighbors_nn = neighbors_nn[:, 1:] - Pbh = _joint_probabilities_nn(distances, neighbors_nn, - perplexity, False) - kl, gradbh = _kl_divergence_bh(params, Pbh, neighbors_nn, - degrees_of_freedom, n_samples, - n_components, angle=angle, - skip_num_points=0, verbose=False) + distances_nn = np.array([distances[i, neighbors_nn[i]] + for i in range(n_samples)]) + assert np.all(distances[0, neighbors_nn[0]] == distances_nn[0]),\ + abs(distances[0, neighbors_nn[0]] - distances_nn[0]) + Pbh = _joint_probabilities_nn(distances_nn, neighbors_nn, + perplexity, verbose=0) + kl, gradbh = _kl_divergence_bh(params, Pbh, degrees_of_freedom, + n_samples, n_components, angle=angle, + skip_num_points=0, verbose=0) + + P = squareform(P) + Pbh = Pbh.toarray() assert_array_almost_equal(Pbh, P, decimal=5) - assert_array_almost_equal(gradex, gradbh, decimal=5) - - -def test_quadtree_similar_point(): - # Introduce a point into a quad tree where a similar point already exists. - # Test will hang if it doesn't complete. 
- Xs = [] - - # check the case where points are actually different - Xs.append(np.array([[1, 2], [3, 4]], dtype=np.float32)) - # check the case where points are the same on X axis - Xs.append(np.array([[1.0, 2.0], [1.0, 3.0]], dtype=np.float32)) - # check the case where points are arbitrarily close on X axis - Xs.append(np.array([[1.00001, 2.0], [1.00002, 3.0]], dtype=np.float32)) - # check the case where points are the same on Y axis - Xs.append(np.array([[1.0, 2.0], [3.0, 2.0]], dtype=np.float32)) - # check the case where points are arbitrarily close on Y axis - Xs.append(np.array([[1.0, 2.00001], [3.0, 2.00002]], dtype=np.float32)) - # check the case where points are arbitrarily close on both axes - Xs.append(np.array([[1.00001, 2.00001], [1.00002, 2.00002]], - dtype=np.float32)) - - # check the case where points are arbitrarily close on both axes - # close to machine epsilon - x axis - Xs.append(np.array([[1, 0.0003817754041], [2, 0.0003817753750]], - dtype=np.float32)) - - # check the case where points are arbitrarily close on both axes - # close to machine epsilon - y axis - Xs.append(np.array([[0.0003817754041, 1.0], [0.0003817753750, 2.0]], - dtype=np.float32)) - - for X in Xs: - counts = np.zeros(3, dtype='int64') - _barnes_hut_tsne.check_quadtree(X, counts) - m = "Tree consistency failed: unexpected number of points at root node" - assert_equal(counts[0], counts[1], m) - m = "Tree consistency failed: unexpected number of points on the tree" - assert_equal(counts[0], counts[2], m) - - -def test_index_offset(): - # Make sure translating between 1D and N-D indices are preserved - assert_equal(_barnes_hut_tsne.test_index2offset(), 1) - assert_equal(_barnes_hut_tsne.test_index_offset(), 1) @skip_if_32bit @@ -569,8 +632,8 @@ def test_n_iter_without_progress(): # Use a dummy negative n_iter_without_progress and check output on stdout random_state = check_random_state(0) X = random_state.randn(100, 2) - tsne = TSNE(n_iter_without_progress=-1, verbose=2, - random_state=1, method='exact') + tsne = TSNE(n_iter_without_progress=-1, verbose=2, learning_rate=1e8, + random_state=1, method='exact', n_iter=300) old_stdout = sys.stdout sys.stdout = StringIO() @@ -616,7 +679,7 @@ def test_min_grad_norm(): start_grad_norm = line.find('gradient norm') if start_grad_norm >= 0: line = line[start_grad_norm:] - line = line.replace('gradient norm = ', '') + line = line.replace('gradient norm = ', '').split(' ')[0] gradient_norm_values.append(float(line)) # Compute how often the gradient norm is smaller than min_grad_norm From 08fc42c7b73faac7bd4c57b96913745d568969a3 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 7 Jul 2017 14:59:14 +1000 Subject: [PATCH 16/19] Mention beta_loss=0 speedup --- doc/whats_new.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 974fd48f7409d..71fb0612e70f2 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -204,6 +204,9 @@ Decomposition, manifold learning and clustering from the underlying SVD. They are stored in the attribute ``singular_values_``, like in :class:`decomposition.IncrementalPCA`. + - :class:`decomposition.NMF` now faster when ``beta_loss=0``. + :issue:`9277` by :user:`hongkahjun`. 
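A brief sketch of fitting :class:`decomposition.NMF` with a non-default
beta-divergence, as in the speed-up noted above (the data, ``n_components`` and
iteration budget are illustrative only)::

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.RandomState(0)
    # Strictly positive data: beta_loss=0 (Itakura-Saito) cannot handle zeros.
    X = rng.rand(40, 10) + 1e-3

    # beta_loss values other than the Frobenius norm require the
    # multiplicative-update solver.
    nmf = NMF(n_components=5, solver='mu', beta_loss=0, init='random',
              max_iter=500, random_state=0)
    W = nmf.fit_transform(X)
    H = nmf.components_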
+ Preprocessing and feature selection - Added ``norm_order`` parameter to :class:`feature_selection.SelectFromModel` From 6aa9fd0c87b8bb422c200da5f1b6a6751a5c5a8e Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 11 Jul 2017 09:56:53 +1000 Subject: [PATCH 17/19] Update --- doc/whats_new.rst | 51 ++++++++++++++++++++++++----------------------- 1 file changed, 26 insertions(+), 25 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 4295fc71ab7cb..dbc725d6d4c78 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -78,28 +78,6 @@ Changelog New features ............ -Configuration - - :class:`model_selection.GridSearchCV` and - :class:`model_selection.RandomizedSearchCV` now support simultaneous - evaluation of multiple metrics. Refer to the - :ref:`multimetric_grid_search` section of the user guide for more - information. :issue:`7388` by `Raghav RV`_ - - - Added the :func:`model_selection.cross_validate` which allows evaluation - of multiple metrics. This function returns a dict with more useful - information from cross-validation such as the train scores, fit times and - score times. - Refer to :ref:`multimetric_cross_validation` section of the userguide - for more information. :issue:`7388` by `Raghav RV`_ - - - Added :class:`multioutput.ClassifierChain` for multi-label - classification. By `Adam Kleczewski `_. - - - Validation that input data contains no NaN or inf can now be suppressed - using :func:`config_context`, at your own risk. This will save on runtime, - and may be particularly useful for prediction time. :issue:`7548` by - `Joel Nothman`_. - Classifiers and regressors - Added :class:`multioutput.ClassifierChain` for multi-label @@ -133,6 +111,19 @@ Other estimators Model selection and evaluation + - :class:`model_selection.GridSearchCV` and + :class:`model_selection.RandomizedSearchCV` now support simultaneous + evaluation of multiple metrics. Refer to the + :ref:`multimetric_grid_search` section of the user guide for more + information. :issue:`7388` by `Raghav RV`_ + + - Added the :func:`model_selection.cross_validate` which allows evaluation + of multiple metrics. This function returns a dict with more useful + information from cross-validation such as the train scores, fit times and + score times. + Refer to :ref:`multimetric_cross_validation` section of the userguide + for more information. :issue:`7388` by `Raghav RV`_ + - Added :func:`metrics.mean_squared_log_error`, which computes the mean square error of the logarithmic transformation of targets, particularly useful for targets with an exponential trend. @@ -147,6 +138,12 @@ Model selection and evaluation :class:`model_selection.RepeatedStratifiedKFold`. :issue:`8120` by `Neeraj Gangwar`_. +Miscellaneous + + - Validation that input data contains no NaN or inf can now be suppressed + using :func:`config_context`, at your own risk. This will save on runtime, + and may be particularly useful for prediction time. :issue:`7548` by + `Joel Nothman`_. Enhancements ............ @@ -372,6 +369,10 @@ Linear, kernelized and related models classes, and some values proposed in the docstring could raise errors. :issue:`5359` by `Tom Dupre la Tour`_. + - Fix inconsistent results between :class:`linear_model.RidgeCV` + and :class:`linear_model.Ridge` when using ``normalize=True`` + by `Alexandre Gramfort`_. + - Fix a bug where :func:`linear_model.LassoLars.fit` sometimes left ``coef_`` as a list, rather than an ndarray. :issue:`8160` by :user:`CJ Carey `. 
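A short sketch of suppressing input validation as described in the Miscellaneous
entry above (the estimator and data are only an example; skipping the check is at
the caller's own risk)::

    import numpy as np
    import sklearn
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 20)
    y = (X[:, 0] > 0.5).astype(int)

    clf = LogisticRegression().fit(X, y)

    # Skip the check for NaN/inf at prediction time to save runtime.
    with sklearn.config_context(assume_finite=True):
        predictions = clf.predict(X)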
@@ -564,9 +565,9 @@ Miscellaneous - Add ``data_home`` parameter to :func:`sklearn.datasets.fetch_kddcup99` by `Loic Esteve`_. - - Fix inconsistent results between :class:`linear_model.RidgeCV` - and :class:`linear_model.Ridge` when using ``normalize=True`` - by `Alexandre Gramfort`_. + - Several minor issues were fixed with thanks to the alerts of + [lgtm.com](http://lgtm.com). :issue:`9278` by :user:`Jean Helie `, + among others. API changes summary ------------------- From 09878d4c082c0968f35ab5ec6a5f239911c5bfb5 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 13 Jul 2017 08:51:38 +1000 Subject: [PATCH 18/19] Clean up new what's new entries --- doc/whats_new.rst | 42 +++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 25 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index d9e7331df4a4f..432bdd46475e9 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -66,8 +66,6 @@ random sampling procedures. * :class:`semi_supervised.LabelSpreading` (bug fix) * :class:`semi_supervised.LabelPropagation` (bug fix) * tree based models where ``min_weight_fraction_leaf`` is used (enhancement) - * :class:`sklearn.ensemble.IsolationForest` (bug fix) - * :class:`sklearn.manifold.TSNE` (bug fix) Details are listed in the changelog below. @@ -221,6 +219,15 @@ Decomposition, manifold learning and clustering - :class:`decomposition.NMF` now faster when ``beta_loss=0``. :issue:`9277` by :user:`hongkahjun`. + - Memory improvements for method ``barnes_hut`` in :class:`manifold.TSNE` + :issue:`7089` by :user:`Thomas Moreau ` and `Olivier Grisel`_. + + - Optimization schedule improvements for Barnes-Hut :class:`manifold.TSNE` + so the results are closer to the one from the reference implementation + `lvdmaaten/bhtsne `_ by :user:`Thomas + Moreau ` and `Olivier Grisel`_. + + Preprocessing and feature selection - Added ``norm_order`` parameter to :class:`feature_selection.SelectFromModel` @@ -307,21 +314,6 @@ Miscellaneous passing a range of bytes to :func:`datasets.load_svmlight_file`. :issue:`935` by :user:`Olivier Grisel `. - - Small performance improvement to n-gram creation in - :mod:`feature_extraction.text` by binding methods for loops and - special-casing unigrams. :issue:`7567` by `Jaye Doepke ` - - - Speed improvements to :class:`model_selection.StratifiedShuffleSplit`. - :issue:`5991` by :user:`Arthur Mensch ` and `Joel Nothman`_. - - - Memory improvements for method barnes_hut in :class:`manifold.TSNE` - :issue:`7089` by :user:`Thomas Moreau ` and `Olivier Grisel`_. - - - Optimization schedule improvements for so the results are closer to the - one from the reference implementation - `lvdmaaten/bhtsne `_ by - :user:`Thomas Moreau ` and `Olivier Grisel`_. - Bug fixes ......... @@ -428,6 +420,14 @@ Other predictors Decomposition, manifold learning and clustering + - Fixed the implementation of :class:`manifold.TSNE`: + - ``early_exageration`` parameter had no effect and is now used for the + first 250 optimization iterations. + - Fixed the ``InsersionError`` reported in :issue:`8992`. + - Improve the learning schedule to match the one from the reference + implementation `lvdmaaten/bhtsne `_. + by :user:`Thomas Moreau ` and `Olivier Grisel`_. + - Fix a bug in :class:`decomposition.LatentDirichletAllocation` where the ``perplexity`` method was returning incorrect results because the ``transform`` method returns normalized document topic distributions @@ -586,14 +586,6 @@ Miscellaneous [lgtm.com](http://lgtm.com). 
:issue:`9278` by :user:`Jean Helie `, among others. - - Fixed the implementation of :class:`manifold.TSNE`: - - ``early_exageration`` parameter had no effect and is now used for the - first 250 optimization iterations. - - Fixed the ``InsersionError`` reported in :issue:`8992`. - - Improve the learning schedule to match the one from the reference - implementation `lvdmaaten/bhtsne `_. - by :user:`Thomas Moreau ` and `Olivier Grisel`_. - API changes summary ------------------- From 0312156db192c34de428f591dc89c2891b01680e Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 13 Jul 2017 15:10:48 +1000 Subject: [PATCH 19/19] DOC Add changes missed from what's new And other minor things. This took lots of effort which I would have not committed where I not home sick... --- doc/whats_new.rst | 171 +++++++++++++++++++++++++++++++++++----------- 1 file changed, 131 insertions(+), 40 deletions(-) diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 432bdd46475e9..21eb3478dbc1b 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -56,7 +56,7 @@ random sampling procedures. with ``scale=True`` (bug fix) * :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor` where ``min_impurity_split`` is used (bug fix) - * gradient boosting with :class:`ensemble.gradient_boosting.QuantileLossFunction` (bug fix) + * gradient boosting ``loss='quantile'`` (bug fix) * :class:`ensemble.IsolationForest` (bug fix) * :class:`feature_selection.SelectFdr` (bug fix) * :class:`linear_model.RANSACRegressor` (bug fix) @@ -145,6 +145,10 @@ Miscellaneous and may be particularly useful for prediction time. :issue:`7548` by `Joel Nothman`_. + - Added a test to ensure parameter listing in docstrings match the + function/class signature. :issue:`9206` by `Alexandre Gramfort`_ and + `Raghav RV`_. + Enhancements ............ @@ -165,6 +169,9 @@ Trees and ensembles removed by setting it to ``None``. :issue:`7674` by :user:`Yichuan Liu `. + - :func:`tree.export_graphviz` now shows configurable number of decimal + places. :issue:`8698` by :user:`Guillaume Lemaitre `. + Linear, kernelized and related models - :class:`linear_model.SGDClassifier`, :class:`linear_model.SGDRegressor`, @@ -174,7 +181,7 @@ Linear, kernelized and related models ``tol`` parameters, to handle convergence more precisely. ``n_iter`` parameter is deprecated, and the fitted estimator exposes a ``n_iter_`` attribute, with actual number of iterations before - convergence. By `Tom Dupre la Tour`_. + convergence. :issue:`5036` by `Tom Dupre la Tour`_. - Added ``average`` parameter to perform weight averaging in :class:`linear_model.PassiveAggressiveClassifier`. :issue:`4939` @@ -190,14 +197,17 @@ Linear, kernelized and related models is a lot faster with ``return_std=True``. :issue:`8591` by :user:`Hadrien Bertrand `. - - Memory usage enhancement: Prevent cast from float32 to float64 in - :class:`linear_model.LogisticRegression` when using newton-cg - solver. :issue:`8835` by :user:`Joan Massich `. + - Added ``return_std`` to ``predict`` method of + :class:`linear_model.ARDRegression` and + :class:`linear_model.BayesianRidge`. + :issue:`7838` by :user:`Sergey Feldman `. 
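A quick sketch of the new ``return_std`` option mentioned above (the synthetic data
is illustrative only)::

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    rng = np.random.RandomState(0)
    X = rng.rand(50, 3)
    y = X.dot([1.0, 2.0, -1.0]) + 0.1 * rng.randn(50)

    reg = BayesianRidge().fit(X, y)
    # Also return the standard deviation of the predictive distribution
    # at each query point.
    y_mean, y_std = reg.predict(X, return_std=True)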
- - Memory usage enhancement: Prevent cast from float32 to float64 in - :class:`linear_model.Ridge` when using svd, sparse_cg, cholesky or lsqr solvers - :class:`linear_model.Ridge` when using svd, sparse_cg, cholesky or lsqr solvers - by :user:`Joan Massich `, :user:`Nicolas Cordier ` + - Memory usage enhancements: Prevent cast from float32 to float64 in: + :class:`linear_model.MultiTaskElasticNet`; + :class:`linear_model.LogisticRegression` when using newton-cg solver; and + :class:`linear_model.Ridge` when using svd, sparse_cg, cholesky or lsqr + solvers. :issue:`8835`, :issue:`8061` by :user:`Joan Massich ` and :user:`Nicolas + Cordier ` and :user:`Thierry Guillemot`. Other predictors @@ -205,6 +215,11 @@ Other predictors fewer constraints: they must take two 1d-arrays and return a float. :issue:`6288` by `Jake Vanderplas`_. + - ``algorithm='auto`` in :mod:`neighbors` estimators now chooses the most + appropriate algorithm for all input types and metrics. :issue:`9145` by + :user:`Herilalaina Rakotoarison ` and :user:`Reddy Chinthala + `. + Decomposition, manifold learning and clustering - :class:`cluster.MiniBatchKMeans` and :class:`cluster.KMeans` @@ -215,6 +230,7 @@ Decomposition, manifold learning and clustering :class:`decomposition.TruncatedSVD` now expose the singular values from the underlying SVD. They are stored in the attribute ``singular_values_``, like in :class:`decomposition.IncrementalPCA`. + :issue:`7685` by :user:`Tommy Löfstedt ` - :class:`decomposition.NMF` now faster when ``beta_loss=0``. :issue:`9277` by :user:`hongkahjun`. @@ -227,6 +243,10 @@ Decomposition, manifold learning and clustering `lvdmaaten/bhtsne `_ by :user:`Thomas Moreau ` and `Olivier Grisel`_. + - Memory usage enhancements: Prevent cast from float32 to float64 in + :class:`decomposition.PCA` and + :func:`decomposition.randomized_svd_low_rank`. + :issue:`9067` by `Raghav RV`_. Preprocessing and feature selection @@ -257,6 +277,10 @@ Model evaluation and meta-estimators within a pipeline by using the ``memory`` constructor parameter. :issue:`7990` by :user:`Guillaume Lemaitre `. + - :class:`pipeline.Pipeline` steps can now be accessed as attributes of its + ``named_steps`` attribute. :issue:`8586` by :user:`Herilalaina + Rakotoarison `. + - Added ``sample_weight`` parameter to :meth:`pipeline.Pipeline.score`. :issue:`7723` by :user:`Mikhail Korobov `. @@ -264,9 +288,11 @@ Model evaluation and meta-estimators A ``TypeError`` will be raised for any other kwargs. :issue:`8028` by :user:`Alexander Booth `. - - :class:`model_selection.GridSearchCV`, :class:`model_selection.RandomizedSearchCV` - and :func:`model_selection.cross_val_score` now allow estimators with callable - kernels which were previously prohibited. :issue:`8005` by `Andreas Müller`_ . + - :class:`model_selection.GridSearchCV`, + :class:`model_selection.RandomizedSearchCV` and + :func:`model_selection.cross_val_score` now allow estimators with callable + kernels which were previously prohibited. + :issue:`8005` by `Andreas Müller`_ . - :func:`model_selection.cross_val_predict` now returns output of the correct shape for all values of the argument ``method``. @@ -277,6 +303,9 @@ Model evaluation and meta-estimators :func:`model_selection.learning_curve`. :issue:`7506` by :user:`Narine Kokhlikyan `. + - :class:`model_selection.StratifiedShuffleSplit` now works with multioutput + multiclass (or multilabel) data. :issue:`9044` by `Vlad Niculae`_. + - Speed improvements to :class:`model_selection.StratifiedShuffleSplit`. 
:issue:`5991` by :user:`Arthur Mensch ` and `Joel Nothman`_.

@@ -285,11 +314,14 @@ Model evaluation and meta-estimators

   - :class:`multioutput.MultiOutputRegressor` and :class:`multioutput.MultiOutputClassifier`
     now support online learning using ``partial_fit``.
-     issue: `8053` by :user:`Peng Yu `.
+     :issue:`8053` by :user:`Peng Yu `.

   - Add ``max_train_size`` parameter to :class:`model_selection.TimeSeriesSplit`
     (see the usage sketch below). :issue:`8282` by :user:`Aman Dalmia `.

+   - More clustering metrics are now available through :func:`metrics.get_scorer`
+     and ``scoring`` parameters. :issue:`8117` by `Raghav RV`_.
+
 Metrics

   - :func:`metrics.matthews_corrcoef` now supports multiclass classification.
@@ -300,25 +332,31 @@ Metrics

 Miscellaneous

-   - :func:`utils.check_estimator` now attempts to ensure that methods transform, predict, etc.
-     do not set attributes on the estimator.
+   - :func:`utils.check_estimator` now attempts to ensure that methods
+     transform, predict, etc. do not set attributes on the estimator.
     :issue:`7533` by :user:`Ekaterina Krivich `.

   - Added type checking to the ``accept_sparse`` parameter in
-     :mod:`utils.validation` methods. This parameter now accepts only
-     boolean, string, or list/tuple of strings. ``accept_sparse=None`` is deprecated
-     and should be replaced by ``accept_sparse=False``.
+     :mod:`utils.validation` methods. This parameter now accepts only boolean,
+     string, or list/tuple of strings. ``accept_sparse=None`` is deprecated and
+     should be replaced by ``accept_sparse=False``.
     :issue:`7880` by :user:`Josh Karnofsky `.

   - Make it possible to load a chunk of an svmlight formatted file by
     passing a range of bytes to :func:`datasets.load_svmlight_file`.
     :issue:`935` by :user:`Olivier Grisel `.

+   - :class:`dummy.DummyClassifier` and :class:`dummy.DummyRegressor`
+     now accept non-finite features. :issue:`8931` by :user:`Attractadore`.
+
 Bug fixes
 .........

 Trees and ensembles

+   - Fixed a memory leak in trees when using ``criterion='mae'``.
+     :issue:`8002` by `Raghav RV`_.
+
   - Fixed a bug where :class:`ensemble.IsolationForest` uses an incorrect
     formula for the average path length.
     :issue:`8549` by `Peter Wang `_.

   - Fixed a bug where :class:`ensemble.AdaBoostClassifier` throws
     ``ZeroDivisionError`` while fitting data with single class labels.
     :issue:`7501` by :user:`Dominik Krzeminski `.

-   - Fixed a bug in :class:`ensemble.GradientBoostingClassifier`
-     and :class:`ensemble.GradientBoostingRegressor`
-     where a float being compared to ``0.0`` using ``==`` caused a divide by zero
-     error. issue:`7970` by :user:`He Chen `.
+   - Fixed a bug in :class:`ensemble.GradientBoostingClassifier` and
+     :class:`ensemble.GradientBoostingRegressor` where a float being compared
+     to ``0.0`` using ``==`` caused a divide by zero error. :issue:`7970` by
+     :user:`He Chen `.

   - Fix a bug where :class:`ensemble.GradientBoostingClassifier` and
     :class:`ensemble.GradientBoostingRegressor` ignored the
     ``min_impurity_split`` parameter.
     :issue:`8006` by :user:`Sebastian Pölsterl `.

   - Fixed ``oob_score`` in :class:`ensemble.BaggingClassifier`.
-     :issue:`8936` by :user:`mlewis1729 `
+     :issue:`8936` by :user:`Michael Lewis `.
+
+   - Fixed excessive memory usage in prediction for random forest estimators.
+     :issue:`8672` by :user:`Mike Benfield `.
+
+   - Fixed a bug where ``sample_weight`` as a list broke random forests in Python 2.
+     :issue:`8068` by :user:`xor`.

   - Fixed a bug where :class:`ensemble.IsolationForest` fails when
     ``max_features`` is less than 1.
     :issue:`5732` by :user:`Ishank Gulati `.
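
To illustrate the ``max_train_size`` option on
:class:`model_selection.TimeSeriesSplit` referenced earlier in this list, a
minimal sketch (the toy array below is ours, chosen only for illustration)::

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(6).reshape(-1, 1)

    # Without max_train_size the training window keeps growing with each
    # split; with max_train_size=2 it is capped at the two most recent samples.
    tscv = TimeSeriesSplit(n_splits=3, max_train_size=2)
    for train_idx, test_idx in tscv.split(X):
        print(train_idx, test_idx)
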
- - Fix a bug where - :class:`ensemble.gradient_boosting.QuantileLossFunction` computed - negative errors for negative values of ``ytrue - ypred`` leading to - wrong values when calling ``__call__``. + - Fix a bug where gradient boosting with ``loss='quantile'`` computed + negative errors for negative values of ``ytrue - ypred`` leading to wrong + values when calling ``__call__``. :issue:`8087` by :user:`Alexis Mignon ` - Fix a bug where :class:`ensemble.VotingClassifier` raises an error @@ -361,11 +404,13 @@ Trees and ensembles Linear, kernelized and related models - Fixed a bug where :func:`linear_model.RANSACRegressor.fit` may run until - ``max_iter`` if it finds a large inlier group early. :issue:`8251` by :user:`aivision2020`. + ``max_iter`` if it finds a large inlier group early. :issue:`8251` by + :user:`aivision2020`. - - Fixed a bug where :class:`naive_bayes.MultinomialNB` and :class:`naive_bayes.BernoulliNB` - failed when ``alpha=0``. :issue:`5814` by :user:`Yichuan Liu ` and - :user:`Herilalaina Rakotoarison `. + - Fixed a bug where :class:`naive_bayes.MultinomialNB` and + :class:`naive_bayes.BernoulliNB` failed when ``alpha=0``. :issue:`5814` by + :user:`Yichuan Liu ` and :user:`Herilalaina Rakotoarison + `. - Fixed a bug where :class:`linear_model.LassoLars` does not give the same result as the LassoLars implementation available @@ -378,8 +423,8 @@ Linear, kernelized and related models classes, and some values proposed in the docstring could raise errors. :issue:`5359` by `Tom Dupre la Tour`_. - - Fix inconsistent results between :class:`linear_model.RidgeCV` - and :class:`linear_model.Ridge` when using ``normalize=True`` + - Fix inconsistent results between :class:`linear_model.RidgeCV` and + :class:`linear_model.Ridge` when using ``normalize=True``. :issue:`9302` by `Alexandre Gramfort`_. - Fix a bug where :func:`linear_model.LassoLars.fit` sometimes @@ -471,6 +516,15 @@ Decomposition, manifold learning and clustering - Fixed improper scaling in :class:`cross_decomposition.PLSRegression` with ``scale=True``. :issue:`7819` by :user:`jayzed82 `. + - :class:`cluster.bicluster.SpectralCoclustering` and + :class:`cluster.bicluster.SpectralBiclustering` ``fit`` method conforms + with API by accepting ``y`` and returning the object. :issue:`6126`, + :issue:`7814` by :user:`Laurent Direr ` and :user:`Maniteja + Nandana `. + + - Fix bug where :mod:`mixture` ``sample`` methods did not return as many + samples as requested. :issue:`7702` by :user:`Levi John Wolf `. + Preprocessing and feature selection - For sparse matrices, :func:`preprocessing.normalize` with ``return_norm=True`` @@ -494,6 +548,10 @@ Preprocessing and feature selection pipeline with :class:`feature_extraction.text.TfidfTransformer`. :issue:`7565` by :user:`Roman Yurchak `. + - Fix a bug where :class:`feature_selection.mutual_info_regression` did not + correctly use ``n_neighbors``. :issue:`8181` by :user:`Guillaume Lemaitre + `. + Model evaluation and meta-estimators - Fixed a bug where :func:`model_selection.BaseSearchCV.inverse_transform` @@ -512,6 +570,9 @@ Model evaluation and meta-estimators reused the same estimator for each parameter value. :issue:`7365` by :user:`Aleksandr Sandrovskii `. + - :func:`model_selection.permutation_test_score` now works with Pandas + types. :issue:`5697` by :user:`Stijn Tonk `. + - Several fixes to input validation in :class:`multiclass.OutputCodeClassifier` :issue:`8086` by `Andreas Müller`_. @@ -545,7 +606,7 @@ Metrics by `Joel Nothman`_ and :user:`Jon Crall `. 
   - Fixed passing of ``gamma`` parameter to the ``chi2`` kernel in
     :func:`metrics.pairwise.pairwise_kernels` :issue:`5211` by
     :user:`Nick Rhinehart `,
     :user:`Saurabh Bansod ` and `Andreas Müller`_.

@@ -579,8 +640,11 @@ Miscellaneous
     documentation build with Sphinx>1.5 :issue:`8010`, :issue:`7986` by
     :user:`Oscar Najera `

-   - Add ``data_home`` parameter to
-     :func:`sklearn.datasets.fetch_kddcup99` by `Loic Esteve`_.
+   - Add ``data_home`` parameter to :func:`sklearn.datasets.fetch_kddcup99`.
+     :issue:`9289` by `Loic Esteve`_.
+
+   - Fix dataset loaders using Python 3 version of makedirs to also work in
+     Python 2. :issue:`9284` by :user:`Sebastin Santy `.

   - Several minor issues were fixed with thanks to the alerts of
     `lgtm.com <http://lgtm.com>`_. :issue:`9278` by :user:`Jean Helie `,

@@ -599,12 +663,24 @@ Trees and ensembles
     the weighted impurity decrease from splitting is no longer at least
     ``min_impurity_decrease``. :issue:`8449` by `Raghav RV`_.

+Linear, kernelized and related models
+
+   - ``n_iter`` parameter is deprecated in :class:`linear_model.SGDClassifier`,
+     :class:`linear_model.SGDRegressor`,
+     :class:`linear_model.PassiveAggressiveClassifier`,
+     :class:`linear_model.PassiveAggressiveRegressor` and
+     :class:`linear_model.Perceptron` (see the usage sketch below). By `Tom Dupre la Tour`_.
+
 Other predictors

   - :class:`neighbors.LSHForest` has been deprecated and will be
     removed in 0.21 due to poor performance.
     :issue:`9078` by :user:`Laurent Direr `.

+   - :class:`neighbors.NearestCentroid` no longer purports to support
+     ``metric='precomputed'``, which now raises an error. :issue:`8515` by
+     :user:`Sergul Aydore `.
+
   - The ``alpha`` parameter of :class:`semi_supervised.LabelPropagation` now
     has no effect and is deprecated to be removed in 0.21. :issue:`9239`
     by :user:`Andre Ambrosio Boechat `, :user:`Utkarsh Upadhyay

@@ -622,9 +698,12 @@ Decomposition, manifold learning and clustering
     has been renamed to ``n_components`` and will be removed in version 0.21.
     :issue:`8922` by :user:`Attractadore`.

-   - :class:`cluster.bicluster.SpectralCoclustering` and
-     :class:`cluster.bicluster.SpectralBiclustering` now accept ``y`` in fit.
-     :issue:`6126` by :user:`Laurent Direr `.
+   - :meth:`decomposition.SparsePCA.transform`'s ``ridge_alpha`` parameter is
+     deprecated in preference for the class parameter.
+     :issue:`8137` by :user:`Naoya Kanai `.
+
+   - :class:`cluster.DBSCAN` now has a ``metric_params`` parameter.
+     :issue:`8139` by :user:`Naoya Kanai `.

 Preprocessing and feature selection

@@ -633,7 +712,7 @@ Preprocessing and feature selection

   - :class:`feature_selection.SelectFromModel` now validates the ``threshold``
     parameter and sets the ``threshold_`` attribute during the call to
-     ``fit``, and no longer during the call to ``transform```, by `Andreas
+     ``fit``, and no longer during the call to ``transform``. By `Andreas
     Müller`_.

   - The ``non_negative`` parameter in :class:`feature_extraction.FeatureHasher`

@@ -664,6 +743,11 @@ Model evaluation and meta-estimators
     specifying ``train_size`` alone will cause ``test_size`` to be the remainder.
     :issue:`7459` by :user:`Nelson Liu `.

+   - :class:`multiclass.OneVsRestClassifier` now has ``partial_fit``,
+     ``decision_function`` and ``predict_proba`` methods only when the
+     underlying estimator does. :issue:`7812` by `Andreas Müller`_ and
+     :user:`Mikhail Korobov `.
+
   - :class:`multiclass.OneVsRestClassifier` now has a ``partial_fit`` method
     only if the underlying estimator does. By `Andreas Müller`_.
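
As a quick sketch of the ``n_iter`` deprecation noted under "Linear,
kernelized and related models" above: new code passes ``max_iter`` and
``tol`` instead, and reads the actual number of epochs from the fitted
``n_iter_`` attribute (toy data and parameter values chosen only for
illustration)::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    # max_iter bounds the number of passes over the data, while tol controls
    # the convergence-based stopping criterion.
    clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)
    print(clf.n_iter_)  # epochs actually run before stopping
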
@@ -878,6 +962,13 @@ Bug fixes
     parameter setting on the split produced by the first ``split`` call
     to the cross-validation splitter.
     :issue:`7660` by `Raghav RV`_.

+   - Fixed a bug where :meth:`preprocessing.MultiLabelBinarizer.fit_transform`
+     returned an invalid CSR matrix.
+     :issue:`7750` by :user:`CJ Carey `.
+
+   - Fixed a bug where :func:`metrics.pairwise.cosine_distances` could return a
+     small negative distance. :issue:`7732` by :user:`Artsion `.
+
 API changes summary
 -------------------