
[WIP] PCA NEP-37 adding random pathway and CuPy test #17676

Draft pull request: wants to merge 47 commits into base branch main.

Commits (47):
3882723  MNT Adds labeler (thomasjpfan, Feb 21, 2020)
d79390b  BUG Fix (thomasjpfan, Feb 21, 2020)
d71ecae  Double quotes are better (thomasjpfan, Feb 21, 2020)
919a519  BUG Fix (thomasjpfan, Feb 21, 2020)
a89754c  BUG Fix (thomasjpfan, Feb 21, 2020)
0f56d96  MNT Adds build ci tag (thomasjpfan, Feb 21, 2020)
e4ee673  MNT Use fork for new feature (thomasjpfan, Feb 22, 2020)
409be8d  Merge branch 'only_change_setup' (thomasjpfan, Feb 22, 2020)
3e729e4  MNT Uses tagged version (thomasjpfan, Feb 22, 2020)
faef88c  Merge remote-tracking branch 'upstream/master' (thomasjpfan, Feb 26, 2020)
60c3834  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
e7bfda8  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
8cdfd2d  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
a8dc598  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
adb07db  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
df8ca82  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
ea1c0fb  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
4fe65fe  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
5ca03e0  Merge remote-tracking branch 'upstream/master' into pca_array_functio… (thomasjpfan, Feb 27, 2020)
fe6293d  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
8c3001f  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
33bbd52  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
19b9d2a  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
a988b8a  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
a33bde0  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
d50fbbf  WIP Testing nep37 [skip ci] (thomasjpfan, Feb 27, 2020)
817c4f7  BUG Fix (thomasjpfan, Feb 28, 2020)
2718bb7  Merge remote-tracking branch 'upstream/master' into pca_array_functio… (thomasjpfan, Apr 20, 2020)
d14166f  WIP Update (thomasjpfan, Apr 20, 2020)
e2b74a5  WIP Enables support for JAx (thomasjpfan, Apr 21, 2020)
100b0f9  WIP [ci skip] (thomasjpfan, Apr 21, 2020)
7578d90  ENH adds support for minmaxScaler (thomasjpfan, Apr 21, 2020)
fc5d9e8  Fix extra n_features + better error message (ogrisel, Jun 22, 2020)
a24515c  Merge master + pass npx to linalg.svd (ogrisel, Jun 22, 2020)
f33e325  Add missing docstring to make the tests pass (ogrisel, Jun 22, 2020)
a96afc5  Add a test for jax compat (ogrisel, Jun 22, 2020)
adf4692  Let's focus on svd_solver='full' for now (ogrisel, Jun 22, 2020)
c1c2f20  CI Be nicer to the ci (thomasjpfan, Jun 22, 2020)
0598331  Enabling random pathway for cuPy (with iterated_power==2 only for now) (viclafargue, Jun 23, 2020)
71be7ba  Improved testing (viclafargue, Jun 23, 2020)
79f6177  Use QR when LU is not available (viclafargue, Jun 24, 2020)
fc93b8d  Use _get_array_module for randomized_svd (viclafargue, Jun 25, 2020)
fe60628  PCA GPU benchmark (viclafargue, Jun 25, 2020)
9340f7e  Adding warm-up for reliable benchmarking (viclafargue, Jun 25, 2020)
cc3e539  2 warmups better than 1 (viclafargue, Jun 25, 2020)
0610142  Display sum of explained variance (viclafargue, Jun 25, 2020)
51e1f11  Comparison with cuML (viclafargue, Jun 26, 2020)
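Taken together, the commits build toward letting PCA run on duck arrays such as CuPy through a NEP-37-style dispatch. Below is a minimal sketch of the usage pattern the branch targets, inferred from the benchmark and tests in the diff; the `enable_duck_array` flag exists only on this branch, and the example assumes a CUDA-capable GPU with CuPy installed.

```python
import numpy as np
import cupy as cp  # assumes a CUDA-capable GPU
import sklearn
from sklearn.decomposition import PCA

# Move a float32 design matrix to the GPU.
X_gpu = cp.asarray(np.random.RandomState(0)
                   .standard_normal((10_000, 100)).astype(np.float32))

# `enable_duck_array` is the experimental flag added on this branch.
with sklearn.config_context(enable_duck_array=True):
    pca = PCA(n_components=3, svd_solver="randomized",
              iterated_power=2, random_state=0)
    X_t = pca.fit_transform(X_gpu)   # computation dispatches to CuPy

print(type(X_t))   # expected on this branch: <class 'cupy.ndarray'>
```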
206 changes: 103 additions & 103 deletions azure-pipelines.yml
@@ -67,23 +67,23 @@ jobs:
SKLEARN_SKIP_NETWORK_TESTS: '0'

# Will run all the time regardless of linting outcome.
- template: build_tools/azure/posix.yml
parameters:
name: Linux_Runs
vmImage: ubuntu-18.04
matrix:
pylatest_conda_mkl:
DISTRIB: 'conda'
PYTHON_VERSION: '*'
BLAS: 'mkl'
NUMPY_VERSION: '*'
SCIPY_VERSION: '*'
CYTHON_VERSION: '*'
PILLOW_VERSION: '*'
PYTEST_VERSION: '*'
JOBLIB_VERSION: '*'
THREADPOOLCTL_VERSION: '2.0.0'
COVERAGE: 'true'
# - template: build_tools/azure/posix.yml
# parameters:
# name: Linux_Runs
# vmImage: ubuntu-18.04
# matrix:
# pylatest_conda_mkl:
# DISTRIB: 'conda'
# PYTHON_VERSION: '*'
# BLAS: 'mkl'
# NUMPY_VERSION: '*'
# SCIPY_VERSION: '*'
# CYTHON_VERSION: '*'
# PILLOW_VERSION: '*'
# PYTEST_VERSION: '*'
# JOBLIB_VERSION: '*'
# THREADPOOLCTL_VERSION: '2.0.0'
# COVERAGE: 'true'

- template: build_tools/azure/posix.yml
parameters:
@@ -95,31 +95,31 @@ jobs:
# Linux environment to test that scikit-learn can be built against
# versions of numpy, scipy with ATLAS that comes with Ubuntu Bionic 18.04
# i.e. numpy 1.13.3 and scipy 0.19
py36_ubuntu_atlas:
DISTRIB: 'ubuntu'
PYTHON_VERSION: '3.6'
JOBLIB_VERSION: '0.11'
PYTEST_XDIST: 'false'
THREADPOOLCTL_VERSION: '2.0.0'
# Linux + Python 3.6 build with OpenBLAS and without SITE_JOBLIB
py36_conda_openblas:
DISTRIB: 'conda'
PYTHON_VERSION: '3.6'
BLAS: 'openblas'
NUMPY_VERSION: '1.13.3'
SCIPY_VERSION: '0.19.1'
PANDAS_VERSION: '*'
CYTHON_VERSION: '*'
# temporary pin pytest due to unknown failure with pytest 5.3
PYTEST_VERSION: '5.2'
PILLOW_VERSION: '4.2.1'
MATPLOTLIB_VERSION: '2.1.1'
SCIKIT_IMAGE_VERSION: '*'
# latest version of joblib available in conda for Python 3.6
JOBLIB_VERSION: '0.13.2'
THREADPOOLCTL_VERSION: '2.0.0'
PYTEST_XDIST: 'false'
COVERAGE: 'true'
# py36_ubuntu_atlas:
# DISTRIB: 'ubuntu'
# PYTHON_VERSION: '3.6'
# JOBLIB_VERSION: '0.11'
# PYTEST_XDIST: 'false'
# THREADPOOLCTL_VERSION: '2.0.0'
# # Linux + Python 3.6 build with OpenBLAS and without SITE_JOBLIB
# py36_conda_openblas:
# DISTRIB: 'conda'
# PYTHON_VERSION: '3.6'
# BLAS: 'openblas'
# NUMPY_VERSION: '1.13.3'
# SCIPY_VERSION: '0.19.1'
# PANDAS_VERSION: '*'
# CYTHON_VERSION: '*'
# # temporary pin pytest due to unknown failure with pytest 5.3
# PYTEST_VERSION: '5.2'
# PILLOW_VERSION: '4.2.1'
# MATPLOTLIB_VERSION: '2.1.1'
# SCIKIT_IMAGE_VERSION: '*'
# # latest version of joblib available in conda for Python 3.6
# JOBLIB_VERSION: '0.13.2'
# THREADPOOLCTL_VERSION: '2.0.0'
# PYTEST_XDIST: 'false'
# COVERAGE: 'true'
# Linux environment to test the latest available dependencies and MKL.
# It runs tests requiring lightgbm, pandas and PyAMG.
pylatest_pip_openblas_pandas:
@@ -131,66 +131,66 @@ jobs:
TEST_DOCSTRINGS: 'true'
CHECK_WARNINGS: 'true'

- template: build_tools/azure/posix-32.yml
parameters:
name: Linux32
vmImage: ubuntu-18.04
dependsOn: [linting]
condition: and(ne(variables['Build.Reason'], 'Schedule'), succeeded('linting'))
matrix:
py36_ubuntu_atlas_32bit:
DISTRIB: 'ubuntu-32'
PYTHON_VERSION: '3.6'
JOBLIB_VERSION: '0.13'
THREADPOOLCTL_VERSION: '2.0.0'
# - template: build_tools/azure/posix-32.yml
# parameters:
# name: Linux32
# vmImage: ubuntu-18.04
# dependsOn: [linting]
# condition: and(ne(variables['Build.Reason'], 'Schedule'), succeeded('linting'))
# matrix:
# py36_ubuntu_atlas_32bit:
# DISTRIB: 'ubuntu-32'
# PYTHON_VERSION: '3.6'
# JOBLIB_VERSION: '0.13'
# THREADPOOLCTL_VERSION: '2.0.0'

- template: build_tools/azure/posix.yml
parameters:
name: macOS
vmImage: macOS-10.14
dependsOn: [linting]
condition: and(ne(variables['Build.Reason'], 'Schedule'), succeeded('linting'))
matrix:
pylatest_conda_mkl:
DISTRIB: 'conda'
PYTHON_VERSION: '*'
BLAS: 'mkl'
NUMPY_VERSION: '*'
SCIPY_VERSION: '*'
CYTHON_VERSION: '*'
PILLOW_VERSION: '*'
PYTEST_VERSION: '*'
JOBLIB_VERSION: '*'
THREADPOOLCTL_VERSION: '2.0.0'
COVERAGE: 'true'
pylatest_conda_mkl_no_openmp:
DISTRIB: 'conda'
PYTHON_VERSION: '*'
BLAS: 'mkl'
NUMPY_VERSION: '*'
SCIPY_VERSION: '*'
CYTHON_VERSION: '*'
PILLOW_VERSION: '*'
PYTEST_VERSION: '*'
JOBLIB_VERSION: '*'
THREADPOOLCTL_VERSION: '2.0.0'
COVERAGE: 'true'
SKLEARN_TEST_NO_OPENMP: 'true'
SKLEARN_SKIP_OPENMP_TEST: 'true'
# - template: build_tools/azure/posix.yml
# parameters:
# name: macOS
# vmImage: macOS-10.14
# dependsOn: [linting]
# condition: and(ne(variables['Build.Reason'], 'Schedule'), succeeded('linting'))
# matrix:
# pylatest_conda_mkl:
# DISTRIB: 'conda'
# PYTHON_VERSION: '*'
# BLAS: 'mkl'
# NUMPY_VERSION: '*'
# SCIPY_VERSION: '*'
# CYTHON_VERSION: '*'
# PILLOW_VERSION: '*'
# PYTEST_VERSION: '*'
# JOBLIB_VERSION: '*'
# THREADPOOLCTL_VERSION: '2.0.0'
# COVERAGE: 'true'
# pylatest_conda_mkl_no_openmp:
# DISTRIB: 'conda'
# PYTHON_VERSION: '*'
# BLAS: 'mkl'
# NUMPY_VERSION: '*'
# SCIPY_VERSION: '*'
# CYTHON_VERSION: '*'
# PILLOW_VERSION: '*'
# PYTEST_VERSION: '*'
# JOBLIB_VERSION: '*'
# THREADPOOLCTL_VERSION: '2.0.0'
# COVERAGE: 'true'
# SKLEARN_TEST_NO_OPENMP: 'true'
# SKLEARN_SKIP_OPENMP_TEST: 'true'

- template: build_tools/azure/windows.yml
parameters:
name: Windows
vmImage: vs2017-win2016
dependsOn: [linting]
condition: and(ne(variables['Build.Reason'], 'Schedule'), succeeded('linting'))
matrix:
py37_conda_mkl:
PYTHON_VERSION: '3.7'
CHECK_WARNINGS: 'true'
PYTHON_ARCH: '64'
PYTEST_VERSION: '*'
COVERAGE: 'true'
py36_pip_openblas_32bit:
PYTHON_VERSION: '3.6'
PYTHON_ARCH: '32'
# - template: build_tools/azure/windows.yml
# parameters:
# name: Windows
# vmImage: vs2017-win2016
# dependsOn: [linting]
# condition: and(ne(variables['Build.Reason'], 'Schedule'), succeeded('linting'))
# matrix:
# py37_conda_mkl:
# PYTHON_VERSION: '3.7'
# CHECK_WARNINGS: 'true'
# PYTHON_ARCH: '64'
# PYTEST_VERSION: '*'
# COVERAGE: 'true'
# py36_pip_openblas_32bit:
# PYTHON_VERSION: '3.6'
# PYTHON_ARCH: '32'
82 changes: 82 additions & 0 deletions benchmarks/bench_pca_gpu.py
@@ -0,0 +1,82 @@
"""
A comparison of PCA with/wthout GPU

Obtained with NVIDIA Tesla V100

With svd_solver='full' and iterated_power=2:
Without GPU : runtime: 1.197s, explained variance: 6.967
With CuPy : runtime: 0.016s, explained variance: 6.967
With cuML : runtime: 0.007s, explained variance: 6.967

With svd_solver='full' and iterated_power=10:
Without GPU : runtime: 1.210s, explained variance: 6.967
With CuPy : runtime: 0.016s, explained variance: 6.967
With cuML : runtime: 0.007s, explained variance: 6.967

With svd_solver='randomized' and iterated_power=2:
Without GPU : runtime: 0.032s, explained variance: 6.745
With CuPy : runtime: 0.012s, explained variance: 6.693

With svd_solver='randomized' and iterated_power=10:
Without GPU : runtime: 0.204s, explained variance: 6.945
With CuPy : runtime: 0.040s, explained variance: 6.949
"""

import sklearn
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from cuml.decomposition import PCA as cumlPCA
import numpy as np
import cupy as cp
import time


if __name__ == '__main__':
import cupy
X_np, y_np = make_classification(n_samples=10000, n_features=100)
X_np = X_np.astype(np.float32)
X_cp = cp.asfortranarray(cp.asarray(X_np))

for svd_solver in ["full", "randomized"]:
for iterated_power in [2, 10]:
pca_np = PCA(n_components=3, svd_solver=svd_solver, copy=True,
random_state=0, iterated_power=iterated_power)
pca_np.fit_transform(X_np) # Warm-up
pca_np = PCA(**pca_np.get_params()) # Overwrite to clean model
t0 = time.time()
pca_np.fit_transform(X_np)
without_gpu_time = time.time() - t0
exp_var_np = pca_np.explained_variance_.sum()

cupy_time = 0
exp_var_cp = 0
with sklearn.config_context(enable_duck_array=True):
pca_cp = PCA(**pca_np.get_params())
pca_cp.fit_transform(X_cp) # Warm-up
pca_cp = PCA(**pca_np.get_params()) # Overwrite to clean model
t0 = time.time()
pca_cp.fit_transform(X_cp)
cupy_time = time.time() - t0
exp_var_cp = cp.asnumpy(pca_cp.explained_variance_.sum())

if svd_solver == 'full':
pca_cuml = cumlPCA(**pca_np.get_params())
pca_cuml.fit_transform(X_cp) # Warm-up
pca_cuml = cumlPCA(**pca_np.get_params()) # Overwrite to clean model
t0 = time.time()
pca_cuml.fit_transform(X_cp)
cuml_time = time.time() - t0
exp_var_cuml = pca_np.explained_variance_.sum()

msg = 'With svd_solver=\'{}\' and iterated_power={}:'
print(msg.format(svd_solver, iterated_power))
m1 = '\tWithout GPU : runtime: {:.3f}s, explained variance: {:.3f}'
print(m1.format(without_gpu_time, exp_var_np))
m2 = '\tWith CuPy : runtime: {:.3f}s, explained variance: {:.3f}'
print(m2.format(cupy_time, exp_var_cp))

m3 = '\tWith cuML : runtime: {:.3f}s, explained variance: {:.3f}\n'
if svd_solver == 'full':
print(m3.format(cuml_time, exp_var_cuml))
else:
print()
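Beyond timing, a natural follow-up is to check that the CPU and CuPy pathways agree numerically. The snippet below is a hedged sketch along those lines, not part of the PR; it again assumes CuPy, a GPU, and this branch's `enable_duck_array` flag, and it compares absolute component values because principal axes are only defined up to a sign flip.

```python
import numpy as np
import cupy as cp
import sklearn
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X_np, _ = make_classification(n_samples=2000, n_features=50, random_state=0)
X_np = X_np.astype(np.float32)
X_cp = cp.asarray(X_np)

pca_cpu = PCA(n_components=3, svd_solver="full", random_state=0).fit(X_np)
with sklearn.config_context(enable_duck_array=True):
    pca_gpu = PCA(n_components=3, svd_solver="full", random_state=0).fit(X_cp)

# The sign of each component is arbitrary, so compare magnitudes.
np.testing.assert_allclose(np.abs(cp.asnumpy(pca_gpu.components_)),
                           np.abs(pca_cpu.components_), rtol=1e-3, atol=1e-4)
np.testing.assert_allclose(cp.asnumpy(pca_gpu.explained_variance_),
                           pca_cpu.explained_variance_, rtol=1e-3)
```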
2 changes: 1 addition & 1 deletion build_tools/azure/install.sh
@@ -94,7 +94,7 @@ elif [[ "$DISTRIB" == "conda-pip-latest" ]]; then
python -m pip install -U pip
python -m pip install pytest==$PYTEST_VERSION pytest-cov

python -m pip install pandas matplotlib pyamg scikit-image
python -m pip install pandas matplotlib pyamg scikit-image jax jaxlib
# do not install dependencies for lightgbm since it requires scikit-learn
python -m pip install lightgbm --no-deps
elif [[ "$DISTRIB" == "conda-pip-scipy-dev" ]]; then
2 changes: 1 addition & 1 deletion build_tools/azure/test_script.sh
@@ -42,5 +42,5 @@ cp setup.cfg $TEST_DIR
cd $TEST_DIR

set -x
$TEST_CMD --pyargs sklearn
$TEST_CMD --pyargs sklearn.decomposition
set +x
6 changes: 5 additions & 1 deletion sklearn/_config.py
@@ -8,6 +8,7 @@
'working_memory': int(os.environ.get('SKLEARN_WORKING_MEMORY', 1024)),
'print_changed_only': True,
'display': 'text',
'enable_duck_array': False,
}


@@ -28,7 +29,8 @@ def get_config():


def set_config(assume_finite=None, working_memory=None,
print_changed_only=None, display=None):
print_changed_only=None, display=None,
enable_duck_array=None):
"""Set global scikit-learn configuration

.. versionadded:: 0.19
@@ -80,6 +82,8 @@ def set_config(assume_finite=None, working_memory=None,
_global_config['print_changed_only'] = print_changed_only
if display is not None:
_global_config['display'] = display
if enable_duck_array is not None:
_global_config['enable_duck_array'] = enable_duck_array


@contextmanager
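For reference, a small sketch of how the new flag is exposed, based only on this diff and on the benchmark's use of `config_context`; nothing here is part of a released scikit-learn.

```python
import sklearn

# Global switch: stored in the same _global_config dict as the other options.
sklearn.set_config(enable_duck_array=True)
assert sklearn.get_config()["enable_duck_array"] is True
sklearn.set_config(enable_duck_array=False)

# Scoped switch: config_context forwards its keyword arguments to set_config,
# so estimators fitted inside the block see the flag enabled.
with sklearn.config_context(enable_duck_array=True):
    assert sklearn.get_config()["enable_duck_array"] is True
assert sklearn.get_config()["enable_duck_array"] is False
```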
13 changes: 6 additions & 7 deletions sklearn/decomposition/_base.py
@@ -9,7 +9,6 @@
# License: BSD 3 clause

import numpy as np
from scipy import linalg

from ..base import BaseEstimator, TransformerMixin
from ..utils import check_array
@@ -44,7 +43,7 @@ def get_covariance(self):
cov.flat[::len(cov) + 1] += self.noise_variance_ # modify diag inplace
return cov

def get_precision(self):
def get_precision(self, npx=np):
"""Compute data precision matrix with the generative model.

Equals the inverse of the covariance but computed with
@@ -61,7 +60,7 @@ def get_precision(self):
if self.n_components_ == 0:
return np.eye(n_features) / self.noise_variance_
if self.n_components_ == n_features:
return linalg.inv(self.get_covariance())
return npx.linalg.inv(self.get_covariance())

# Get precision using matrix inversion lemma
components_ = self.components_
@@ -70,11 +69,11 @@ def get_precision(self):
components_ = components_ * np.sqrt(exp_var[:, np.newaxis])
exp_var_diff = np.maximum(exp_var - self.noise_variance_, 0.)
precision = np.dot(components_, components_.T) / self.noise_variance_
precision.flat[::len(precision) + 1] += 1. / exp_var_diff
precision = np.dot(components_.T,
np.dot(linalg.inv(precision), components_))
precision.ravel()[::len(precision) + 1] += 1. / exp_var_diff
precision = npx.dot(components_.T,
npx.dot(npx.linalg.inv(precision), components_))
precision /= -(self.noise_variance_ ** 2)
precision.flat[::len(precision) + 1] += 1. / self.noise_variance_
precision.ravel()[::len(precision) + 1] += 1. / self.noise_variance_
return precision

@abstractmethod
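Two idioms carry this hunk: the array namespace is passed in as `npx`, so the same code can call either `numpy.linalg.inv` or `cupy.linalg.inv`, and the in-place diagonal update goes through `ravel()` rather than `.flat`, presumably because CuPy does not support assignment through `.flat` (an assumption inferred from the change, not stated in the PR). A standalone sketch of both, not taken from the PR:

```python
import numpy as np


def inv_with_diag_shift(cov, shift, npx=np):
    """Invert `cov` after adding `shift` to its diagonal in place.

    `npx` selects the array namespace; passing cupy instead of numpy would
    run the same code on the GPU (sketch only, CuPy is not imported here).
    """
    # ravel() returns a view for C-contiguous arrays, so this strided
    # assignment writes into `cov` itself.
    cov.ravel()[::cov.shape[0] + 1] += shift
    return npx.linalg.inv(cov)


cov = np.eye(3, dtype=np.float64)
prec = inv_with_diag_shift(cov, 1.0)       # diagonal becomes 2.0
assert np.allclose(np.diag(cov), 2.0)
assert np.allclose(prec, 0.5 * np.eye(3))  # inverse of 2*I is 0.5*I
```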