Closed
Commits
206 commits
e5fdbbb
Fix flake8 errors
marcelobeckmann Dec 20, 2018
dcf96f4
Test rebase
marcelobeckmann Dec 20, 2018
41a2748
Test rebase
marcelobeckmann Apr 10, 2019
d3221a7
Test CI
marcelobeckmann Apr 25, 2019
da71fba
Test rebase
marcelobeckmann Apr 25, 2019
a63c43f
Merge after remote pull
marcelobeckmann Apr 25, 2019
e50d9d9
Test rebase
marcelobeckmann Apr 25, 2019
47b20a9
Test rebase
marcelobeckmann Dec 20, 2018
12b773b
Test rebase
marcelobeckmann Apr 10, 2019
a32f8e7
Test CI
marcelobeckmann Apr 25, 2019
3480bf2
Test rebase
marcelobeckmann Apr 25, 2019
7be14ba
Test rebase
marcelobeckmann Apr 25, 2019
3d1f2bc
Text rebase
marcelobeckmann Apr 26, 2019
181a750
Fix CI errors
marcelobeckmann Apr 26, 2019
b8da4c9
Improve test coverage
marcelobeckmann Apr 30, 2019
e31f72b
Merge branch 'master' into HEAD
jnothman Apr 30, 2019
5096b76
Test CI
marcelobeckmann Apr 25, 2019
1230b6f
Test rebase
marcelobeckmann Apr 25, 2019
16b756f
More changes
marcelobeckmann Apr 30, 2019
db6303b
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann May 1, 2019
7cb7ce9
Fix merge
marcelobeckmann May 1, 2019
9679345
Improve test coverage
marcelobeckmann May 1, 2019
066d9fa
Test rebase
marcelobeckmann Dec 20, 2018
348bf40
Test rebase
marcelobeckmann Apr 10, 2019
1b6f8b6
Test CI
marcelobeckmann Apr 25, 2019
89b8884
Test rebase
marcelobeckmann Apr 25, 2019
ab8a61d
Test rebase
marcelobeckmann Apr 25, 2019
ef90d8e
Test rebase
marcelobeckmann Dec 20, 2018
dd1fdcd
Test rebase
marcelobeckmann Apr 25, 2019
71ce0c5
Test rebase
marcelobeckmann Apr 25, 2019
705fec9
Fix CI errors
marcelobeckmann Apr 26, 2019
9e5a2ac
Improve test coverage
marcelobeckmann Apr 30, 2019
1ed4550
TST Remove np.seterr calls in test files (#13712)
aditya1702 Apr 26, 2019
a3a3135
FIX Correct brier_score_loss when there's only one class in y_true (#…
qinhanmin2014 Apr 26, 2019
ecb50be
CI skip HashVectorizer test on pypy (#13729)
glemaitre Apr 26, 2019
57693b1
MAINT removed close_figure helper (#13730)
NicolasHug Apr 26, 2019
9fd98c7
[MRG+2] Faster Gradient Boosting Decision Trees with binned features …
NicolasHug Apr 26, 2019
ed2ce90
DOC Fixing language in Hamming loss docstring. (#13735)
mitar Apr 27, 2019
6708e0d
FEA OPTICS: add extract_xi method (#12077)
adrinjalali Apr 27, 2019
2ca1fa4
DOC new convention is :pr: not :issue:
jnothman Apr 27, 2019
fb21d0f
DOC what's new cleaning (#13706)
jnothman Apr 27, 2019
8cc70af
FIX euclidean_distances float32 numerical instabilities (#13554)
jeremiedbb Apr 29, 2019
a7654e4
DOC Update release dates
jnothman Apr 27, 2019
52bb273
DOC Add commit contributors
jnothman Apr 27, 2019
fcc9519
DOC bump version
jnothman Apr 27, 2019
1df1fea
DOC move 0.20 to previous releases
jnothman Apr 29, 2019
0339b55
Added distance_threshold parameter to hierarchical clustering (#9069)
VathsalaAchar Apr 29, 2019
19f1c57
DOC add missing kernels to pairwise_kernels (#13746)
hossein-pourbozorg Apr 30, 2019
dcf3a37
DOC more ambiguous May release date for 0.21
jnothman Apr 30, 2019
2b1a697
TST Ignore Kmeans test failures on MacOS (#12648)
qinhanmin2014 Apr 30, 2019
2cb2802
FIX Optics paper typo which resulted in undersized clusters (#13750)
qinhanmin2014 Apr 30, 2019
84dfcf1
TST use approximate equality for float comparison (#13749)
jnothman Apr 30, 2019
d798f06
More changes
marcelobeckmann Apr 30, 2019
4889385
Improve test coverage
marcelobeckmann May 1, 2019
1cd6979
Merge branch 'b5584' of https://github.com/marcelobeckmann/scikit-lea…
marcelobeckmann May 1, 2019
206cd26
Merge branch 'master' of https://github.com/marcelobeckmann/scikit-learn
marcelobeckmann May 1, 2019
43b77ef
Test rebase
marcelobeckmann Dec 20, 2018
6d847d4
Test rebase
marcelobeckmann Apr 10, 2019
9379e2c
Test CI
marcelobeckmann Apr 25, 2019
4bf77e7
Test rebase
marcelobeckmann Apr 25, 2019
992b5cb
Test rebase
marcelobeckmann Apr 25, 2019
dbc6f55
Test rebase
marcelobeckmann Dec 20, 2018
da825de
Test rebase
marcelobeckmann Apr 25, 2019
3090915
Test rebase
marcelobeckmann Apr 25, 2019
4d10175
Fix CI errors
marcelobeckmann Apr 26, 2019
460484f
Improve test coverage
marcelobeckmann Apr 30, 2019
8b7f236
TST Remove np.seterr calls in test files (#13712)
aditya1702 Apr 26, 2019
c699f8d
FIX Correct brier_score_loss when there's only one class in y_true (#…
qinhanmin2014 Apr 26, 2019
ddf9022
CI skip HashVectorizer test on pypy (#13729)
glemaitre Apr 26, 2019
b3ad764
MAINT removed close_figure helper (#13730)
NicolasHug Apr 26, 2019
b99aacc
[MRG+2] Faster Gradient Boosting Decision Trees with binned features …
NicolasHug Apr 26, 2019
8f1bcd3
DOC Fixing language in Hamming loss docstring. (#13735)
mitar Apr 27, 2019
cda1b54
FEA OPTICS: add extract_xi method (#12077)
adrinjalali Apr 27, 2019
6b10f24
DOC new convention is :pr: not :issue:
jnothman Apr 27, 2019
5209834
DOC what's new cleaning (#13706)
jnothman Apr 27, 2019
cc0184f
FIX euclidean_distances float32 numerical instabilities (#13554)
jeremiedbb Apr 29, 2019
ceb4b44
DOC Update release dates
jnothman Apr 27, 2019
172d21f
DOC Add commit contributors
jnothman Apr 27, 2019
f3b1544
DOC bump version
jnothman Apr 27, 2019
8ed3ecb
DOC move 0.20 to previous releases
jnothman Apr 29, 2019
0c4d489
Added distance_threshold parameter to hierarchical clustering (#9069)
VathsalaAchar Apr 29, 2019
6c1054e
DOC add missing kernels to pairwise_kernels (#13746)
hossein-pourbozorg Apr 30, 2019
ca12d35
DOC more ambiguous May release date for 0.21
jnothman Apr 30, 2019
9c09d9b
TST Ignore Kmeans test failures on MacOS (#12648)
qinhanmin2014 Apr 30, 2019
ed74af0
FIX Optics paper typo which resulted in undersized clusters (#13750)
qinhanmin2014 Apr 30, 2019
f40273f
TST use approximate equality for float comparison (#13749)
jnothman Apr 30, 2019
7dd2a9b
Improve test coverage
marcelobeckmann May 1, 2019
c5a4472
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann May 1, 2019
3a9f576
Improve test coverage
marcelobeckmann May 1, 2019
d3b10fe
Fix flake8 errors
marcelobeckmann May 1, 2019
fcb4763
Use _precompute_metric_params
marcelobeckmann May 7, 2019
6f2d98d
Fix flake8 errors
marcelobeckmann May 7, 2019
5c6c30d
Fix flake8 errors
marcelobeckmann May 7, 2019
745de05
Add ranges as parameters for data scale
marcelobeckmann May 11, 2019
6fa6e88
Fix flake8 errors
marcelobeckmann May 11, 2019
077b3cb
Fix flake8 errors
marcelobeckmann May 11, 2019
ded653a
Fix incorrect replace in metrics.rst
marcelobeckmann May 12, 2019
eb1ee32
Fix flake8 errors
marcelobeckmann May 12, 2019
782eb3d
Fix flake8 errors
marcelobeckmann May 12, 2019
4c03f5c
Update with master branch
marcelobeckmann May 20, 2019
3ca56d5
Update with master branch
marcelobeckmann May 20, 2019
29d82d5
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann May 21, 2019
16d339d
Merge with master branch
marcelobeckmann May 21, 2019
6ea57ac
Merge with master branch
marcelobeckmann May 21, 2019
16b9377
Fix utf-8 encoding
marcelobeckmann May 21, 2019
1474df8
Fix incorrect merge
marcelobeckmann May 22, 2019
49a5ac2
Merge branch 'b5584' of https://github.com/marcelobeckmann/scikit-lea…
marcelobeckmann May 22, 2019
c92d47d
Fix merge issues in pairwise.py
jnothman May 22, 2019
67491ce
Remove bak files
marcelobeckmann May 22, 2019
e123d36
Provide proper support to pairwise_distances method
marcelobeckmann Jun 13, 2019
a993bbe
Fix flake8 errors
marcelobeckmann Jun 14, 2019
5e4cf76
Fix flake8 errors
marcelobeckmann Jun 14, 2019
66650fa
Fix flake8 errors
marcelobeckmann Jun 14, 2019
23966ff
Fix flake8 errors
marcelobeckmann Jun 14, 2019
ae7f556
Apply minor fixes, tests, and comments
marcelobeckmann Jun 17, 2019
d257bba
Simplified range calculation
marcelobeckmann Jun 18, 2019
52ad60f
Fix flake8 error
marcelobeckmann Jun 18, 2019
336c183
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Jun 20, 2019
c4959fa
Remove gower for sparse matrix tests
marcelobeckmann Jun 21, 2019
faa404f
Fix flake8 errors
marcelobeckmann Jun 21, 2019
4a2d89e
Remove unnecessary if
marcelobeckmann Jun 21, 2019
5b84803
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Jun 21, 2019
cc58403
Makes a proper user of precomputed parameters
marcelobeckmann Jul 7, 2019
f69fd04
Fix flake8 errors
marcelobeckmann Jul 7, 2019
098bef9
Fix flake8 errors
marcelobeckmann Jul 7, 2019
bab9ca0
Add more tests cases
marcelobeckmann Jul 7, 2019
bba8828
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Jul 7, 2019
26779a0
Remove variables and simplify code readability
marcelobeckmann Jul 29, 2019
87e2f63
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Jul 29, 2019
ff6366b
Fix flake8 errors
marcelobeckmann Jul 29, 2019
0931f81
Improve robustness for pairwise_distances with gower
marcelobeckmann Aug 6, 2019
fa44c39
Fix flake8 errors
marcelobeckmann Aug 6, 2019
4a46ae1
Fix flake8 errors
marcelobeckmann Aug 6, 2019
38d99d5
Fix flake8 errors
marcelobeckmann Aug 6, 2019
a93efa5
Fix flake8 errors
marcelobeckmann Aug 6, 2019
512428d
Fix flake8 errors
marcelobeckmann Aug 6, 2019
6a403d5
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 6, 2019
29cd45e
Remove unnecessary conversion
marcelobeckmann Aug 6, 2019
a811d57
Improve robustness to test categorical values in other deployments
marcelobeckmann Aug 7, 2019
4d6d584
Fix flake8 errors
marcelobeckmann Aug 7, 2019
dbd4af5
Fix compilation error
marcelobeckmann Aug 7, 2019
091a7fa
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 7, 2019
117cca0
Fix flake8 errors
marcelobeckmann Aug 7, 2019
34e78ae
Detect incorrect NaN comparison in other deployments
marcelobeckmann Aug 8, 2019
0a802c3
Detect incorrect NaN comparison in other deployments
marcelobeckmann Aug 12, 2019
6df57c2
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 12, 2019
e610965
Detect test discrepancies in other deployments
marcelobeckmann Aug 12, 2019
1cddfdf
Detect test discrepancies in other deployments
marcelobeckmann Aug 12, 2019
850caa6
Detect test discrepancies in other deployments
marcelobeckmann Aug 12, 2019
545e496
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
6b438cf
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
b9d2188
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
df73f9e
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
bc08577
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 13, 2019
a73852e
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
127bc7b
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
8cd9ca3
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
e8c6624
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
a339e48
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 13, 2019
27b7fd9
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
8e37937
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
d11d8e7
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
c375fee
Use the _object_dtype_isnan to detect nan in mixed matrices data
marcelobeckmann Aug 21, 2019
462c6f3
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 21, 2019
8f4e9de
Fix flake8 error
marcelobeckmann Aug 21, 2019
58770f0
Fix code after code review
marcelobeckmann Sep 11, 2019
b3270a8
Merge with head
marcelobeckmann Sep 11, 2019
188e0ca
Fix flake8 errors
marcelobeckmann Sep 11, 2019
eb1ab6b
Remove files added incorrectly
marcelobeckmann Sep 19, 2019
eab56c4
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Sep 19, 2019
5f3421e
Fix flake8 errors
marcelobeckmann Sep 19, 2019
9317415
Changes after code review
marcelobeckmann Sep 20, 2019
d16f833
Fix flake8 errors
marcelobeckmann Sep 20, 2019
b23fc65
Fix flake8 errors
marcelobeckmann Sep 20, 2019
da6b46d
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Sep 20, 2019
d84be70
New proposal to avoid ZeroDivisionError
marcelobeckmann Sep 23, 2019
e67579d
New proposal to avoid ZeroDivisionError
marcelobeckmann Sep 23, 2019
e5167e0
Fix flake8 errors
marcelobeckmann Sep 23, 2019
7de895b
Improve categorical detection given hints from code review
marcelobeckmann Oct 23, 2019
c88cf0f
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 23, 2019
08e692a
Fix flake8 errors
marcelobeckmann Oct 23, 2019
19e4f0b
Fix flake8 errors
marcelobeckmann Oct 23, 2019
7d480b2
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 23, 2019
14d0d8b
Revert problematic merge with other's failures
marcelobeckmann Oct 24, 2019
3b3bb54
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 24, 2019
82707d2
Fix merge conflicts
marcelobeckmann Oct 24, 2019
e187e01
Fix flake8 errors
marcelobeckmann Oct 24, 2019
7370840
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 25, 2019
a86ba38
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 29, 2019
c0f3ee2
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 31, 2019
d1a116f
Propose fix after code review
marcelobeckmann Nov 12, 2019
8ddfb1b
Propose fix after code review
marcelobeckmann Nov 14, 2019
b37f750
Propose fix after code review
marcelobeckmann Nov 15, 2019
88f835d
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Nov 15, 2019
77d925f
Fix unit test errors in other types of deployment
marcelobeckmann Nov 15, 2019
72bc1dc
Improve performance for nan columns
marcelobeckmann Nov 20, 2019
984a6a0
Fix code after review
marcelobeckmann Nov 20, 2019
cf861bd
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Nov 20, 2019
1510744
Make some prints to figure out the unit test error in some specifc pl…
marcelobeckmann Nov 21, 2019
988028a
Make some prints to figure out the unit test error in some specifc pl…
marcelobeckmann Nov 22, 2019
8454f97
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Nov 22, 2019
f1d840d
Make some prints to figure out the unit test error in some specifc pl…
marcelobeckmann Nov 26, 2019
a8f2a65
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Nov 26, 2019
37359f0
Make some prints to figure out the unit test error in some specifc pl…
marcelobeckmann Nov 26, 2019
63c179e
Revert improvement to check full nan columns
marcelobeckmann Nov 27, 2019
8786f5d
Merge remote-tracking branch 'upstream/master' into b5584
adrinjalali Mar 29, 2020
40 changes: 40 additions & 0 deletions doc/modules/metrics.rst
Expand Up @@ -93,6 +93,46 @@ is equivalent to :func:`linear_kernel`, only slower.)
Information Retrieval. Cambridge University Press.
https://nlp.stanford.edu/IR-book/html/htmledition/the-vector-space-model-for-scoring-1.html

.. _gower_distances:

Gower distances
-----------------
The function :func:`gower_distances` computes the distances between the
Review comment (Member):
Suggested change
The function :func:`gower_distances` computes the distances between the
The function :func:`~sklearn.metrics.pairwise.gower_distances` computes the distances between the

observations in X and Y, that may contain combinations of numerical, boolean,
or categorical attributes, using an implementation of Gower Similarity.
Review comment (Member):
Please describe how we go from the similarity to the distance?


.. math::

g(\mathbf{x}, \mathbf{y}) = \frac{\sum_i(s(x_i, y_i))}{|\{i| x_i\text{ is not missing or }y_i\text{ is not missing}\}|}
Review comment (Member):
use i or k to index the features but not both please


Where:

x, y : array_like (1, n_features) are the observations to be compared.
Review comment (Member):
Suggested change
x, y : array_like (1, n_features) are the observations to be compared.
x, y : two samples to be compared.


s(x, y) : Calculates the similarity of all features (for k = 1 to n_features)
of x and y, as described by the expressions:

s(x_k, y_k) = 0, if k represents a boolean or categorical attribute,
Review comment (Member):
These should be rendered in latex :math:`formula here`

and they are equal.

s(x_k, y_k) = 1, if k represents a boolean or categorical attribute,
and they are unequal.

s(x_k, y_k) = abs(x_k - y_k), if k represents a numerical attribute.
Review comment (Member):
So IIUC, the scale of a numerical feature will have a huge impact on the final value? Should the features be standardized before computing the Gower similarity?

Review comment (Member):
The features are currently being min-max scaled within Gower unless scale=False.
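
To illustrate why the scaling matters, here is a minimal pure-Python sketch (not the PR's code) comparing the raw and range-scaled contributions of two numeric features measured in very different units. The helper name `scaled_abs_diff`, the sample values, and the feature bounds are all made up for illustration.

```python
def scaled_abs_diff(a, b, lo, hi):
    """Absolute difference divided by the feature range (hi - lo).

    Hypothetical helper: a zero range yields 0.0 because every value in
    that column is identical.
    """
    span = hi - lo
    return abs(a - b) / span if span else 0.0


# Two samples described by (age in years, income in dollars).
x = (30, 40_000)
y = (50, 90_000)

# Assumed observed (min, max) of each feature.
bounds = [(20, 60), (20_000, 120_000)]

raw = [abs(a - b) for a, b in zip(x, y)]
scaled = [
    scaled_abs_diff(a, b, lo, hi)
    for (a, b), (lo, hi) in zip(zip(x, y), bounds)
]

print(raw)     # income dominates: [20, 50000]
print(scaled)  # both features land on a 0-1 scale: [0.5, 0.5]
```

Without the division by the range, the income column would decide the distance almost entirely on its own.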


s(x_k, y_k) = 0, if x_k or y_k are missing.


The Gower formula combines a Manhattan (L1) distance for numeric features
with Hamming distance for categorical features to obtain a general coefficient
for categorical and numeric data.
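
The per-feature rules above can be sketched for a single pair of samples as follows. This is an illustrative pure-Python version, not the PR's vectorized implementation. It assumes a feature is skipped (and left out of the denominator) when either value is missing, that missing values are `None` or float NaN, and that numeric terms are divided by a precomputed range; the names `is_missing`, `gower_distance`, `categorical` and `ranges` are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def gower_distance(x, y, categorical, ranges):
    """Gower distance between two samples, following the documented rules.

    categorical[k] is True for boolean/categorical features; ranges[k] is
    the (max - min) range of numeric feature k and is ignored otherwise.
    """
    total = 0.0
    compared = 0
    for k, (a, b) in enumerate(zip(x, y)):
        if is_missing(a) or is_missing(b):
            continue  # contributes 0 and is left out of the denominator
        compared += 1
        if categorical[k]:
            total += 0.0 if a == b else 1.0  # Hamming-style term
        elif ranges[k]:
            total += abs(a - b) / ranges[k]  # range-scaled L1 term
        # a zero range means the column is constant, so it contributes 0
    return total / compared if compared else math.nan


x = (1.0, "red", 5.0)
y = (3.0, "blue", None)
categorical = [False, True, False]
ranges = [4.0, None, 10.0]

print(gower_distance(x, y, categorical, ranges))  # (2/4 + 1) / 2 = 0.75
```

The third feature is ignored because `y` has no value for it, so the sum of the two remaining terms is divided by 2 rather than 3.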

.. topic:: References:

* Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.
Review comment (Contributor):
Suggested change
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.

http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf
Review comment (Contributor):
Suggested change
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf


.. _linear_kernel:

Linear kernel
Expand Down
260 changes: 255 additions & 5 deletions sklearn/metrics/pairwise.py
Expand Up @@ -31,7 +31,7 @@

from ._pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
from ..exceptions import DataConversionWarning

from ..utils.fixes import _object_dtype_isnan

# Utility Functions
def _return_float_dtype(X, Y):
Expand Down Expand Up @@ -544,7 +544,7 @@ def pairwise_distances_argmin_min(X, Y, axis=1, metric="euclidean",
Valid values for metric are:

- from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
'manhattan']
'manhattan', 'gower']

- from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
Expand Down Expand Up @@ -632,7 +632,7 @@ def pairwise_distances_argmin(X, Y, axis=1, metric="euclidean",
Valid values for metric are:

- from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
'manhattan']
'manhattan', 'gower']

- from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
Expand Down Expand Up @@ -829,6 +829,232 @@ def cosine_distances(X, Y=None):
return S


def gower_distances(X, Y=None, categorical_features=None, scale=True):
"""Compute the distances between the observations in X and Y,
that may contain mixed types of data, using an implementation
of Gower formula.

Review comment (Member):
Please add "Read more in the :ref:User Guide <ref_to_UG>"

    Parameters
    ----------
    X : array-like, or pandas.DataFrame, shape (n_samples, n_features)

    Y : array-like, or pandas.DataFrame, optional,
        shape (n_samples, n_features)

    categorical_features : array-like, optional, shape (n_features)
        Indicates with True/False whether a column is a categorical attribute.
        This is useful when categorical attributes are represented as integer
        values. Categorical ordinal attributes are treated as numeric, and
        must be marked as false.

        Alternatively, the categorical_features array can be represented only
        with the numerical indexes of the categorical attribtes.
Review comment (Member):
Suggested change
with the numerical indexes of the categorical attribtes.
with the numerical indexes of the categorical attributes.

Review comment (Member):
should we also support categorical_features being a callable, as we do in ColumnTransformer?


Review comment (Member):
Default behaviour for categorical_features is not described

If the categorical_features array is not provided, by default all
non-numeric columns are considered categorical.
Review comment (Member):
Note that behaviour is undefined if columns mix numeric and non-numeric values.


    scale : boolean, list or array, optional (default=True)
        Indicates if the numerical columns will be scaled between 0 and 1.
        If false, it is assumed the numerical columns are already scaled.
        If a list or array, it must contain the ranges of values from
        numerical columns.

    Returns
    -------
    similarities : ndarray, shape (n_samples_X, n_samples_Y)
Review comment (Member):
distances?


    References
    ----------
    Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its
    Properties.

    Notes
    -----
    The numeric feature ranges are determined from both X and Y.

    Current implementation does not support sparse matrices.

    All the non-numerical types (e.g., str), are treated as categorical
    features.

    This implementation modifies the Gower's original similarity measure in
    the following aspects:

    * The values in the original similarity S range between 0 and 1. To
      guarantee this, it is assumed the numerical features of X and Y are
      scaled between 0 and 1.

    * Different from the original similarity S, this implementation
      returns 1-S.
    """
    if issparse(X) or issparse(Y):
        raise TypeError("Gower distance does not support sparse matrices")

    if not isinstance(scale, (bool, list, np.ndarray)):
        raise TypeError("Parameter scale must be boolean, list, or ndarray")

    if X is None or len(X) == 0:
        raise ValueError("X can not be None or empty")

    # It is necessary to convert to ndarray in advance to define the dtype
    # as np.object, otherwise numeric columns will be converted to string
    # if there are other string columns.
    if not isinstance(X, np.ndarray):
        X = np.asarray(X, dtype=np.object)

    if Y is not None and not isinstance(Y, np.ndarray):
        Y = np.asarray(Y, dtype=np.object)

    X, Y = check_pairwise_arrays(X, Y, precomputed=False, dtype=X.dtype,
                                 force_all_finite=False)

    X = np.asarray(X, dtype=np.object)

    cat_mask = _detect_categorical_features(X, categorical_features)
    num_mask = ~ cat_mask

    # Calculates the min and max values, and if requested, scale the
    # input values in order to obtain the distances between 0 and 1,
    # as proposed by the Gower's paper.
    ranges = 1
    if np.any(num_mask):
        process_scale = False
        if isinstance(scale, bool):
            process_scale = scale
        else:
            if len(np.asarray(scale).flatten()) != X[:, num_mask].shape[1]:
                raise ValueError("Length of scale parameter must be equal "
                                 "to the number of numerical columns.")
            process_scale = True

        ranges, min, max = _precompute_gower_params(X, Y, scale, num_mask)

        # avoid division by zero when all values in the column are the same
        ranges[ranges == 0] = 1

        # check if the data is pre-scaled when scale=False
        if not process_scale and (np.min(min) < 0 or np.max(max) > 1):
            raise ValueError("Input data is not scaled between 0 and 1.")

    D = np.zeros((X.shape[0], Y.shape[0]), dtype=np.float)

    for i in range(X.shape[0]):
        j_start = i

        # For non square results
        if X.shape[0] != Y.shape[0] or X is not Y:
            j_start = 0

        # Makes the comparison for np.nan for arrays with dtype=np.object,
        # this is necessary as some deployments return True for
        # np.nan == np.nan
        cat_nan_cols = (_object_dtype_isnan(X[i, cat_mask]) |
                        _object_dtype_isnan(Y[j_start:, cat_mask]))

        # Calculates the similarities for categorical columns
        cat_dists = ((X[i, cat_mask] != Y[j_start:, cat_mask]) | cat_nan_cols)
        # Calculates the Manhattan distances for numerical columns
        num_dists = abs(X[i, num_mask] -
                        Y[j_start:, num_mask]) / ranges

        # Calculates the number of non missing columns
        non_missing = X.shape[1] - (cat_nan_cols.sum(axis=1) +
                                    _object_dtype_isnan(num_dists).sum(axis=1)
                                    .astype(np.float32))

        # This is to avoid ZeroDivisionError
        non_missing[non_missing == 0] = np.nan

        # Gets the final results
        total = np.sum(cat_dists, axis=1) + np.sum(num_dists, axis=1)

        results = total / non_missing

        D[i, j_start:] = results
        if X is Y:
            D[i:, j_start] = results

    return D
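
The loop above starts each row at `j_start = i` when `X is Y` and mirrors the result into the column, so only the upper triangle is computed. The idea can be sketched in plain Python as below. This is an illustration of the symmetric-fill pattern only, with hypothetical names (`pairwise_symmetric`, `abs_diff`) and a toy distance in place of Gower.

```python
def pairwise_symmetric(samples, dist):
    """Fill an n x n distance matrix by computing only the upper triangle."""
    n = len(samples)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            d = dist(samples[i], samples[j])
            D[i][j] = d
            D[j][i] = d  # mirror; valid because dist(a, b) == dist(b, a)
    return D


calls = []


def abs_diff(a, b):
    """Toy symmetric distance that records how often it is called."""
    calls.append((a, b))
    return float(abs(a - b))


D = pairwise_symmetric([1, 4, 6], abs_diff)
print(D)           # [[0.0, 3.0, 5.0], [3.0, 0.0, 2.0], [5.0, 2.0, 0.0]]
print(len(calls))  # 6 evaluations instead of 9
```

The shortcut is only safe when both inputs are the same object and the distance is symmetric, which is why the PR falls back to `j_start = 0` otherwise.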


def _detect_categorical_features(X, categorical_features=None):
"""Identifies the numerical and non-numerical (categorical) columns
of an array.

Parameters
----------
X : array-like, or pandas.DataFrame, shape (n_samples, n_features)

categorical_features : array-like, optional, shape (n_features)
Indicates with True/False whether a column is a categorical attribute.

Alternatively, the categorical_features array can be represented only
with the numerical indexes of the categorical attribtes.
Review comment (Member):
*attributes

Review comment (Member):
Unresolved


If the categorical_features array is None, they will be automatically
Review comment (Member):
This needs to be specified in a docstring of a public function, not a private one. And it needs to give more detail about the automatic detection. As it is, I'm not comfortable about this automatic detection stuff, unless the column contains strings or pd.Categorical.

        detected in X. Numerical columns are identified as a subtype of
        np.number, whilst categorical columns are not a subtype of np.number.

    Returns
    -------
    categorical_features_mask : ndarray, shape (n_features)

    """
    # Automatic detection of categorical features
    if categorical_features is None:
        categorical_features = np.zeros(np.shape(X)[1], dtype=bool)

        def detect_cat(x):
Review comment (Member):
insert a blank line before this so that test run, please

            if not np.isnan(x):
                if np.issubdtype(type(x), np.number):
                    raise ValueError(False)
Review comment (Member):
This looks like a very unconventional way of providing control flow and passing values around. Why are we using exceptions rather than return values here?

Review comment (Member):
Okay. I see that we're applying pyufunc to check each element individually, and using exceptions to abort as soon as we have a non-NaN. This logic is very unclear from your code, and I see no benefit in doing it this way rather than an explicit python loop over elements, or something more functional-style:

non_nan_values = itertools.dropwhile(np.isnan, X[:, col])
try:
    value = next(non_nan_values)
except StopIteration:
    pass  # TODO: handle case when all values are NaN

# TODO: determine type from value
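
A runnable take on the functional-style suggestion might look like the sketch below. It is only one possible reading of the comment, not code from the PR. Since `np.isnan` raises `TypeError` on strings, the sketch uses a small `is_nan` helper instead. The `all_nan_default` parameter is hypothetical and stands in for the all-NaN case the reviewer leaves as a TODO, and booleans are classified as categorical on the assumption that they should be handled as the user guide text describes.

```python
import itertools
import math
import numbers


def is_nan(value):
    """True only for float NaN; strings and other objects are never NaN here."""
    return isinstance(value, float) and math.isnan(value)


def column_is_categorical(column, all_nan_default=False):
    """Classify a column from its first non-NaN value.

    all_nan_default is a hypothetical parameter: the all-NaN case is left
    open in the review, so this sketch simply returns a default.
    """
    non_nan_values = itertools.dropwhile(is_nan, column)
    try:
        value = next(non_nan_values)
    except StopIteration:
        return all_nan_default
    # bool is a subclass of int, so exclude it explicitly and treat
    # boolean columns as categorical.
    is_numeric = (isinstance(value, numbers.Number)
                  and not isinstance(value, bool))
    return not is_numeric


print(column_is_categorical([float("nan"), 2.5, 3.0]))  # False
print(column_is_categorical([float("nan"), "a", "b"]))  # True
print(column_is_categorical([float("nan")]))            # False (default)
```

Returning a value instead of raising `ValueError(True)` or `ValueError(False)` keeps the control flow explicit, which is the point of the review comment.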

                else:
                    raise ValueError(True)

        f_test = np.frompyfunc(detect_cat, 1, 1)
        for col in range(np.shape(X)[1]):
            try:
                # This identifies categorical and numerical columns,
                # A TypeError or ValueError(True) means it is a categorical
                # column.

                # This test was disabled because some deployments are returning
                # nan instead of 0 in columns with nan values:
                # if np.nansum(X[:, col]) > 0:
                f_test(X[:, col])
            except ValueError as e:
                categorical_features[col] = e.args[0]
            except TypeError:
                categorical_features[col] = True
    else:
        categorical_features = np.asarray(categorical_features)
        if np.issubdtype(categorical_features.dtype, np.integer):
            new_categorical_features = np.zeros(np.shape(X)[1], dtype=bool)
            new_categorical_features[categorical_features] = True
            categorical_features = new_categorical_features
    return categorical_features
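
The `else` branch above accepts either a boolean mask or a list of column indexes and normalizes both to a boolean mask. A minimal pure-Python sketch of that normalization follows. The function name `to_boolean_mask` is hypothetical, and the length check on boolean input is an extra safeguard assumed here, not something the PR code performs.

```python
def to_boolean_mask(categorical_features, n_features):
    """Return a list of n_features booleans marking categorical columns."""
    values = list(categorical_features)
    # A non-empty list of booleans is already a mask.
    if values and all(isinstance(v, bool) for v in values):
        if len(values) != n_features:
            raise ValueError("boolean mask must have one entry per feature")
        return values
    # Otherwise interpret the entries as column indexes.
    mask = [False] * n_features
    for index in values:
        mask[index] = True
    return mask


print(to_boolean_mask([0, 2], 4))
# [True, False, True, False]
print(to_boolean_mask([True, False, False, True], 4))
# [True, False, False, True]
```

An empty list is read as "no categorical columns" and yields an all-False mask.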


def _precompute_gower_params(X, Y, scale, num_mask):
    """Precompute data-derived metric parameters for gower distances
    """
    X_num = X[:, num_mask].astype(np.float32)
    min = np.nanmin(X_num, axis=0)
    max = np.nanmax(X_num, axis=0)

    if X is not Y and Y is not None:
        Y_num = Y[:, num_mask].astype(np.float32)
        min = np.minimum(np.nanmin(Y_num, axis=0), min)
        max = np.maximum(np.nanmax(Y_num, axis=0), max)

    if scale is None or type(scale) is bool:
        scale = np.abs(max - min)
    elif isinstance(scale, list):
        scale = np.asarray(scale)

    return scale, min, max
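
As a rough illustration of what this helper computes, the sketch below derives per-column ranges from the rows of both inputs while skipping NaN values. It is plain Python rather than NumPy, the name `pooled_ranges` is hypothetical, and replacing an empty or zero range with 1.0 is an assumption that folds the later `ranges[ranges == 0] = 1` guard into the same function.

```python
import math


def pooled_ranges(X_num, Y_num=None):
    """Per-column (max - min) over the rows of X_num and, if given, Y_num.

    NaN entries are skipped, mirroring the role of nanmin/nanmax. A column
    with no usable values or with identical values gets a range of 1.0 so
    that a later division is safe.
    """
    rows = list(X_num) + (list(Y_num) if Y_num is not None else [])
    n_columns = len(rows[0])
    ranges = []
    for k in range(n_columns):
        values = [row[k] for row in rows if not math.isnan(row[k])]
        span = max(values) - min(values) if values else 0.0
        ranges.append(span if span else 1.0)
    return ranges


X_num = [[1.0, 10.0], [3.0, float("nan")]]
Y_num = [[5.0, 10.0]]

print(pooled_ranges(X_num))         # [2.0, 1.0]
print(pooled_ranges(X_num, Y_num))  # [4.0, 1.0]
```

Pooling `Y` widens the first column's range from 2.0 to 4.0, which is why the docstring notes that numeric ranges are determined from both inputs.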


# Paired distances
def paired_euclidean_distances(X, Y):
"""
Expand Down Expand Up @@ -905,7 +1131,7 @@ def paired_cosine_distances(X, Y):
'l2': paired_euclidean_distances,
'l1': paired_manhattan_distances,
'manhattan': paired_manhattan_distances,
'cityblock': paired_manhattan_distances}
'cityblock': paired_manhattan_distances, }


def paired_distances(X, Y, metric="euclidean", **kwds):
Expand Down Expand Up @@ -1298,6 +1524,7 @@ def chi2_kernel(X, Y=None, gamma=1.):
'l2': euclidean_distances,
'l1': manhattan_distances,
'manhattan': manhattan_distances,
'gower': gower_distances,
'precomputed': None, # HACK: precomputed is always allowed, never called
'nan_euclidean': nan_euclidean_distances,
}
Expand All @@ -1322,6 +1549,7 @@ def distance_metrics():
'l1' metrics.pairwise.manhattan_distances
'l2' metrics.pairwise.euclidean_distances
'manhattan' metrics.pairwise.manhattan_distances
'gower' metrics.pairwise.gower_distances

Review comment:
@marcelobeckmann you have add gower to the list of metric function, but gower has not been added to the PAIRWISE_DISTANCE_FUNCTIONS collection.

Review comment (Author):
Thanks @darena-mdsol, I'll have a look in the PAIRWISE_DISTANCE_COLLECTION. About the weights, it was a misunderstanding from the original Gower formula, that weight is not necessary, and won't be added in the future.

Review comment:
@marcelobeckmann Thanks for your reply. I'm an on a project that allows a user to assign weight/importance to each variable. My team has been discussing if weight/importance is useful or not, in lieu of your reply. Could you explain the misunderstanding that you mentioned? Thanks.

Review comment (Author, marcelobeckmann, Feb 8, 2019):
Hi @darena-mdsol, the purpose of this PR is to implement the default Gower distance as described in the section 1 from original paper. If the weighted implementation as described in the section 4 needs to be implemented, then a new ticket needs to be open, I won't do that in this PR. My misunderstanding was the formulas in the section 4 were the main proposal from that paper.

'nan_euclidean' metrics.pairwise.nan_euclidean_distances
=============== ========================================

Expand Down Expand Up @@ -1400,7 +1628,7 @@ def _pairwise_callable(X, Y, metric, force_all_finite=True, **kwds):
'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto',
'russellrao', 'seuclidean', 'sokalmichener',
'sokalsneath', 'sqeuclidean', 'yule', "wminkowski",
'nan_euclidean', 'haversine']
'nan_euclidean', 'haversine', 'gower']

_NAN_METRICS = ['nan_euclidean']

Expand Down Expand Up @@ -1429,6 +1657,19 @@ def _check_chunk_size(reduced, chunk_size):
def _precompute_metric_params(X, Y, metric=None, **kwds):
    """Precompute data-derived metric parameters if not provided
    """
    if metric == 'gower':
        categorical_features = None
        if 'categorical_features' in kwds:
            categorical_features = kwds['categorical_features']

        num_mask = ~ _detect_categorical_features(X, categorical_features)
Review comment (Member):
Is there benefit to determining categorical features from both X and Y?


        scale = None
        if 'scale' in kwds:
            scale = kwds['scale']
        scale, _, _ = _precompute_gower_params(X, Y, scale, num_mask)

        return {'scale': scale}
Review comment (Member):
shouldn't we also return the determined categorical_features if they had been passed in as None?

    if metric == "seuclidean" and 'V' not in kwds:
        if X is Y:
            V = np.var(X, axis=0, ddof=1)
Expand Down Expand Up @@ -1721,6 +1962,15 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=None,
        check_non_negative(X, whom=whom)
        return X
    elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
        if metric == 'gower':
            # These conversions are necessary for matrices with string values
            if not isinstance(X, np.ndarray):
                X = np.asarray(X, dtype=np.object)
            if Y is not None and not isinstance(Y, np.ndarray):
                Y = np.asarray(Y, dtype=np.object)
            params = _precompute_metric_params(X, Y, metric=metric, **kwds)
            kwds.update(**params)

        func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
    elif callable(metric):
        func = partial(_pairwise_callable, metric=metric,