Implement Gower similarity coefficient #9555
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -93,6 +93,46 @@ is equivalent to :func:`linear_kernel`, only slower.) | |||||
Information Retrieval. Cambridge University Press. | ||||||
https://nlp.stanford.edu/IR-book/html/htmledition/the-vector-space-model-for-scoring-1.html | ||||||
|
||||||
.. _gower_distances: | ||||||
|
||||||
Gower distances | ||||||
----------------- | ||||||
The function :func:`gower_distances` computes the distances between the | ||||||
observations in X and Y, that may contain combinations of numerical, boolean, | ||||||
or categorical attributes, using an implementation of Gower Similarity. | ||||||
Review comment: Please describe how we go from the similarity to the distance?
|
||||||
.. math:: | ||||||
|
||||||
g(\mathbf{x}, \mathbf{y}) = \frac{\sum_i(s(x_i, y_i))}{|\{i| x_i\text{ is not missing or }y_i\text{ is not missing}\}|} | ||||||
Review comment: Use i or k to index the features, but not both, please.
|
||||||
Where: | ||||||
|
||||||
x, y : array_like (1, n_features) are the observations to be compared. | ||||||
|
||||||
s(x, y) : Calculates the similarity of all features (for k = 1 to n_features) | ||||||
of x and y, as described by the expressions: | ||||||
|
||||||
s(x_k, y_k) = 0, if k represents a boolean or categorical attribute, | ||||||
Review comment: These should be rendered in LaTeX.
and they are equal. | ||||||
|
||||||
s(x_k, y_k) = 1, if k represents a boolean or categorical attribute, | ||||||
and they are unequal. | ||||||
|
||||||
s(x_k, y_k) = abs(x_k - y_k), if k represents a numerical attribute. | ||||||
Review comment: So IIUC, the scale of a numerical feature will have a huge impact on the final value? Should the features be standardized before computing the Gower similarity?
Review comment: The features are currently being min-max scaled within Gower unless scale=False.
|
||||||
s(x_k, y_k) = 0, if x_k or y_k are missing. | ||||||
|
||||||
|
||||||
The Gower formula combines a Manhattan (L1) distance for numeric features | ||||||
with Hamming distance for categorical features to obtain a general coefficient | ||||||
for categorical and numeric data. | ||||||
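
The per-feature rules above can be sketched for a single pair of observations as follows. This is an illustrative sketch, not the code in this PR: `gower_pair`, `is_missing`, and the `ranges` argument are hypothetical names, and it assumes a feature is skipped when either value is missing (which is how the implementation in this diff counts non-missing columns). Since the function returns dissimilarities directly, the corresponding similarity would be one minus the result when numeric features are range-scaled.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (assumed encoding of missing)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def gower_pair(x, y, is_categorical, ranges):
    """Gower dissimilarity between two observations (illustrative only).

    ranges[k] is the max - min range of numeric feature k; it is ignored
    for categorical features.
    """
    total = 0.0
    compared = 0
    for xk, yk, categorical, rng in zip(x, y, is_categorical, ranges):
        if is_missing(xk) or is_missing(yk):
            continue  # assumption: skip the feature if either value is missing
        if categorical:
            total += 0.0 if xk == yk else 1.0  # Hamming-style mismatch
        else:
            total += abs(xk - yk) / rng  # range-scaled L1 difference
        compared += 1
    return total / compared if compared else math.nan


x = [1.0, "a", 5.0]
y = [3.0, "b", None]
d = gower_pair(x, y, [False, True, False], [4.0, None, 10.0])
print(d)  # (|1 - 3| / 4 + 1) / 2 = 0.75
```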
|
||||||
jnothman marked this conversation as resolved.
.. topic:: References: | ||||||
|
||||||
* Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its | ||||||
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871. | ||||||
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf | ||||||
|
||||||
.. _linear_kernel: | ||||||
|
||||||
Linear kernel | ||||||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -31,7 +31,7 @@ | |||||
|
||||||
from ._pairwise_fast import _chi2_kernel_fast, _sparse_manhattan | ||||||
from ..exceptions import DataConversionWarning | ||||||
|
||||||
from ..utils.fixes import _object_dtype_isnan | ||||||
|
||||||
# Utility Functions | ||||||
def _return_float_dtype(X, Y): | ||||||
|
@@ -544,7 +544,7 @@ def pairwise_distances_argmin_min(X, Y, axis=1, metric="euclidean", | |||||
Valid values for metric are: | ||||||
|
||||||
- from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', | ||||||
'manhattan'] | ||||||
'manhattan', 'gower'] | ||||||
marcelobeckmann marked this conversation as resolved.
|
||||||
- from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', | ||||||
'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', | ||||||
|
@@ -632,7 +632,7 @@ def pairwise_distances_argmin(X, Y, axis=1, metric="euclidean", | |||||
Valid values for metric are: | ||||||
|
||||||
- from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', | ||||||
'manhattan'] | ||||||
'manhattan', 'gower'] | ||||||
|
||||||
- from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', | ||||||
'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', | ||||||
|
@@ -829,6 +829,232 @@ def cosine_distances(X, Y=None): | |||||
return S | ||||||
|
||||||
|
||||||
def gower_distances(X, Y=None, categorical_features=None, scale=True): | ||||||
"""Compute the distances between the observations in X and Y, | ||||||
that may contain mixed types of data, using an implementation | ||||||
of Gower formula. | ||||||
|
||||||
marcelobeckmann marked this conversation as resolved.
Review comment: Please add "Read more in the :ref:
Parameters | ||||||
---------- | ||||||
X : array-like, or pandas.DataFrame, shape (n_samples, n_features) | ||||||
|
||||||
Y : array-like, or pandas.DataFrame, optional, | ||||||
shape (n_samples, n_features) | ||||||
|
||||||
categorical_features : array-like, optional, shape (n_features) | ||||||
Indicates with True/False whether a column is a categorical attribute. | ||||||
This is useful when categorical attributes are represented as integer | ||||||
values. Categorical ordinal attributes are treated as numeric, and | ||||||
must be marked as false. | ||||||
|
||||||
Alternatively, the categorical_features array can be represented only | ||||||
with the numerical indexes of the categorical attributes. | ||||||
Review comment: Should we also support categorical_features being a callable, as we do in ColumnTransformer?
|
||||||
Review comment: Default behaviour for categorical_features is not described.
If the categorical_features array is not provided, by default all | ||||||
non-numeric columns are considered categorical. | ||||||
Review comment: Note that behaviour is undefined if columns mix numeric and non-numeric values.
|
||||||
scale : boolean, list or array, optional (default=True) | ||||||
Indicates if the numerical columns will be scaled between 0 and 1. | ||||||
If false, it is assumed the numerical columns are already scaled. | ||||||
If a list or array, it must contain the ranges of values from | ||||||
numerical columns. | ||||||
|
||||||
Returns | ||||||
------- | ||||||
similarities : ndarray, shape (n_samples_X, n_samples_Y) | ||||||
Review comment: distances?
|
||||||
References | ||||||
---------- | ||||||
Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its | ||||||
Properties. | ||||||
|
||||||
Notes | ||||||
adrinjalali marked this conversation as resolved.
----- | ||||||
The numeric feature ranges are determined from both X and Y. | ||||||
|
||||||
Current implementation does not support sparse matrices. | ||||||
|
||||||
All the non-numerical types (e.g., str), are treated as categorical | ||||||
features. | ||||||
|
||||||
This implementation modifies Gower's original similarity measure in | ||||||
the following aspects: | ||||||
|
||||||
* The values in the original similarity S range between 0 and 1. To | ||||||
guarantee this, it is assumed the numerical features of X and Y are | ||||||
scaled between 0 and 1. | ||||||
|
||||||
* Different from the original similarity S, this implementation | ||||||
returns 1-S. | ||||||
""" | ||||||
if issparse(X) or issparse(Y): | ||||||
raise TypeError("Gower distance does not support sparse matrices") | ||||||
|
||||||
if not isinstance(scale, (bool, list, np.ndarray)): | ||||||
raise TypeError("Parameter scale must be boolean, list, or ndarray") | ||||||
|
||||||
if X is None or len(X) == 0: | ||||||
raise ValueError("X can not be None or empty") | ||||||
|
||||||
# It is necessary to convert to ndarray in advance to define the dtype | ||||||
# as np.object, otherwise numeric columns will be converted to string | ||||||
# if there are other string columns. | ||||||
if not isinstance(X, np.ndarray): | ||||||
X = np.asarray(X, dtype=np.object) | ||||||
|
||||||
if Y is not None and not isinstance(Y, np.ndarray): | ||||||
Y = np.asarray(Y, dtype=np.object) | ||||||
|
||||||
X, Y = check_pairwise_arrays(X, Y, precomputed=False, dtype=X.dtype, | ||||||
force_all_finite=False) | ||||||
|
||||||
X = np.asarray(X, dtype=np.object) | ||||||
|
||||||
cat_mask = _detect_categorical_features(X, categorical_features) | ||||||
num_mask = ~ cat_mask | ||||||
|
||||||
# Calculate the min and max values and, if requested, scale the | ||||||
# input values in order to obtain distances between 0 and 1, | ||||||
# as proposed in Gower's paper. | ||||||
ranges = 1 | ||||||
if np.any(num_mask): | ||||||
process_scale = False | ||||||
if isinstance(scale, bool): | ||||||
process_scale = scale | ||||||
else: | ||||||
if len(np.asarray(scale).flatten()) != X[:, num_mask].shape[1]: | ||||||
raise ValueError("Length of scale parameter must be equal " | ||||||
"to the number of numerical columns.") | ||||||
process_scale = True | ||||||
|
||||||
ranges, min, max = _precompute_gower_params(X, Y, scale, num_mask) | ||||||
|
||||||
# avoid division by zero when all values in the column are the same | ||||||
ranges[ranges == 0] = 1 | ||||||
|
||||||
# check if the data is pre-scaled when scale=False | ||||||
if not process_scale and (np.min(min) < 0 or np.max(max) > 1): | ||||||
raise ValueError("Input data is not scaled between 0 and 1.") | ||||||
|
||||||
D = np.zeros((X.shape[0], Y.shape[0]), dtype=np.float) | ||||||
|
||||||
for i in range(X.shape[0]): | ||||||
j_start = i | ||||||
|
||||||
# For non square results | ||||||
if X.shape[0] != Y.shape[0] or X is not Y: | ||||||
j_start = 0 | ||||||
|
||||||
# Make the comparison for np.nan for arrays with dtype=np.object; | ||||||
# this is necessary as some deployments return True for | ||||||
# np.nan == np.nan | ||||||
cat_nan_cols = (_object_dtype_isnan(X[i, cat_mask]) | | ||||||
_object_dtype_isnan(Y[j_start:, cat_mask])) | ||||||
|
||||||
# Calculates the similarities for categorical columns | ||||||
cat_dists = ((X[i, cat_mask] != Y[j_start:, cat_mask]) | cat_nan_cols) | ||||||
# Calculates the Manhattan distances for numerical columns | ||||||
num_dists = abs(X[i, num_mask] - | ||||||
Y[j_start:, num_mask]) / ranges | ||||||
|
||||||
# Calculates the number of non missing columns | ||||||
marcelobeckmann marked this conversation as resolved.
non_missing = X.shape[1] - (cat_nan_cols.sum(axis=1) + | ||||||
_object_dtype_isnan(num_dists).sum(axis=1) | ||||||
.astype(np.float32)) | ||||||
|
||||||
# This is to avoid ZeroDivisionError | ||||||
non_missing[non_missing == 0] = np.nan | ||||||
|
||||||
# Gets the final results | ||||||
total = np.sum(cat_dists, axis=1) + np.sum(num_dists, axis=1) | ||||||
|
||||||
results = total / non_missing | ||||||
|
||||||
D[i, j_start:] = results | ||||||
if X is Y: | ||||||
D[i:, j_start] = results | ||||||
|
||||||
return D | ||||||
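
The per-row step in the loop above (scaled L1 differences plus categorical mismatches, divided by the number of compared features) can also be written against pre-split arrays. This is a minimal sketch under stated assumptions, not the PR's implementation: `gower_row` is a hypothetical helper, numeric values are floats with NaN for missing, and categorical values are assumed to be integer codes with `-1` for missing. Unlike the diff, missing entries here contribute nothing to the numerator.

```python
import numpy as np


def gower_row(x_num, Y_num, x_cat, Y_cat, ranges):
    """Gower distances from one observation to every row of Y (sketch).

    x_num: (n_num,) floats, NaN = missing; Y_num: (n_samples, n_num).
    x_cat: (n_cat,) int codes, -1 = missing (assumed encoding);
    Y_cat: (n_samples, n_cat).
    ranges: (n_num,) non-zero range of each numeric feature.
    """
    num = np.abs(x_num - Y_num) / ranges      # NaN where either side is missing
    num_missing = np.isnan(num)
    cat_missing = (x_cat < 0) | (Y_cat < 0)
    cat = (x_cat != Y_cat) & ~cat_missing     # mismatch counts as 1

    total = np.nansum(num, axis=1) + cat.sum(axis=1)
    compared = (~num_missing).sum(axis=1) + (~cat_missing).sum(axis=1)

    # Rows with nothing to compare get NaN rather than a division error.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(compared > 0, total / compared, np.nan)


x_num = np.array([0.25, np.nan])
Y_num = np.array([[0.75, 0.1], [0.25, 0.5]])
x_cat = np.array([1])
Y_cat = np.array([[1], [2]])
dists = gower_row(x_num, Y_num, x_cat, Y_cat, ranges=np.array([1.0, 1.0]))
print(dists)
```

In this example the second numeric feature is missing in `x_num`, so each row is compared on one numeric and one categorical feature only.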
|
||||||
|
||||||
def _detect_categorical_features(X, categorical_features=None): | ||||||
"""Identifies the numerical and non-numerical (categorical) columns | ||||||
of an array. | ||||||
|
||||||
Parameters | ||||||
---------- | ||||||
X : array-like, or pandas.DataFrame, shape (n_samples, n_features) | ||||||
|
||||||
categorical_features : array-like, optional, shape (n_features) | ||||||
Indicates with True/False whether a column is a categorical attribute. | ||||||
|
||||||
Alternatively, the categorical_features array can be represented only | ||||||
with the numerical indexes of the categorical attribtes. | ||||||
Review comment: *attributes
Review comment: Unresolved
|
||||||
If the categorical_features array is None, they will be automatically | ||||||
Review comment: This needs to be specified in a docstring of a public function, not a private one. And it needs to give more detail about the automatic detection. As it is, I'm not comfortable about this automatic detection stuff, unless the column contains strings or pd.Categorical.
detected in X. Numerical columns are identified as a subtype of | ||||||
np.number, whilst categorical columns are not a subtype of np.number. | ||||||
|
||||||
Returns | ||||||
------- | ||||||
categorical_features_mask : ndarray, shape (n_features) | ||||||
|
||||||
""" | ||||||
# Automatic detection of categorical features | ||||||
if categorical_features is None: | ||||||
categorical_features = np.zeros(np.shape(X)[1], dtype=bool) | ||||||
|
||||||
def detect_cat(x): | ||||||
Review comment: Insert a blank line before this so that tests run, please.
if not np.isnan(x): | ||||||
if np.issubdtype(type(x), np.number): | ||||||
raise ValueError(False) | ||||||
Review comment: This looks like a very unconventional way of providing control flow and passing values around. Why are we using exceptions rather than return values here?
Review comment: Okay. I see that we're applying pyufunc to check each element individually, and using exceptions to abort as soon as we have a non-NaN. This logic is very unclear from your code, and I see no benefit in doing it this way rather than an explicit Python loop over elements, or something more functional-style:
    non_nan_values = itertools.dropwhile(np.isnan, X[:, col])
    try:
        value = next(non_nan_values)
    except StopIteration:
        pass  # TODO: handle case when all values are NaN
    # TODO: determine type from value
else: | ||||||
raise ValueError(True) | ||||||
|
||||||
f_test = np.frompyfunc(detect_cat, 1, 1) | ||||||
for col in range(np.shape(X)[1]): | ||||||
try: | ||||||
# This identifies categorical and numerical columns, | ||||||
# A TypeError or ValueError(True) means it is a categorical | ||||||
# column. | ||||||
|
||||||
# This test was disabled because some deployments are returning | ||||||
# nan instead of 0 in columns with nan values: | ||||||
# if np.nansum(X[:, col]) > 0: | ||||||
f_test(X[:, col]) | ||||||
except ValueError as e: | ||||||
categorical_features[col] = e.args[0] | ||||||
except TypeError: | ||||||
categorical_features[col] = True | ||||||
else: | ||||||
categorical_features = np.asarray(categorical_features) | ||||||
if np.issubdtype(categorical_features.dtype, np.integer): | ||||||
new_categorical_features = np.zeros(np.shape(X)[1], dtype=bool) | ||||||
new_categorical_features[categorical_features] = True | ||||||
categorical_features = new_categorical_features | ||||||
return categorical_features | ||||||
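
The automatic detection could also be expressed with an explicit loop and return values, along the lines the reviewer suggests. The following is a hedged sketch rather than a drop-in replacement: `detect_categorical_columns` and its helpers are hypothetical names, only float NaN is treated as missing, booleans are treated as categorical (mirroring the `np.issubdtype(type(x), np.number)` test above), and an all-NaN column is assumed to be numeric.

```python
import numbers

import numpy as np


def _is_nan(value):
    """True only for float NaN; strings and other objects are never NaN."""
    return isinstance(value, (float, np.floating)) and bool(np.isnan(value))


def _is_numeric(value):
    """Numbers count as numeric, except booleans (treated as categorical)."""
    return (isinstance(value, numbers.Number)
            and not isinstance(value, (bool, np.bool_)))


def detect_categorical_columns(X):
    """Return a boolean mask that is True for categorical columns (sketch)."""
    X = np.asarray(X, dtype=object)
    mask = np.zeros(X.shape[1], dtype=bool)
    for col in range(X.shape[1]):
        # The first non-NaN value decides the column type.
        first = next((v for v in X[:, col] if not _is_nan(v)), None)
        # Assumption: a column with no usable value is treated as numeric.
        mask[col] = first is not None and not _is_numeric(first)
    return mask


X = [[1.0, "a", True, float("nan")],
     [float("nan"), "b", False, float("nan")]]
mask = detect_categorical_columns(X)
print(mask)  # [False  True  True False]
```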
|
||||||
|
||||||
def _precompute_gower_params(X, Y, scale, num_mask): | ||||||
"""Precompute data-derived metric parameters for gower distances | ||||||
""" | ||||||
X_num = X[:, num_mask].astype(np.float32) | ||||||
min = np.nanmin(X_num, axis=0) | ||||||
max = np.nanmax(X_num, axis=0) | ||||||
|
||||||
if X is not Y and Y is not None: | ||||||
Y_num = Y[:, num_mask].astype(np.float32) | ||||||
min = np.minimum(np.nanmin(Y_num, axis=0), min) | ||||||
max = np.maximum(np.nanmax(Y_num, axis=0), max) | ||||||
|
||||||
if scale is None or type(scale) is bool: | ||||||
scale = np.abs(max - min) | ||||||
elif isinstance(scale, list): | ||||||
scale = np.asarray(scale) | ||||||
|
||||||
return scale, min, max | ||||||
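
To illustrate what the helper above derives when `scale` is a boolean, here is a small sketch of computing per-column ranges over X and Y together while ignoring NaN, followed by the zero-range guard that `gower_distances` applies afterwards. `numeric_ranges` is a hypothetical name; the sketch assumes float input and that every column has at least one non-NaN value.

```python
import numpy as np


def numeric_ranges(X_num, Y_num=None):
    """Per-column max - min over X (and Y, if given), ignoring NaN (sketch)."""
    stacked = X_num if Y_num is None else np.vstack([X_num, Y_num])
    ranges = np.nanmax(stacked, axis=0) - np.nanmin(stacked, axis=0)
    # A constant column has range 0; use 1 to avoid dividing by zero later.
    ranges[ranges == 0] = 1.0
    return ranges


X_num = np.array([[1.0, 10.0], [3.0, np.nan]])
Y_num = np.array([[5.0, 10.0]])
ranges = numeric_ranges(X_num, Y_num)
print(ranges)  # [4. 1.]
```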
|
||||||
|
||||||
# Paired distances | ||||||
def paired_euclidean_distances(X, Y): | ||||||
""" | ||||||
|
@@ -905,7 +1131,7 @@ def paired_cosine_distances(X, Y): | |||||
'l2': paired_euclidean_distances, | ||||||
'l1': paired_manhattan_distances, | ||||||
'manhattan': paired_manhattan_distances, | ||||||
'cityblock': paired_manhattan_distances} | ||||||
'cityblock': paired_manhattan_distances, } | ||||||
|
||||||
|
||||||
def paired_distances(X, Y, metric="euclidean", **kwds): | ||||||
|
@@ -1298,6 +1524,7 @@ def chi2_kernel(X, Y=None, gamma=1.): | |||||
'l2': euclidean_distances, | ||||||
'l1': manhattan_distances, | ||||||
'manhattan': manhattan_distances, | ||||||
'gower': gower_distances, | ||||||
'precomputed': None, # HACK: precomputed is always allowed, never called | ||||||
'nan_euclidean': nan_euclidean_distances, | ||||||
} | ||||||
|
@@ -1322,6 +1549,7 @@ def distance_metrics(): | |||||
'l1' metrics.pairwise.manhattan_distances | ||||||
'l2' metrics.pairwise.euclidean_distances | ||||||
'manhattan' metrics.pairwise.manhattan_distances | ||||||
'gower' metrics.pairwise.gower_distances | ||||||
Review comment: @marcelobeckmann you have add
Review comment: Thanks @darena-mdsol, I'll have a look in the PAIRWISE_DISTANCE_COLLECTION. About the weights: that was my misunderstanding of the original Gower formula. Weights are not necessary and won't be added in the future.
Review comment: @marcelobeckmann Thanks for your reply. I'm on a project that allows a user to assign weight/importance to each variable. My team has been discussing whether weight/importance is useful or not, in light of your reply. Could you explain the misunderstanding that you mentioned? Thanks.
Review comment: Hi @darena-mdsol, the purpose of this PR is to implement the default Gower distance as described in section 1 of the original paper. If the weighted implementation described in section 4 needs to be implemented, then a new ticket needs to be opened; I won't do that in this PR. My misunderstanding was thinking that the formulas in section 4 were the main proposal of that paper.
'nan_euclidean' metrics.pairwise.nan_euclidean_distances | ||||||
=============== ======================================== | ||||||
|
||||||
|
@@ -1400,7 +1628,7 @@ def _pairwise_callable(X, Y, metric, force_all_finite=True, **kwds): | |||||
'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', | ||||||
'russellrao', 'seuclidean', 'sokalmichener', | ||||||
'sokalsneath', 'sqeuclidean', 'yule', "wminkowski", | ||||||
'nan_euclidean', 'haversine'] | ||||||
'nan_euclidean', 'haversine', 'gower'] | ||||||
|
||||||
_NAN_METRICS = ['nan_euclidean'] | ||||||
|
||||||
|
@@ -1429,6 +1657,19 @@ def _check_chunk_size(reduced, chunk_size): | |||||
def _precompute_metric_params(X, Y, metric=None, **kwds): | ||||||
"""Precompute data-derived metric parameters if not provided | ||||||
""" | ||||||
if metric == 'gower': | ||||||
categorical_features = None | ||||||
if 'categorical_features' in kwds: | ||||||
categorical_features = kwds['categorical_features'] | ||||||
|
||||||
num_mask = ~ _detect_categorical_features(X, categorical_features) | ||||||
Review comment: Is there benefit to determining categorical features from both X and Y?
|
||||||
scale = None | ||||||
marcelobeckmann marked this conversation as resolved.
if 'scale' in kwds: | ||||||
scale = kwds['scale'] | ||||||
scale, _, _ = _precompute_gower_params(X, Y, scale, num_mask) | ||||||
|
||||||
return {'scale': scale} | ||||||
Review comment: shouldn't we also return the determined
if metric == "seuclidean" and 'V' not in kwds: | ||||||
if X is Y: | ||||||
V = np.var(X, axis=0, ddof=1) | ||||||
|
@@ -1721,6 +1962,15 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=None, | |||||
check_non_negative(X, whom=whom) | ||||||
return X | ||||||
elif metric in PAIRWISE_DISTANCE_FUNCTIONS: | ||||||
if metric == 'gower': | ||||||
# These conversions are necessary for matrices with string values | ||||||
if not isinstance(X, np.ndarray): | ||||||
X = np.asarray(X, dtype=np.object) | ||||||
if Y is not None and not isinstance(Y, np.ndarray): | ||||||
Y = np.asarray(Y, dtype=np.object) | ||||||
params = _precompute_metric_params(X, Y, metric=metric, **kwds) | ||||||
kwds.update(**params) | ||||||
|
||||||
func = PAIRWISE_DISTANCE_FUNCTIONS[metric] | ||||||
elif callable(metric): | ||||||
func = partial(_pairwise_callable, metric=metric, | ||||||
|
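
One plausible motivation for precomputing the Gower scale once in `pairwise_distances`, before the metric function is called, is that the work may be split into slices of the data, and ranges derived from a slice can differ from ranges derived from the full input. The diff does not state this reason, so treat it as an assumption. The toy example below only demonstrates the effect with plain NumPy; it does not call scikit-learn.

```python
import numpy as np

# One numeric feature, three observations.
X = np.array([[0.0], [2.0], [10.0]])

# Range computed from the full data versus from a slice of it.
global_range = X.max(axis=0) - X.min(axis=0)
chunk = X[:2]
chunk_range = chunk.max(axis=0) - chunk.min(axis=0)

# The same pair of observations gets a different scaled distance
# depending on which range is used.
d_global = np.abs(X[0] - X[1]) / global_range
d_chunk = np.abs(chunk[0] - chunk[1]) / chunk_range
print(d_global, d_chunk)  # [0.2] [1.]
```

Passing the globally computed ranges to every slice keeps the scaled distances consistent across slices.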