
[MRG] Add preprocessor option #117


Merged
Changes from all commits
Commits
122 commits
7b3d739
WIP create MahalanobisMixin
May 25, 2018
f21cc85
ENH Update algorithms with Mahalanobis Mixin:
May 25, 2018
6f8a115
Merge branch 'new_api_design' into feat/mahalanobis_class
Jun 11, 2018
f9e3c82
FIX: add missing import
Jun 11, 2018
1a32c11
FIX: update sklearn's function check_no_fit_attributes_set_in_init to…
Jun 11, 2018
d0f5019
FIX: take function ``_get_args`` from scikit-learn's PR https://githu…
Jun 11, 2018
eba2a60
ENH: add transformer_ attribute and improve docstring
Jun 14, 2018
b5d966f
WIP: move transform() in BaseMetricLearner to transformer_from_metric…
Jun 18, 2018
ee0d1bd
WIP: refactor metric to original formulation: a function, with result…
Jun 18, 2018
6b5a3b5
WIP: make all Mahalanobis Metric Learner algorithms have transformer_…
Jun 19, 2018
6eb65ac
ENH Add score_pairs function
Jun 25, 2018
35ece36
TST add test on toy example for score_pairs
Jun 26, 2018
dca6838
ENH Add embed function
Jun 27, 2018
3254ce3
FIX fix error in slicing of quadruplets
Jun 27, 2018
e209b21
FIX minor corrections
Jun 27, 2018
abea7de
FIX minor corrections
Jun 27, 2018
65e794a
FIX fix PEP8 errors
Jun 27, 2018
12b5429
FIX remove possible one-sample scoring from docstring for now
Jun 27, 2018
eff278e
REF rename n_features_out to num_dims to be more coherent with curren…
Jun 27, 2018
810d191
MAINT: Adress https://github.com/metric-learn/metric-learn/pull/96#pu…
Jul 24, 2018
585b5d2
ENH: Add check_tuples
Jul 24, 2018
af0a3ac
FIX: fix parenthesis
Jul 24, 2018
f1dd4c2
ENH: First commit adding a preprocessor
Aug 21, 2018
47fbf46
ENH: Improve check_tuples with more comments and deal better with ens…
Aug 23, 2018
8eb419a
STY: remove unexpected spaces
Aug 23, 2018
033da60
FIX: Raise more appropriate error message
Aug 23, 2018
6fae262
FIX: fix string formatting and refactor name and context to use conte…
Aug 23, 2018
758d4cc
FIX: only allow 2D if preprocessor and 3D if not preprocessor
Aug 24, 2018
deb6d5d
FIX: put format arguments in the right order
Aug 24, 2018
01ee081
MAINT: better to say the preprocessor than a preprocessor in messages
Aug 24, 2018
92f9651
FIX: numeric should be default if NO preprocessor
Aug 24, 2018
4c41c37
FIX: preprocessor argument has to be a boolean (before this change wa…
Aug 28, 2018
4b7e89b
FIX: fix preprocessor argument passed in check_tuples in base_metric
Aug 28, 2018
609b80e
MAINT: say a preprocessor rather than the preprocessor
Aug 31, 2018
764e2f1
Merge branch 'new_api_design' into feat/preprocessor
Sep 4, 2018
4342660
DOC: fix docstring of t in check_tuples
Sep 4, 2018
e50cbae
MAINT: make error messages better by only printing presence of prepro…
Sep 6, 2018
56838b4
TST: Add tests for check_tuples
Sep 6, 2018
9f05c24
TST: simplify tests by removing the test for messages with the estima…
Sep 6, 2018
12ce8ac
STY: remove unnecessary parenthesis
Sep 6, 2018
62a989c
FIX: put back else statement that probably was wrongfully merged
Sep 6, 2018
33c5d8b
TST: add tests for weakly supervised estimators and preprocessor that…
Sep 6, 2018
5eba5fa
TST: add tests for preprocessor
Sep 7, 2018
42b34e0
FIX: remove redundant metric and transformer function, wrongly merged
Sep 13, 2018
a380cd3
MAINT: rename format_input into preprocess_tuples and input into tuples
Sep 13, 2018
54d1710
MAINT: fixes and enhancements
Sep 13, 2018
3b716f0
MAINT: mutualize check_tuples
Sep 13, 2018
bac835e
MAINT: refactor SimplePreprocessor into ArrayIndexer
Sep 14, 2018
2b0f495
MAINT: improve check_tuples and tests
Sep 21, 2018
6ae7ba5
TST: add random seed for _Supervised classes
Sep 24, 2018
a1c8a67
TST: Adapt test pipeline
Sep 24, 2018
735f975
TST: fix test_progress_message_preprocessor_tuples by making func ret…
Sep 24, 2018
3586208
Remove deprecated cross_validation import and put model_selection ins…
Oct 8, 2018
51d7e07
WIP replace checks by unique check_input function
Oct 15, 2018
27e215c
Fixes some tests:
Oct 16, 2018
e23554f
TST: Cherry pick from new sklearn version ac0e230000556b7c413e08b77d8…
Oct 16, 2018
9ded846
FIX: get changes from master to pass test iris for NCA
Oct 16, 2018
50514bc
FIX fix tests that were failing due to the error message
Oct 16, 2018
96b58b4
TST: fix test_check_input_invalid_t that changed since we test t at t…
Oct 16, 2018
9cab2ee
TST fix NCA's iris test taking code from master
Oct 16, 2018
069b8e2
FIX fix tests:
Oct 16, 2018
e8d8795
FIX fix previous modification that removed self.X_ but was modifying …
Oct 16, 2018
7c539c7
FIX ensure at least 2d only for checking the metric because after che…
Oct 16, 2018
f801fae
STY: Fix PEP8 violations
Nov 9, 2018
b38b223
MAINT: Refactor error messages with the help of numerical codes
Nov 12, 2018
cc6d661
MAINT: mutualize check_preprocessor and check_input for every estimator
Nov 12, 2018
192a042
FIX: remove format_map for python2.7 compatibility
Nov 12, 2018
0328941
DOC: Add docstring for check_input and fix some bugs
Nov 12, 2018
00078c2
DOC: add docstrings
Nov 12, 2018
1ded46a
MAINT: Removing changes not related to this PR, and fixing previous p…
Nov 13, 2018
52a1aec
STY: Fix PEP8 errors
Nov 13, 2018
072e834
STY: fix indent problems
Nov 13, 2018
80929e2
Fixing docstring spaces
Nov 13, 2018
40a0172
DOC: add preprocessor docstring when missing
Nov 13, 2018
5a3af89
STY: PEP8 fixes
Nov 13, 2018
968e36e
MAINT: refactor the global check function into _prepare_input
Nov 13, 2018
e5b5f57
FIX: fix quadruplets scoring and delete useless comments
Nov 13, 2018
84c9d56
MAINT: remove some enhancements to be coherent with previous code and…
Nov 13, 2018
7605fa4
MAINT: Improve test messages
Nov 13, 2018
69de333
MAINT: reorganize tests
Nov 13, 2018
b29a555
FIX: fix typo in LMNN shogun and clean todo for the equivalent code i…
Nov 15, 2018
29be5e2
MAINT: Rename inputs and input into input_data
Nov 19, 2018
9d849f7
STY: add backticks to None
Nov 20, 2018
d9ba29e
MAINT: add more detailed comment of first checks and remove old comment
Nov 20, 2018
c1dcc1f
MAINT: improve comments for checking num_features
Nov 20, 2018
6ccd25d
MAINT: Refactor t into tuple_size
Nov 20, 2018
2d96eda
MAINT: Fix small PEP8 error
Nov 20, 2018
4b50ff1
MAINT: FIX remaining t into tuple_size and replace hasattr if None by…
Nov 20, 2018
f69f135
MAINT: remove misplaced comment
Nov 20, 2018
192d208
MAINT: Put back/add docstrings for decision_function/predict
Nov 20, 2018
2c09d9a
MAINT: remove unnecessary ellipsis and upadate docstring of decision_…
Nov 20, 2018
1b7e55f
Add comments in LMNN for arguments useful for the shogun version that…
Nov 20, 2018
8b600be
MAINT: Remove useless mock_preprocessor
Nov 20, 2018
e4468d0
MAINT: Remove useless loop
Nov 20, 2018
a1c95fa
MAINT: refactor test_dict_unchanged
Nov 20, 2018
d95b22a
MAINT: remove _get_args copied from scikit-learn and replace it by an…
Nov 20, 2018
bc06f8f
MAINT: Fragment check_input by extracting blocks into check_input_cla…
Nov 21, 2018
9260a8e
MAINT: ensure min_samples=2 for supervised learning algorithms (we sh…
Nov 21, 2018
93c3e34
ENH: Return custom error when some error is due to the preprocessor
Nov 27, 2018
f2d0cd7
MAINT: Refactor algorithms preprocessing steps
Nov 28, 2018
02e82ff
MAINT: finish the work of the previous commit
Nov 28, 2018
99206b3
TST: add test for cross-validation: comparison of manual cross-val an…
Nov 28, 2018
8ee08b8
ENH: put y=None by default in LSML for better compatibility. This als…
Nov 28, 2018
784f697
ENH: add error message when type of inputs is not some expected type
Nov 28, 2018
c46bbe1
TST: add test that checks that 'classic' is the default behaviour
Nov 28, 2018
39a7256
TST: remove unnecessary conversion to vertical vector of y
Nov 28, 2018
082bca5
FIX: remove wrong condition hasattr 'score' at top of loop
Nov 28, 2018
6abbcd6
MAINT: Add comment to explain why we return twice X for build_regress…
Nov 29, 2018
f0a1dc2
ENH: improve test for preprocessor and return error message if the gi…
Nov 29, 2018
ab5f2e3
FIX: fix wrong type_of_inputs in a test
Nov 29, 2018
5324e85
FIX: deal with the case where preprocessor is None
Nov 29, 2018
48bce7d
WIP refactor build_dataset
Dec 3, 2018
c0cc882
MAINT: refactor bool preprocessor to with_preprocessor
Dec 5, 2018
0b3e58a
FIX: fix build_pairs and build_quadruplets because 'only named argume…
Dec 5, 2018
fbd7242
STY: fix PEP8 error
Dec 5, 2018
148012e
MAINT: mututalize test_same_with_or_without_preprocessor_tuples and t…
Dec 5, 2018
8c5675b
TST: give better names in test_same_with_or_without_preprocessor
Dec 5, 2018
30061e4
MAINT: refactor list_estimators into metric_learners
Dec 5, 2018
e241c28
TST: uniformize names input_data - tuples, labels - y
Dec 5, 2018
b6d7de7
FIX: fix build_pairs and build_quadruplets
Dec 5, 2018
a44e29a
MAINT: remove forgotten code duplication
Dec 11, 2018
2db6410
MAINT: address https://github.com/metric-learn/metric-learn/pull/117#…
Dec 12, 2018
332 changes: 303 additions & 29 deletions metric_learn/_util.py
@@ -1,5 +1,8 @@
import numpy as np

import six
from sklearn.utils import check_array
from sklearn.utils.validation import check_X_y
from metric_learn.exceptions import PreprocessorError

# hack around lack of axis kwarg in older numpy versions
try:
@@ -12,39 +15,310 @@ def vector_norm(X):
return np.linalg.norm(X, axis=1)


-def check_tuples(tuples):
-"""Check that the input is a valid 3D array representing a dataset of tuples.
-
-Equivalent of `check_array` in scikit-learn.
def check_input(input_data, y=None, preprocessor=None,
type_of_inputs='classic', tuple_size=None, accept_sparse=False,
dtype='numeric', order=None,
copy=False, force_all_finite=True,
multi_output=False, ensure_min_samples=1,
ensure_min_features=1, y_numeric=False,
warn_on_dtype=False, estimator=None):
"""Checks that the input format is valid, and converts it if specified
(this is the equivalent of scikit-learn's `check_array` or `check_X_y`).
All arguments following tuple_size are scikit-learn's `check_X_y`
arguments that will be enforced on the data and labels array. If
indicators are given as an input data array, the returned data array
will be the formed points/tuples, using the given preprocessor.

Parameters
----------
-tuples : object
-The tuples to check.
input_data : array-like
The input data array to check.

y : array-like
The input labels array to check.

preprocessor : callable (default=`None`)
The preprocessor to use. If None, no preprocessor is used.

type_of_inputs : `str` {'classic', 'tuples'}
The type of inputs to check. If 'classic', the input should be
a 2D array-like of points or a 1D array-like of indicators of points. If
'tuples', the input should be a 3D array-like of tuples or a 2D
array-like of indicators of tuples.

tuple_size : int
The number of elements in a tuple (e.g. 2 for pairs).

accept_sparse : `bool`
Set to `True` to allow sparse inputs (only works for sparse inputs with
dim < 3).

dtype : string, type, list of types or None (default='numeric')
Data type of result. If None, the dtype of the input is preserved.
If 'numeric', dtype is preserved unless array.dtype is object.
If dtype is a list of types, conversion on the first type is only
performed if the dtype of the input is not in the list.

order : 'F', 'C' or None (default=`None`)
Whether an array will be forced to be fortran or c-style.

copy : boolean (default=False)
Whether a forced copy will be triggered. If copy=False, a copy might
be triggered by a conversion.

force_all_finite : boolean or 'allow-nan', (default=True)
Whether to raise an error on np.inf and np.nan in X. This parameter
does not influence whether y can have np.inf or np.nan values.
The possibilities are:
- True: Force all values of X to be finite.
- False: accept both np.inf and np.nan in X.
- 'allow-nan': accept only np.nan values in X. Values cannot be
infinite.

ensure_min_samples : int (default=1)
Make sure that X has a minimum number of samples in its first
axis (rows for a 2D array).

ensure_min_features : int (default=1)
Make sure that the 2D array has some minimum number of features
(columns). The default value of 1 rejects empty datasets.
This check is only enforced when X has effectively 2 dimensions or
is originally 1D and ``ensure_2d`` is True. Setting to 0 disables
this check.

warn_on_dtype : boolean (default=False)
Raise DataConversionWarning if the dtype of the input data structure
does not match the requested dtype, causing a memory copy.

estimator : str or estimator instance (default=`None`)
If passed, include the name of the estimator in warning messages.

Returns
-------
-tuples_valid : object
-The validated input.
X : `numpy.ndarray`
The checked input data array.

y : `numpy.ndarray` (optional)
The checked input labels array.
"""
-# If input is scalar raise error
-if np.isscalar(tuples):
-raise ValueError(
-"Expected 3D array, got scalar instead. Cannot apply this function on "
-"scalars.")
-# If input is 1D raise error
-if len(tuples.shape) == 1:
-raise ValueError(
-"Expected 3D array, got 1D array instead:\ntuples={}.\n"
-"Reshape your data using tuples.reshape(1, -1, 1) if it contains a "
-"single tuple and the points in the tuple have a single "
-"feature.".format(tuples))
-# If input is 2D raise error
-if len(tuples.shape) == 2:
-raise ValueError(
-"Expected 3D array, got 2D array instead:\ntuples={}.\n"
-"Reshape your data either using tuples.reshape(-1, {}, 1) if "
-"your data has a single feature or tuples.reshape(1, {}, -1) "
-"if it contains a single tuple.".format(tuples, tuples.shape[1],
-tuples.shape[0]))

context = make_context(estimator)

args_for_sk_checks = dict(accept_sparse=accept_sparse,
dtype=dtype, order=order,
copy=copy, force_all_finite=force_all_finite,
ensure_min_samples=ensure_min_samples,
ensure_min_features=ensure_min_features,
warn_on_dtype=warn_on_dtype, estimator=estimator)

# We need to convert input_data into a numpy.ndarray if possible, before
# any further checks or conversions, and deal with y if needed. Therefore
# we use check_array/check_X_y with fixed permissive arguments.
if y is None:
input_data = check_array(input_data, ensure_2d=False, allow_nd=True,
copy=False, force_all_finite=False,
accept_sparse=True, dtype=None,
ensure_min_features=0, ensure_min_samples=0)
else:
input_data, y = check_X_y(input_data, y, ensure_2d=False, allow_nd=True,
copy=False, force_all_finite=False,
accept_sparse=True, dtype=None,
ensure_min_features=0, ensure_min_samples=0,
multi_output=multi_output,
y_numeric=y_numeric)

if type_of_inputs == 'classic':
input_data = check_input_classic(input_data, context, preprocessor,
args_for_sk_checks)

elif type_of_inputs == 'tuples':
input_data = check_input_tuples(input_data, context, preprocessor,
args_for_sk_checks, tuple_size)

else:
raise ValueError("Unknown value {} for type_of_inputs. Valid values are "
"'classic' or 'tuples'.".format(type_of_inputs))

return input_data if y is None else (input_data, y)
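To illustrate the 'tuples' path of `check_input` above, here is a minimal numpy-only sketch (independent of metric-learn, with made-up toy data) of how a 2D array of indicators becomes a 3D array of formed tuples — the same stacking that `preprocess_tuples` performs:

```python
import numpy as np

# Toy setup (hypothetical data): 5 points with 3 features each; the
# preprocessor maps indices to the corresponding rows of X.
X = np.arange(15.0).reshape(5, 3)
preprocessor = lambda indices: X[indices]

# A 2D array of indicators for pairs (tuple_size=2).
pairs_indicators = np.array([[0, 1], [2, 4]])

# Same stacking as preprocess_tuples: each column of indicators is
# preprocessed into points, then stacked along a new middle axis.
pairs = np.column_stack([preprocessor(pairs_indicators[:, i])[:, np.newaxis]
                         for i in range(pairs_indicators.shape[1])])
print(pairs.shape)  # (2, 2, 3): n_tuples x tuple_size x n_features
```

The resulting 3D array is what the rest of `check_input_tuples` then validates with `check_array` and `check_tuple_size`.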


def check_input_tuples(input_data, context, preprocessor, args_for_sk_checks,
tuple_size):
preprocessor_has_been_applied = False
if input_data.ndim == 2:
if preprocessor is not None:
input_data = preprocess_tuples(input_data, preprocessor)
preprocessor_has_been_applied = True
else:
make_error_input(201, input_data, context)
elif input_data.ndim == 3:
pass
else:
if preprocessor is not None:
make_error_input(420, input_data, context)
else:
make_error_input(200, input_data, context)
input_data = check_array(input_data, allow_nd=True, ensure_2d=False,
**args_for_sk_checks)
# we need to check num_features because check_array does not check it
# for 3D inputs:
if args_for_sk_checks['ensure_min_features'] > 0:
n_features = input_data.shape[2]
if n_features < args_for_sk_checks['ensure_min_features']:
raise ValueError("Found array with {} feature(s) (shape={}) while"
" a minimum of {} is required{}."
.format(n_features, input_data.shape,
args_for_sk_checks['ensure_min_features'],
context))
# tuple_size itself does not need to be re-checked after preprocessing,
# because no preprocessor should be able to modify it
if input_data.ndim != 3:
# we have to ensure this because check_array above does not
if preprocessor_has_been_applied:
make_error_input(211, input_data, context)
else:
make_error_input(201, input_data, context)
check_tuple_size(input_data, tuple_size, context)
return input_data


def check_input_classic(input_data, context, preprocessor, args_for_sk_checks):
preprocessor_has_been_applied = False
if input_data.ndim == 1:
if preprocessor is not None:
input_data = preprocess_points(input_data, preprocessor)
preprocessor_has_been_applied = True
else:
make_error_input(101, input_data, context)
elif input_data.ndim == 2:
pass # OK
else:
if preprocessor is not None:
make_error_input(320, input_data, context)
else:
make_error_input(100, input_data, context)

input_data = check_array(input_data, allow_nd=True, ensure_2d=False,
**args_for_sk_checks)
if input_data.ndim != 2:
# we have to ensure this because check_array above does not
if preprocessor_has_been_applied:
make_error_input(111, input_data, context)
else:
make_error_input(101, input_data, context)
return input_data
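For the 'classic' path, a 1D array of indicators is preprocessed into a 2D array of formed points; a numpy-only sketch with toy data (the preprocessor here is a hypothetical index-to-row mapping):

```python
import numpy as np

# Toy data: 4 points with 3 features; the preprocessor maps indices to rows.
X = np.arange(12.0).reshape(4, 3)
preprocessor = lambda indices: X[indices]

indicators = np.array([0, 3, 1])   # a 1D input triggers preprocess_points
points = preprocessor(indicators)  # what preprocess_points boils down to
print(points.shape)  # (3, 3): a valid 2D array of formed points
```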


def make_error_input(code, input_data, context):
code_str = {'expected_input': {'1': '2D array of formed points',
'2': '3D array of formed tuples',
'3': ('1D array of indicators or 2D array of '
'formed points'),
'4': ('2D array of indicators or 3D array '
'of formed tuples')},
'additional_context': {'0': '',
'2': ' when using a preprocessor',
'1': (' after the preprocessor has been '
'applied')},
'possible_preprocessor': {'0': '',
'1': ' and/or use a preprocessor'
}}
code_list = str(code)
err_args = dict(expected_input=code_str['expected_input'][code_list[0]],
additional_context=code_str['additional_context']
[code_list[1]],
possible_preprocessor=code_str['possible_preprocessor']
[code_list[2]],
input_data=input_data, context=context,
found_size=input_data.ndim)
err_msg = ('{expected_input} expected'
'{context}{additional_context}. Found {found_size}D array '
'instead:\ninput={input_data}. Reshape your data'
'{possible_preprocessor}.\n')
raise ValueError(err_msg.format(**err_args))
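The numeric codes above pack three choices into three digits (hundreds: expected input, tens: additional context, units: preprocessor hint). A small standalone sketch of the decoding, using a subset of the tables and a simplified message template:

```python
# Subset of the lookup tables from make_error_input: the hundreds digit
# selects the expected input, the tens digit the extra context, and the
# units digit whether to suggest using a preprocessor.
code_str = {'expected_input': {'1': '2D array of formed points',
                               '2': '3D array of formed tuples'},
            'additional_context': {'0': '',
                                   '1': (' after the preprocessor has been'
                                         ' applied')},
            'possible_preprocessor': {'0': '',
                                      '1': ' and/or use a preprocessor'}}
code = str(201)  # e.g. raised for a 2D input when no preprocessor is given
msg = ('{} expected{}. Reshape your data{}.'
       .format(code_str['expected_input'][code[0]],
               code_str['additional_context'][code[1]],
               code_str['possible_preprocessor'][code[2]]))
print(msg)
```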


def preprocess_tuples(tuples, preprocessor):
"""Form tuples with the given preprocessor: each column of indicators is
mapped to points and the results are stacked into a 3D array. Raises a
`PreprocessorError` if the preprocessor fails."""
try:
tuples = np.column_stack([preprocessor(tuples[:, i])[:, np.newaxis] for
i in range(tuples.shape[1])])
except Exception as e:
raise PreprocessorError(e)
return tuples


def preprocess_points(points, preprocessor):
"""Form points with the given preprocessor, raising a `PreprocessorError`
if the preprocessor fails."""
try:
points = preprocessor(points)
except Exception as e:
raise PreprocessorError(e)
return points


def make_context(estimator):
"""Helper function to create a string with the estimator name.
Taken from check_array function in scikit-learn.
Will return the following for instance:
NCA: ' by NCA'
'NCA': ' by NCA'
None: ''
"""
estimator_name = make_name(estimator)
context = (' by ' + estimator_name) if estimator_name is not None else ''
return context


def make_name(estimator):
"""Helper function that returns the name of estimator or the given string
if a string is given
"""
if estimator is not None:
if isinstance(estimator, six.string_types):
estimator_name = estimator
else:
estimator_name = estimator.__class__.__name__
else:
estimator_name = None
return estimator_name


def check_tuple_size(tuples, tuple_size, context):
"""Helper function to check that the number of points in each tuple is
equal to tuple_size (e.g. 2 for pairs), and raise a `ValueError` otherwise"""
if tuple_size is not None and tuples.shape[1] != tuple_size:
msg_t = (("Tuples of {} element(s) expected{}. Got tuples of {} "
"element(s) instead (shape={}):\ninput={}.\n")
.format(tuple_size, context, tuples.shape[1], tuples.shape,
tuples))
raise ValueError(msg_t)


class ArrayIndexer:

def __init__(self, X):
# we check the array-like preprocessor here, and we are as permissive
# as possible (because the user will check for the desired
# format with arguments in check_input, and only that function
# should return the appropriate errors). We do this only to have a numpy
# array object which can be indexed by another numpy array object.
X = check_array(X,
accept_sparse=True, dtype=None,
force_all_finite=False,
ensure_2d=False, allow_nd=True,
ensure_min_samples=0,
ensure_min_features=0,
warn_on_dtype=False, estimator=None)
self.X = X

def __call__(self, indices):
return self.X[indices]
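`ArrayIndexer` is what gets built when the `preprocessor` argument is an array-like. A minimal self-contained re-statement (the class name here is ours, and the permissive `check_array` call is omitted) with a usage sketch:

```python
import numpy as np

class SimpleArrayIndexer:
    """Minimal stand-in for ArrayIndexer above: wraps an array so that
    calling it with indices returns the matching rows."""
    def __init__(self, X):
        self.X = np.asarray(X)

    def __call__(self, indices):
        return self.X[indices]

points = np.array([[1., 2.], [3., 4.], [5., 6.]])
indexer = SimpleArrayIndexer(points)
print(indexer([0, 2]))  # rows 0 and 2 of `points`
```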


def check_collapsed_pairs(pairs):
num_ident = (vector_norm(pairs[:, 0] - pairs[:, 1]) < 1e-9).sum()
if num_ident:
raise ValueError("{} collapsed pairs found (where the left element is "
"the same as the right element), out of {} pairs "
"in total.".format(num_ident, pairs.shape[0]))
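A quick numpy-only sketch, with toy data, of what `check_collapsed_pairs` guards against:

```python
import numpy as np

# One collapsed pair (left element identical to the right) and one valid pair.
pairs = np.array([[[1., 2.], [1., 2.]],
                  [[1., 2.], [3., 4.]]])
# Same test as check_collapsed_pairs: near-zero distance between the
# left and right element of each pair.
num_ident = (np.linalg.norm(pairs[:, 0] - pairs[:, 1], axis=1) < 1e-9).sum()
print(num_ident)  # 1 collapsed pair found
```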