Skip to content

ENH add pairwise distances backend for LocalOutlierFactor #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

OmarManzoor
Copy link
Owner

Reference Issues/PRs

What does this implement/fix? Explain your changes.

  • PR for review and guidance.

Any other comments?

CC: @jjerphan Could you kindly have a look at this because I might need some guidance on how to access the indices and distances arrays properly

@OmarManzoor OmarManzoor force-pushed the pairwise_distance_backend_for_lof branch from 4c1d801 to fed7db5 Compare April 3, 2023 10:44
@OmarManzoor
Copy link
Owner Author

@jjerphan I think the part that I might need guidance with is that we actually want the indices and distances computed completely as dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1] we actually need the distances associated with the neighbors as well.

@jjerphan
Copy link

jjerphan commented Apr 6, 2023

Those distances are stored in the heaps, so after the execution, it should be possible to access the distances (here self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]).

Those are technical details that I should probably better document. I think I need some more time.

@OmarManzoor
Copy link
Owner Author

Those distances are stored in the heaps, so after the execution, it should be possible to access the distances (here self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]).

Those are technical details that I should probably better document. I think I need some more time.

I understand. However I think the point where I added the code is not correct as _parallel_on_X_prange_iter_finalize will not contain the complete X. I think we might need to define this in _finalize where the computation is complete?

@OmarManzoor
Copy link
Owner Author

Those distances are stored in the heaps, so after the execution, it should be possible to access the distances (here self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]).
Those are technical details that I should probably better document. I think I need some more time.

I understand. However I think the point where I added the code is not correct as _parallel_on_X_prange_iter_finalize will not contain the complete X. I think we might need to define this in _finalize where the computation is complete?

@jjerphan Does it seem correct to include the implementation in the _finalize_results method?

@jjerphan
Copy link

@jjerphan Does it seem correct to include the implementation in the _finalize_results method?

Thanks for the heads-up. I think we can first try implementing the logic in _finalize_results as you suggest.

@OmarManzoor OmarManzoor force-pushed the pairwise_distance_backend_for_lof branch 2 times, most recently from a0f5945 to 9d66cd7 Compare April 26, 2023 12:09
@OmarManzoor
Copy link
Owner Author

@jjerphan I have made changes to accommodate the fit method with the dedicated backend. However there seems to be a complication in the score samples method now. The local reachability density needs the fit distances of the training data X.
dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]

But now since we use a dedicated backend we are not storing these distances anymore and as a result they are not available in the score_samples method.

@OmarManzoor
Copy link
Owner Author

OmarManzoor commented Apr 26, 2023

@jjerphan I have made changes to accommodate the fit method with the dedicated backend. However there seems to be a complication in the score samples method now. The local reachability density needs the fit distances of the training data X. dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]

But now since we use a dedicated backend we are not storing these distances anymore and as a result they are not available in the score_samples method.

I would be grateful if you could provide any suggestions. I was thinking about storing the distances in the call to fit but then we would still need these distances when computing score_samples. That would mean the fit distances would need to be available in the ArgKminLRD backend but this kind of creates a difference between how we are using it in fit and score_samples.

@OmarManzoor
Copy link
Owner Author

OmarManzoor commented Apr 26, 2023

@jjerphan I have made changes to accommodate the fit method with the dedicated backend. However there seems to be a complication in the score samples method now. The local reachability density needs the fit distances of the training data X. dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]
But now since we use a dedicated backend we are not storing these distances anymore and as a result they are not available in the score_samples method.

I would be grateful if you could provide any suggestions. I was thinking about storing the distances in the call to fit but then we would still need these distances when computing score_samples. That would mean the fit distances would need to be available in the ArgKminLRD backend but this kind of creates a difference between how we are using it in fit and score_samples.

I just realized we also need self._lrd in score_samples
lrd_ratios_array = self._lrd[neighbors_indices_X] / X_lrd[:, np.newaxis]

distances_X, neighbors_indices_X = self.kneighbors(
X, n_neighbors=self.n_neighbors_
)
if X.dtype == np.float32:
distances_X = distances_X.astype(X.dtype, copy=False)
X_lrd = self._local_reachability_density(
distances_X,
neighbors_indices_X,
)
lrd_ratios_array = self._lrd[neighbors_indices_X] / X_lrd[:, np.newaxis]
# as bigger is better:
return -np.mean(lrd_ratios_array, axis=1)

This means that we need to return the lrd from the backend in any case. So we can't compute the negative outlier factor in the backend. But then we would also need to return neighbors_indices_fit_X as they are used to index into self._lrd to calculate lrd_ratios_array in the fit method

self._distances_fit_X_, _neighbors_indices_fit_X_ = self.kneighbors(
n_neighbors=self.n_neighbors_
)
if self._fit_X.dtype == np.float32:
self._distances_fit_X_ = self._distances_fit_X_.astype(
self._fit_X.dtype,
copy=False,
)
self._lrd = self._local_reachability_density(
self._distances_fit_X_, _neighbors_indices_fit_X_
)
# Compute lof score over training samples to define offset_:
lrd_ratios_array = (
self._lrd[_neighbors_indices_fit_X_] / self._lrd[:, np.newaxis]
)

Looking at this I think we might need to return the neighbor_sample_indices along with lrd.

@OmarManzoor OmarManzoor force-pushed the pairwise_distance_backend_for_lof branch from 9d66cd7 to 1c5ddc0 Compare April 26, 2023 14:28
@OmarManzoor OmarManzoor force-pushed the pairwise_distance_backend_for_lof branch from 1c5ddc0 to efeab7a Compare May 2, 2023 09:08
@jjerphan
Copy link

Hi @OmarManzoor, must scikit-learn#26316 be reviewed before this PR? I trying to understand why this branch targets the main branch of your fork.

@OmarManzoor
Copy link
Owner Author

Hi @OmarManzoor, must scikit-learn#26316 be reviewed before this PR? I trying to understand why this branch targets the main branch of your fork.

Closed. This was just for discussion before opening the actual PR on scikit-learn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants