ENH add pairwise distances backend for LocalOutlierFactor #2

OmarManzoor · 2023-04-03T10:29:39Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

PR for review and guidance.

Any other comments?

CC: @jjerphan Could you kindly have a look at this because I might need some guidance on how to access the indices and distances arrays properly

OmarManzoor · 2023-04-03T11:04:27Z

@jjerphan I think the part that I might need guidance with is that we actually want the indices and distances computed completely as dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1] we actually need the distances associated with the neighbors as well.

jjerphan · 2023-04-06T13:20:23Z

Those distances are stored in the heaps, so after the execution, it should be possible to access the distances (here self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]).

Those are technical details that I should probably better document. I think I need some more time.

OmarManzoor · 2023-04-06T14:36:22Z

Those distances are stored in the heaps, so after the execution, it should be possible to access the distances (here self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]).

Those are technical details that I should probably better document. I think I need some more time.

I understand. However I think the point where I added the code is not correct as _parallel_on_X_prange_iter_finalize will not contain the complete X. I think we might need to define this in _finalize where the computation is complete?

OmarManzoor · 2023-04-19T11:07:04Z

Those distances are stored in the heaps, so after the execution, it should be possible to access the distances (here self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]).
Those are technical details that I should probably better document. I think I need some more time.

I understand. However I think the point where I added the code is not correct as _parallel_on_X_prange_iter_finalize will not contain the complete X. I think we might need to define this in _finalize where the computation is complete?

@jjerphan Does it seem correct to include the implementation in the _finalize_results method?

jjerphan · 2023-04-19T14:05:39Z

@jjerphan Does it seem correct to include the implementation in the _finalize_results method?

Thanks for the heads-up. I think we can first try implementing the logic in _finalize_results as you suggest.

OmarManzoor · 2023-04-26T12:24:06Z

@jjerphan I have made changes to accommodate the fit method with the dedicated backend. However there seems to be a complication in the score samples method now. The local reachability density needs the fit distances of the training data X.
dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]

But now since we use a dedicated backend we are not storing these distances anymore and as a result they are not available in the score_samples method.

OmarManzoor · 2023-04-26T12:48:32Z

@jjerphan I have made changes to accommodate the fit method with the dedicated backend. However there seems to be a complication in the score samples method now. The local reachability density needs the fit distances of the training data X. dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]

But now since we use a dedicated backend we are not storing these distances anymore and as a result they are not available in the score_samples method.

I would be grateful if you could provide any suggestions. I was thinking about storing the distances in the call to fit but then we would still need these distances when computing score_samples. That would mean the fit distances would need to be available in the ArgKminLRD backend but this kind of creates a difference between how we are using it in fit and score_samples.

OmarManzoor · 2023-04-26T13:08:53Z

@jjerphan I have made changes to accommodate the fit method with the dedicated backend. However there seems to be a complication in the score samples method now. The local reachability density needs the fit distances of the training data X. dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]
But now since we use a dedicated backend we are not storing these distances anymore and as a result they are not available in the score_samples method.

I would be grateful if you could provide any suggestions. I was thinking about storing the distances in the call to fit but then we would still need these distances when computing score_samples. That would mean the fit distances would need to be available in the ArgKminLRD backend but this kind of creates a difference between how we are using it in fit and score_samples.

I just realized we also need self._lrd in score_samples
lrd_ratios_array = self._lrd[neighbors_indices_X] / X_lrd[:, np.newaxis]

scikit-learn/sklearn/neighbors/_lof.py

Lines 468 to 483 in a9b06a7

    
           distances_X, neighbors_indices_X = self.kneighbors( 
        
               X, n_neighbors=self.n_neighbors_ 
        
           ) 
        
           if X.dtype == np.float32: 
        
               distances_X = distances_X.astype(X.dtype, copy=False) 
        
           X_lrd = self._local_reachability_density( 
        
               distances_X, 
        
               neighbors_indices_X, 
        
           ) 
        
           lrd_ratios_array = self._lrd[neighbors_indices_X] / X_lrd[:, np.newaxis] 
        
           # as bigger is better: 
        
           return -np.mean(lrd_ratios_array, axis=1)

This means that we need to return the lrd from the backend in any case. So we can't compute the negative outlier factor in the backend. But then we would also need to return neighbors_indices_fit_X as they are used to index into self._lrd to calculate lrd_ratios_array in the fit method

scikit-learn/sklearn/neighbors/_lof.py

Lines 290 to 307 in a9b06a7

    
           self._distances_fit_X_, _neighbors_indices_fit_X_ = self.kneighbors( 
        
               n_neighbors=self.n_neighbors_ 
        
           ) 
        
           if self._fit_X.dtype == np.float32: 
        
               self._distances_fit_X_ = self._distances_fit_X_.astype( 
        
                   self._fit_X.dtype, 
        
                   copy=False, 
        
               ) 
        
           self._lrd = self._local_reachability_density( 
        
               self._distances_fit_X_, _neighbors_indices_fit_X_ 
        
           ) 
        
           # Compute lof score over training samples to define offset_: 
        
           lrd_ratios_array = ( 
        
               self._lrd[_neighbors_indices_fit_X_] / self._lrd[:, np.newaxis] 
        
           )

Looking at this I think we might need to return the neighbor_sample_indices along with lrd.

…it method

jjerphan · 2023-05-14T08:46:23Z

Hi @OmarManzoor, must scikit-learn#26316 be reviewed before this PR? I trying to understand why this branch targets the main branch of your fork.

OmarManzoor · 2023-05-15T06:47:33Z

Hi @OmarManzoor, must scikit-learn#26316 be reviewed before this PR? I trying to understand why this branch targets the main branch of your fork.

Closed. This was just for discussion before opening the actual PR on scikit-learn.

ENH add pairwise distances backend for LocalOutlierFactor

fed7db5

OmarManzoor force-pushed the pairwise_distance_backend_for_lof branch from 4c1d801 to fed7db5 Compare April 3, 2023 10:44

OmarManzoor force-pushed the pairwise_distance_backend_for_lof branch 2 times, most recently from a0f5945 to 9d66cd7 Compare April 26, 2023 12:09

OmarManzoor force-pushed the pairwise_distance_backend_for_lof branch from 9d66cd7 to 1c5ddc0 Compare April 26, 2023 14:28

Update to _compute_local_reachability_density through _finalize_results

efeab7a

OmarManzoor force-pushed the pairwise_distance_backend_for_lof branch from 1c5ddc0 to efeab7a Compare May 2, 2023 09:08

OmarManzoor added 3 commits May 2, 2023 14:42

Merge branch 'main' into pairwise_distance_backend_for_lof

12d1f2f

PR suggestions

3926220

Remove the train and factor parameter and focus this PR on just the f…

1ea53da

…it method

OmarManzoor closed this May 15, 2023

jjerphan mentioned this pull request Jul 11, 2023

PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors.predict_proba #4

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ENH add pairwise distances backend for LocalOutlierFactor #2

ENH add pairwise distances backend for LocalOutlierFactor #2

Uh oh!

OmarManzoor commented Apr 3, 2023

Uh oh!

OmarManzoor commented Apr 3, 2023

Uh oh!

jjerphan commented Apr 6, 2023 •

edited

Loading

Uh oh!

OmarManzoor commented Apr 6, 2023

Uh oh!

OmarManzoor commented Apr 19, 2023

Uh oh!

jjerphan commented Apr 19, 2023

Uh oh!

OmarManzoor commented Apr 26, 2023

Uh oh!

OmarManzoor commented Apr 26, 2023 •

edited

Loading

Uh oh!

OmarManzoor commented Apr 26, 2023 •

edited

Loading

Uh oh!

jjerphan commented May 14, 2023

Uh oh!

OmarManzoor commented May 15, 2023

Uh oh!

Uh oh!

ENH add pairwise distances backend for LocalOutlierFactor #2

ENH add pairwise distances backend for LocalOutlierFactor #2

Uh oh!

Conversation

OmarManzoor commented Apr 3, 2023

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Uh oh!

OmarManzoor commented Apr 3, 2023

Uh oh!

jjerphan commented Apr 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

OmarManzoor commented Apr 6, 2023

Uh oh!

OmarManzoor commented Apr 19, 2023

Uh oh!

jjerphan commented Apr 19, 2023

Uh oh!

OmarManzoor commented Apr 26, 2023

Uh oh!

OmarManzoor commented Apr 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

OmarManzoor commented Apr 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jjerphan commented May 14, 2023

Uh oh!

OmarManzoor commented May 15, 2023

Uh oh!

Uh oh!

jjerphan commented Apr 6, 2023 •

edited

Loading

OmarManzoor commented Apr 26, 2023 •

edited

Loading

OmarManzoor commented Apr 26, 2023 •

edited

Loading