[MRG+1] OPTICS Change quick_scan floating point comparison to isclose #11929


Merged

merged 24 commits into scikit-learn:master on Sep 16, 2018

Conversation


@adrinjalali adrinjalali commented Aug 28, 2018

See #11857, #11916, #11878, and MacPython/scikit-learn-wheels#7

The part of the fix which I believe is valid is the fix for the comparison between two floats.

Other than that, scaling up the locations of the points and reducing the number of points in each group fixes the test issue while still keeping the test valid (IMO).

Does this look okay @jnothman, @espg ?


@jnothman jnothman left a comment


Yes, I'm happy with this

@adrinjalali adrinjalali changed the title Fix quick_scan and reachability test for OPTICS [MRG+1] Fix quick_scan and reachability test for OPTICS Aug 29, 2018
@@ -190,226 +189,65 @@ def test_reach_dists():
# Expected values, matches 'RD' results from:
# http://chemometria.us.edu.pl/download/optics.py
Member

Have you made sure the results indeed match with this implementation?

Member Author

@lesteve I actually had not; somehow I missed this comment. But I did now, and the results are not the same, not even close actually. I also checked whether the values I'm replacing match the output of that code, and they don't either. So at some point in time, either somebody forgot to check against the code provided at that URL, or the code at that URL has changed and no longer matches what it used to be. @espg any idea?

Contributor

I'll look for the IPython notebook where I generated the values; I think I remember having to sort the output by the ordering indices to get them to match...

C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * rng.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * rng.randn(n_points_per_cluster, 2)
n_points_per_cluster = 50
@qinhanmin2014 qinhanmin2014 Sep 3, 2018

Why did you change the test here? I'm fine with reducing n_points_per_cluster, but .8 * rng.randn(n_points_per_cluster, 2) -> 8 * rng.randn(n_points_per_cluster, 2) seems strange. It's hard to form a cluster with your version.

Member Author

The issue was somewhat related to floating point precision, so we had the idea of scaling up the data. But you're right, to scale the data I also need to scale the centers. I'll do so.

Member

How about keeping the original test? Or is it that the original test cannot detect your problem, but the new test can?

Member

Also, I don't think it's good to take the expected result from our own implementation. Where does your expected result come from?

Member Author

The original test fails on 32 bit systems.

Member Author

The related conversation happened after this comment: #11857 (comment)

Member

The original test fails on 32 bit systems.

I assume this PR aims to solve that problem, but it seems I'm wrong? So it's impossible to pass the original test on a 32-bit system? If so, is it possible to only reduce n_points_per_cluster? If not, then I think it's acceptable to scale up the data, as long as you get the expected results from another implementation (e.g., R).

Member Author

I could do that, but are you opposed to changing the == to np.isclose when comparing floats?

@qinhanmin2014

I can't find any detailed information about the code from http://chemometria.us.edu.pl/download/optics.py. Is it cited or mentioned anywhere else? Maybe it's better to compare with R?

@qinhanmin2014

are you opposed to changing the == to np.isclose

No, we've encountered similar problems (e.g., #11551).

It seems that both the old test and the new test pass on master and in this PR (correct me if I'm wrong). So I'd prefer a minimal +1/-1 PR unless you give me a reason to change the test.

But there's another problem here. It seems that the performance of OPTICS suffers if we use np.isclose in the Cython code. I'm not sure what the solution is (maybe some low-level function).

%timeit test_reach_dists()  # using ==
# 1 loop, best of 3: 1.11 s per loop
%timeit test_reach_dists()  # using np.isclose
# 1 loop, best of 3: 38 s per loop
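
As a rough illustration of where the overhead likely comes from (a sketch outside Cython, not code from this PR): np.isclose goes through numpy's array machinery on every call, while a plain scalar check such as math.isclose (PEP 485, Python 3.5+) stays at C level, so calling np.isclose once per pair inside a hot loop dominates the runtime. Exact timings will vary by machine:

import math
import timeit

import numpy as np

a, b = 0.1 + 0.2, 0.3

# Each np.isclose call on two scalars builds array wrappers and dispatches through
# numpy's array machinery, which dominates when the check sits inside a hot loop.
print(timeit.timeit(lambda: np.isclose(a, b), number=100_000))

# math.isclose (PEP 485) is a single C-level scalar comparison by contrast.
print(timeit.timeit(lambda: math.isclose(a, b), number=100_000))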

@adrinjalali

It passes the tests because we don't build/test for 32-bit systems on this repo. That's why @jnothman sent this quick fix: (#11916).

The issue is not fixed yet on master.

On top of that, the current test on master fails on 32-bit arch even with the np.isclose proposed here. That's why we're changing the test.

I'm trying to figure out a faster way than np.isclose, unless you have a solution in mind.

@adrinjalali

Regarding the np.isclose performance issue: it seems that if we replace it with a Cython version of what's proposed in PEP 485, we don't pay any performance penalty.

I've put that Cython version in this PR, since the stdlib math.isclose is only available from Python 3.5.
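
As a quick sanity check (not part of the PR), the PEP 485 formula used for the Cython helper can be compared against the stdlib math.isclose; for finite inputs the two agree:

import math
import random

def isclose_pep485(a, b, rel_tol=1e-09, abs_tol=0.0):
    # The comparison formula proposed in PEP 485, matching the Cython helper in this PR.
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

random.seed(0)
for _ in range(10000):
    a = random.uniform(-1e6, 1e6)
    b = a * (1 + random.choice([0.0, 1e-12, 1e-9, 1e-6]))
    assert isclose_pep485(a, b) == math.isclose(a, b)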

@qinhanmin2014

and then when asked to compare against a reference implementation, the reachability output matched.

Yes, but my point here is that if the referenced implementation is not cited/mentioned elsewhere (indicating that it's not a popular implementation), this implementation might have some bugs.

I'd trust the Elki / java implementation more than the R one, since it was done by the original paper authors...

Agreed, it's good to compare with the implementation from the original authors. @espg Your contribution here would be very welcome, but I don't think it's the most urgent thing. For now, I think we need to ensure the correctness of our implementation.

@adrinjalali For the current PR, I still fail to understand why we need to change the test.

On top of that, the current test on master fails on 32bit arch even with the np.isclose proposed here. That's why we're changing the test.

Yes, it fails on 32-bit machines, but we actually can't find the reason, right? If so, I prefer to keep the test to make the PR minimal. We've already skipped some tests on 32-bit machines.

Regarding the PR itself, I'm not against getting rid of ==, but I would like to see a test which fails on master but passes with this PR. I think that will be difficult, so I won't insist on it.

@qinhanmin2014

Brian Clowers' code is a port of Michal Daszykowski's MATLAB code; see here and here. Michal Daszykowski has two publications that reference or use his OPTICS implementation, and his implementation has its own DOI (DOI: 10.13140/RG.2.1.3998.3843).

Thanks @espg (though I don't have time to read them now), I'm now much more confident about using the implementation.

@jnothman

jnothman commented Sep 4, 2018 via email

@adrinjalali

This is what I can add:

  1. Issue Related to 32-bit Systems
    The issue starts here:
        if len(unproc) > 0:
            dists = pairwise_distances(P, np.take(X, unproc, axis=0),
                                       self.metric, n_jobs=None).ravel()

            rdists = np.maximum(dists, self.core_distances_[point_index])
            new_reach = np.minimum(np.take(self.reachability_, unproc), rdists)
            self.reachability_[unproc] = new_reach

        # Checks to see if everything is already processed;
        # if so, return control to main loop
        if unproc.size > 0:
            # Define return order based on reachability distance
            return(unproc[quick_scan(np.take(self.reachability_, unproc),
                                     dists)])

and comparing the values between the 32- and 64-bit archs, it's new_reach that differs between the two (on the order of 1e-13). As a result, quick_scan returns a different index, and the points are then processed in a different order (the sketch at the end of this comment illustrates how such a tiny perturbation flips the index quick_scan returns). This is why one of the reachability values is significantly different between the two archs. IMO, if we're testing the reachability part of the algorithm, we should also check ordering_; if we do, we see a significant difference between the two archs, and the test would fail as it is now, no matter what tolerance we set when comparing reachability_.

That means the test fails on a 32-bit arch because of a precision issue, but that precision issue occurs at a point which makes the algorithm follow different paths on the two archs, i.e. the reachability_ values differ because the algorithm processes points in a different order, not directly because of a 32-bit floating point calculation issue on reachability_ itself.

  2. Performance Issue
    This PR's implementation in _optics_inner.pyx:
cdef isclose(double a, double b, double rel_tol=1e-09, double abs_tol=0.0):
    return abs(a-b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

does NOT impose any performance penalty, as these timings show:

%timeit test_reach_dists() # with ==
736 ms ± 48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_reach_dists() # with isclose as implemented in this PR
735 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  3. Different Proposed Solutions
    3.1. Add a tolerance to all comparisons
    This was the first solution I came up with, and it isn't a good idea. It would be a change like this (f50e778?diff=unified):
    for i from 0 <= i < n:
-       if rdists[i] < rdist:
+       if rdists[i] + 1e-7 < rdist:
            rdist = rdists[i]
            dist = dists[i]
            idx = i
-       if rdists[i] == rdist:
-           if dists[i] < dist:
+       if np.abs(rdists[i] - rdist) < 1e-7:
+           if dists[i] + 1e-7 < dist:
                dist = dists[i]
                idx = i

The above proposal would make the test pass on both archs without any change to the test. But as @jnothman also pointed out (#11857 (comment)):

I'm not entirely convinced that it's the right fix. It will make it consistent across platforms, but can easily produce a different order from expected.

3.2. Upscale the test data by a factor of 100
The idea was to multiply the data by, say, 100 and hope that the distances would be large enough for the two archs to give the same results. But it didn't work.

3.3. Reduce the number of points in each cluster
The data is only 2-dimensional, so some of the randomly generated points may be really close to each other. Reducing the number of points in each cluster to 50 makes both archs give the same results (including ordering_). So yes, we don't have to upscale the data if we reduce the number of points to 50. I hadn't checked this scenario earlier due to time pressure; I tested it today and it works.

  4. Summary
  • There seems to be consensus that using isclose is better than == when comparing floats, and since this implementation doesn't hurt the performance of the algorithm, it seems reasonable to keep it.
  • Reducing the number of points in each cluster to 50 makes the algorithm produce the same results on both archs (with or without isclose), while keeping the test reasonable enough to be valid.

I really prefer a test which passes regardless of the arch, rather than having two tests. That would mean the only thing left is to get the reachability values from a different implementation rather than our own.

Does this seem reasonable @espg, @jnothman, @qinhanmin2014?
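
For reference, a rough pure-Python rendering of quick_scan's selection rule (a sketch for illustration, not the PR code itself) shows how rounding noise of the order of 1e-13 in the reachability values can flip the returned index, and hence the processing order:

import numpy as np

def quick_scan_py(rdists, dists):
    # Pure-Python sketch of quick_scan's selection rule: take the index of the
    # smallest reachability distance, breaking exact ties by the raw distance.
    rdist = dist = np.inf
    idx = -1
    for i in range(len(rdists)):
        if rdists[i] < rdist:
            rdist, dist, idx = rdists[i], dists[i], i
        if rdists[i] == rdist:      # the exact float comparison is the fragile step
            if dists[i] < dist:
                dist, idx = dists[i], i
    return idx

dists = np.array([1.0, 2.0])
exact_tie = np.array([5.0, 5.0])            # 64-bit-like values: tie broken by dists -> index 0
perturbed = np.array([5.0 + 1e-13, 5.0])    # rounding noise of the kind seen on 32-bit -> index 1
print(quick_scan_py(exact_tie, dists), quick_scan_py(perturbed, dists))  # prints: 0 1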

@espg

espg commented Sep 4, 2018

I'm fine with using 50 points per cluster if it fixes all of our problems; the only reason 250 was used was to match the unit test to the documentation example. The 'chemometria' OPTICS implementation that I used for the test array is here, although note that it has a dependency on hcluster that prevents it from being called in the unit test itself. If you'd like me to, I can open a PR that amends the testing array to use 50 points per cluster.

@jnothman jnothman left a comment

I'm very happy with that analysis and the proposal to reduce the number of samples per cluster.

@@ -5,6 +5,9 @@ cimport cython
ctypedef np.float64_t DTYPE_t
ctypedef np.int_t DTYPE

cdef isclose(double a, double b, double rel_tol=1e-09, double abs_tol=0.0):
Member

Use cdef inline rather than relying on the C compiler to work it out
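
A minimal sketch of the suggested change (same body and signature as the helper in the diff above, only marked inline):

# `cdef inline` adds the C `inline` hint to the generated helper, rather than
# relying on the C compiler to inline a plain cdef function on its own.
cdef inline isclose(double a, double b, double rel_tol=1e-09, double abs_tol=0.0):
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)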

@qinhanmin2014

Thanks @adrinjalali and @espg
(1) +1 to replacing == with a cdef inline isclose, though I still wonder if there's a better way to do such a simple thing than defining a function.
(2) +1 to reducing the number of samples to 50, since it will make it easier for us to construct a test. I'm still not persuaded to change the definitions of C1-C6; or at least, @adrinjalali, you should scale up the centers to make the test reasonable.
(3) +1 to checking reachability_ and ordering_, but that's not urgent, so it can be done in a separate PR.
(4) @adrinjalali Please make sure that you're generating the expected result using the referenced script.

@adrinjalali

@espg I've reverted the test to match master, since I couldn't generate the expected values with your script (I checked the ordering issue and a possibly missing sqrt, but still no luck).

Could you please send a PR with 50 points in each group? There is no need to change the C1 to C6 definitions.

@qinhanmin2014 qinhanmin2014 left a comment

LGTM, ping @jnothman

@adrinjalali adrinjalali changed the title [MRG+1] Fix quick_scan and reachability test for OPTICS [MRG+1] OPTICS Change quick_scan floating point comparison to isclose Sep 5, 2018
@espg

espg commented Sep 6, 2018

@adrinjalali Sure, working on it now.

@espg

espg commented Sep 7, 2018

I'm running into weirdness... identical results between the two implementations when using n=250 per group, but divergent answers when using n=50. I'm not sure what to do here; apparently the implementations only agree on datasets above a certain size. So using n=50 will cause the comparison to fail, while using 250 will make them agree, but then the numerical instability will cause the 32-bit comparisons to diverge.

I like this test quite a lot, as it directly verifies that the reachability values aren't changing or drifting... but I'm at a loss as to why the other OPTICS code only agrees on large datasets. We could change to 50 points per cluster and use the output from our OPTICS at merge time; that would fix the 32-bit issue and let us know if we accidentally introduce any changes that alter the core numeric output of the algorithm. But we would only be looking internally, and not checking against another independent implementation. Are we comfortable with that at this point?

@jnothman

jnothman commented Sep 7, 2018 via email

@jnothman

jnothman commented Sep 7, 2018 via email

@espg

espg commented Sep 10, 2018

min_samples doesn't seem to change things much; setting different values still gives disagreement between the implementations at lower densities. There is an offset between the implementations: the Brian Clowers code doesn't include the point under consideration in the density estimation, while ours does. In practice, this means that the outputs are identical in the 250-points-per-cluster case if we set min_samples=10 for our implementation and min_samples=9 for the Clowers code. This might explain why we don't get identical results at the lower point density, although I'd have to think a bit more about why...
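
A small sketch of that off-by-one (an illustration under the usual k-nearest-neighbour view of the core distance, not code from either implementation): when the training points are also the query points, each point appears as its own first neighbour at distance 0, so the k-th neighbour counting the point itself equals the (k-1)-th neighbour excluding it:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(100, 2)

# kneighbors on the training set returns each point as its own first neighbour
# (distance 0), so 10 neighbours "including self" == the 9 nearest other points.
dist, _ = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X)

core_self_included = dist[:, 10 - 1]         # min_samples=10, query point counted
core_self_excluded = dist[:, 1:][:, 9 - 1]   # min_samples=9, query point dropped

# both pick the same column of `dist`, which is exactly the offset described above
assert np.allclose(core_self_included, core_self_excluded)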

@jnothman

jnothman commented Sep 11, 2018 via email

@espg

espg commented Sep 12, 2018

@jnothman It took me a bit to digest your earlier comment, but I think I finally have a solution for this. See #12054

@espg

espg commented Sep 12, 2018

@jnothman our definition of whether or not min_samples includes the query point is tied to maintaining compatibility with the DBSCAN module, which includes the query point. Having both include the query point ensures that extraction using extract_dbscan gives identical or near-identical results.

@jnothman

jnothman commented Sep 13, 2018 via email

@adrinjalali

Is there anything left here for me to do? I guess this can be merged.

@jnothman jnothman merged commit a79d44e into scikit-learn:master Sep 16, 2018
@jnothman

Thanks @adrinjalali

@adrinjalali adrinjalali deleted the optics32_pr branch September 16, 2018 07:35