-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Improve the creation of KDTree and BallTree on their worst-case time complexity #19473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… original partition algorithm.
# Conflicts: # sklearn/neighbors/setup.py
1. Add "_" in the file name; 2. Make the comparator stable; 3. Use a template implementation of partition_node_indices, the original one has hard-coded types(DTYPE_t -> double, ITYPE_t -> Py_intptr_t) which would be easily broken when these type changes.
Using a simple ASV benchmark (to put in There is a significant improvement for
|
The latest asv benchmark script is available online here: https://gist.github.com/jjerphan/6c1f3e4c80908b862ccce6682835d36a
Actually, It used to be worst-case O(n log n) on GNU, as they used fallback to heapselect. This was a known bug as it violated the C++ standard. It is about 10 years since I saw a bug report on this, so presumably it is fixed by now. |
Thanks @sturlamolden for this pointer! My results rely on GCC:
I'll include my machine setup in my comment above. Ideally, we would like to run those benchmarks on configurations using different compilers. |
Thanks for the PR @jjerphan As far as I know, we've been reluctant to include C++ code - unless in cases of vendoring libraries like libsvm, liblinear and murmur Have you considered alternatives that might not require writing C++ ourselves? For example, is there a way to write the comparator directly in Cython and bypass the wrapper? Personally, I'm not necessarily opposed to having C++ within the codebase (this is something I considered at some point for #18394). But this needs to be carefully examined, as this would set a precedent. |
Hi @NicolasHug Thanks for your comment. Yes, I am currently trying to see how I can define and use |
Got it, you might want to prefix the PR tile as |
It is only a few lines of C++ inlined into the Cython code. In this case it cannot be avoided, as Cython cannot implement a C++ class with member functions. The The alternative would be to write introselect in C or Cython. There is C code for it in NumPy. It uses median-of-5 as fallback, so is worst-case O(n). It should not be that difficult to clean it up. https://github.com/numpy/numpy/blob/master/numpy/core/src/npysort/selection.c.src |
I agree, although the size or the complexity of the code is not really what matters here.
I agree this is worth considering - in Cython though, not C, for the same reasons as above. For ref, we do have a custom implementation of introsort (not introselect) with median3 in Cython: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/tree/_splitter.pyx#L496:L496 |
Then what is it exactly that matters here? In terms of maintainability, this snipplet of inlined C++ is a nobrainer. It is just there because Cython cannot use
True, but this would lead to code that are much more complex and has a higher chance of containing errors than a tiny driver for |
Isolated, yes. In the context of the whole project, no. This one might be simple, but it opens the door to more complex contributions that we may not want. It sets a precedent, as I mentioned, and we need to have clear requirements as to what we can and cannot accept in the project. So far, we have not accepted C++ code (except when vendoring stuff) Again, I'm not opposed to it. I'm only saying this needs to be discussed, in particular with other devs. If there's a simple enough solution that doesn't require input from the rest of the core devs, it is preferable. The decision to favor Cython over C/C++ code was made long before I got involved, so I'm not entirely sure of what drove the decision. One argument I believe is that Cython is easier to write than C/C++ for contributors who are used to Python. If you ask me, after having written thousands of lines of Cython, I personally find that this is absolutely not true. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's difficult to implement a compatible comparator in Cython, I see no great problem with this. I don't think this is going to set us on a dangerous slippery slope towards accepting contributions with large portions in C/++. It's only really defining a comparator (and abstracting that away so that it can be used in Cython). I think it would benefit from a few more comments under the assumption that readers of this codebase won't often know C++.
It might actually be possible to avoid a functor here. One could use a global Using a mutex would also work, but it would prevent concurrent construction of kd-trees. |
We could reduce the amount of inlined C++ to three lines:
Then we could cdef these static globals without |
On the one hand, I believe in the idea that relying on a efficient and well tested implementation of algorithms is better than reimplementing it. It seems that this is the case for GCC's implementation of But on the other hand, I do understand and respect the will of minimising — or even of getting ride of — code which isn't written Python or Cython within scikit-learn. I've spent some time yesterday trying to reduce the amount of C++ code: it is not easy when it comes to implement a comparator, as underlined by @jnothman and I haven't been able to converge to something that was simpler¹. To pursue on @NicolasHug's remark that I share:
Would having a Cython comparator for a C++ interface be even easier for another code reader to understand? Furthermore, I haven't inspect @sturlamolden's proposals to remove the C++ part and @NicolasHug's pointer to a Cython implementation of another algorithm. In both case, this would remove C++ adherence, but , to me, this would also add complexity with bits that are harder to understand. Do you agree? Finally. I haven't (yet) been able to find other discussions or PRs which are mentioning the motive for solely using Python and Cython — please do share pointers if you have! Using C++ here (as off 711ea35) does not come with the cons of code vendoring (for instance: new file extension, an extra setup for builds, code duplication between projects, larger wheels and source distributions, etc.): it is one good/simple/natural glue which offers a fair trade-off to expose an efficient and trusted implementation. But there might be others, and I would be glad if this could be further improved in an even simpler fashion! 🙂 Technical note: [1] See this part of Cython docs for technical aspects regarding using C++ interfaces in Cython while minimising the amount of C++ code). |
I don’t know. Using global variables does not add complexity, in terms of the required C++.
Personally I think the solution with an inlined struct is perfectly ok. It is what we use in SciPy, though almost all of the SciPy kd-tree code is C++. |
I should perhaps mention that NumPy also makes use of thread local globals. E.g. if we have a Fortran function that makes callbacks to Python, then f2py will store the global state in a threadlocal variable. |
I'm wondering if this change is worth a changelog entry... the check is green because no tests are modified, but this is meant to improve performances, right? |
This is solely meant to improve performance, yes. Let me know if I should add a change-log entry, on my end I am mainly looking for other maintainer's opinions, suggestions and/or approvals. |
A change log entry is worthwhile. Apart from attribution, it allows people
to expect, and diagnose a difference.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changelog should reflect what the C++ standard says and all current implementations do, not what a 8 year old bug in GNU libstdc++ used to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to side with allowing more c++ code because I think it makes code easier to read and write. Having access to c++ containers and algorithms can help reduce the amount of code we have to maintain. (In this PR, we are taking advantage of algorithm
)
Reintroduce the docstring of partition_node_indices. Co-authored-by: "Thomas J. Fan" <thomasjpfan@gmail.com>
Co-authored-by: "Thomas J. Fan" <thomasjpfan@gmail.com>
As the changes relate to both BinaryTree's subclasses.
Here are some (re-organisated) results from the latest benchmark script which now also compares changes on
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MInor comment, otherwise LGTM
…ime complexity (scikit-learn#19473) Co-authored-by: jiefangxuanyan <505745416@qq.com> Co-authored-by: "Thomas J. Fan" <thomasjpfan@gmail.com>
Reference Issues/PRs
Supersedes #11103.
Fixes #7687.
What does this implement/fix? Explain your changes.
The creation of$\mathcal{O}(n^2)$ on some edges cases (for instance on sorted inputs and duplicate inputs).
KDTree
andBallTree
(which relies on partial quick-sort) suffers from higher worst-case time complexity ofThe C++$\mathcal{O}(n)$ .
std::nth_element
algorithm is a better alternative in those case as it relies on a intro-select algorithm which has a worst case time complexity ofThis PR replaces this first algorithm with this second using a comparator and a minimal Cython wrapping.