-
Notifications
You must be signed in to change notification settings - Fork 34
Allow strings and numbers as labels in cut #393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Could you please add to the docstring (or comment here) what contract change you exactly propose, i.e. the exact rules what inputs are accepted and what outputs they should produce. Thank you! |
I'm sorry! Guess I jumped the gun there a bit. Let Currently: Proposed: Allow It appears the limitation to String levels was introduced due to stability issues in the return type. As far as I can tell, those do not exist in more recent versions of Julia. The change would mean closer feature parity with In light of that, my above comments hopefully make more sense. There is one thing that could potentially break something downstream. I've amended the original comment to keep bullet points together. |
OK - so I understand the use-case is to e.g. assign a floating middle of the interval as a level, so that later user can work with it programmatically. Right? |
Exactly. I saw the (linked above) topic on discourse and wondered why that wasn't supported. Maybe you know of a good reason why not. |
@nalimilan designed it, but I guess the reason was that cut values are naturally considered to be labels. However, I agree that sometimes it is useful to be able to work with them programmatically later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think the points you list are OK. I just wonder whether we can preserve inferrability.
src/extras.jl
Outdated
levs = [labels(from[1], to[1], 1, | ||
leftclosed=breaks[1] != breaks[2], rightclosed=false)] | ||
resize!(levs, n-1) | ||
_L = eltype(levs) | ||
for i in 2:n-2 | ||
levs[i] = labels(from[i], to[i], i, | ||
leftclosed=breaks[i] != breaks[i+1], rightclosed=false) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach is a bit weird. Can you instead try something along the lines of this?
levs = [i <= 2 ? ... : ... for i in 1:n]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reads very clunkily, and the resizing is certainly not optimal. But at least on 1.7 inference works, while it fails with a comprehension. I will keep it close to your original implementation
firstlevel = labels(from[1], to[1], 1, leftclosed=breaks[1] != breaks[2], rightclosed=false)
levs = Vector{typeof(firstlevel)}(undef, n-1)
levs[begin] = firstlevel
for i in 2:n-2
levs[i] = labels(from[i], to[i], i, leftclosed=breaks[i] != breaks[i+1], rightclosed=false)
end
levs[end] = labels(from[end], to[end], n-1,
leftclosed=breaks[end-1] != breaks[end],
rightclosed=coalesce(extend, false))
It seems crucial to typecast levs
explicitly. I tried a map
too, but that doesn't work either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. That's weird, but I guess using the type of the first value is OK as we don't expect people to use multiple types anyway.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Could you please add tests for typically problematic cases like: duplicate numeric labels, or having both |
@nalimilan Thank you for the thorough review. I will push changes soon. I think I have tracked down an inference issue and it may just work on Julia 1.6. It had trouble inferring @bkamins Certainly! |
Concerning
Do you want |
Yes, it is valid in CategoricalArrays.jl and expected. I think the same should be respected in this PR (i.e. labels should be unique w.r.t. The point is that
but of course this is an orthogonal issue (it just prompted me to cover this case in tests) |
nicer levels gathering
Int64 -> Int for x86
But only when used as @nalimilan Tests on Julia 1.0 failed, because of a |
I'm not aware of any automated tool. Compat.jl is often useful but it doesn't support all new syntaxes. |
src/extras.jl
Outdated
@@ -152,10 +160,11 @@ function _cut(x::AbstractArray{T, N}, breaks::AbstractVector, | |||
end | |||
end | |||
if !ismissing(min_x) && breaks[1] > min_x | |||
breaks = [min_x; breaks] | |||
# this typecast is needed on Julia<1.7 for stable inference | |||
breaks = eltype(breaks)[min_x; breaks] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this also work? Using eltype(breaks)
won't work in all cases, e.g. if breaks are Integer
but x
contains floats.
breaks = eltype(breaks)[min_x; breaks] | |
ET = promote_type(nonmissingtype(eltype(x)), eltype(breaks)) | |
breaks = ET[min_x; breaks] |
Another solution which would be cleaner if it works would be to just add min_x::nonmissingtype(eltype(x))
as it's probably the source of the problem. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your last proposal seems to work fine. I'm just running tests locally before pushing. Indeed, a Float to Int conversion due to my implementation was the reason for a failing doc test.
src/extras.jl
Outdated
levs = [labels(from[1], to[1], 1, | ||
leftclosed=breaks[1] != breaks[2], rightclosed=false)] | ||
resize!(levs, n-1) | ||
_L = eltype(levs) | ||
for i in 2:n-2 | ||
levs[i] = labels(from[i], to[i], i, | ||
leftclosed=breaks[i] != breaks[i+1], rightclosed=false) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. That's weird, but I guess using the type of the first value is OK as we don't expect people to use multiple types anyway.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thanks! |
Thank you for patient reviews :) |
Referring to https://discourse.julialang.org/t/creating-data-bins-with-numeric-labels-with-cut/81281/10?u=skleinbo
Inference seems to be fine. Tests pass on
Julia 1.7
.Three things:
1.6
, but that is due toBase.return_types
not inferring correctly. The actual return type is as expected.That's not ideal. Should the test be taken out, somehow altered to pass on older Julia versions (don't know how tbh), or executed conditionally?
AbstractVector{<:SupportedTypes}
. E.g. just passing[1,"2",3]::Vector{Any}
fails, whileUnion{Int, String}[1,"2",3]
is fine. Should an effort be made to automatically do a conversion? Given the IMHO obscurity of mixed labels, should this even be mentioned in the docs? I say no on both points.cut
implicitly casts labels of anyAbstractString
toString
. While not promised, someone downstream might rely on that conversion.Thanks!