Skip to content

Allow strings and numbers as labels in cut #393

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 23, 2022

Conversation

skleinbo
Copy link
Contributor

@skleinbo skleinbo commented May 20, 2022

Referring to https://discourse.julialang.org/t/creating-data-bins-with-numeric-labels-with-cut/81281/10?u=skleinbo

Inference seems to be fine. Tests pass on Julia 1.7.

Three things:

  1. There is a test error on 1.6, but that is due to Base.return_types not inferring correctly. The actual return type is as expected.
    That's not ideal. Should the test be taken out, somehow altered to pass on older Julia versions (don't know how tbh), or executed conditionally?
  2. Technically, mixed labels are possible when passed as a AbstractVector{<:SupportedTypes}. E.g. just passing [1,"2",3]::Vector{Any} fails, while Union{Int, String}[1,"2",3] is fine. Should an effort be made to automatically do a conversion? Given the IMHO obscurity of mixed labels, should this even be mentioned in the docs? I say no on both points.
  3. A label formatter must always return the same concrete type. That's a disparity to 2. and one could jump through the same hoops to allow for mixed labels, but is it worth it?
  4. The current cut implicitly casts labels of any AbstractString to String. While not promised, someone downstream might rely on that conversion.

Thanks!

@bkamins
Copy link
Member

bkamins commented May 20, 2022

Could you please add to the docstring (or comment here) what contract change you exactly propose, i.e. the exact rules what inputs are accepted and what outputs they should produce. Thank you!

@skleinbo
Copy link
Contributor Author

I'm sorry! Guess I jumped the gun there a bit.

Let SupportedTypes = Union{AbstractString, AbstractChar, Number}.

Currently: cut takes a labels::Union[AbstractString, Function::AbstractString} kwarg and returns CategoricalArray with levels of type String or Union{String, Missing} depending on whether the input array has missing values or extend=missing is given.

Proposed: Allow cut to take more general labels::Union{AbstractVector{L}, Function::L} where L<:SupportedTypes and return CategoricalArray with levels of type L or Union{L, Missing} depending on whether the input array has missing values or extend=missing is given.

It appears the limitation to String levels was introduced due to stability issues in the return type. As far as I can tell, those do not exist in more recent versions of Julia.

The change would mean closer feature parity with pandas.cut

In light of that, my above comments hopefully make more sense.

There is one thing that could potentially break something downstream. I've amended the original comment to keep bullet points together.

@bkamins
Copy link
Member

bkamins commented May 20, 2022

OK - so I understand the use-case is to e.g. assign a floating middle of the interval as a level, so that later user can work with it programmatically. Right?

@skleinbo
Copy link
Contributor Author

Exactly. I saw the (linked above) topic on discourse and wondered why that wasn't supported. Maybe you know of a good reason why not.

@bkamins
Copy link
Member

bkamins commented May 20, 2022

@nalimilan designed it, but I guess the reason was that cut values are naturally considered to be labels. However, I agree that sometimes it is useful to be able to work with them programmatically later.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think the points you list are OK. I just wonder whether we can preserve inferrability.

src/extras.jl Outdated
Comment on lines 192 to 199
levs = [labels(from[1], to[1], 1,
leftclosed=breaks[1] != breaks[2], rightclosed=false)]
resize!(levs, n-1)
_L = eltype(levs)
for i in 2:n-2
levs[i] = labels(from[i], to[i], i,
leftclosed=breaks[i] != breaks[i+1], rightclosed=false)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is a bit weird. Can you instead try something along the lines of this?

levs = [i <= 2 ? ... : ... for i in 1:n]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reads very clunkily, and the resizing is certainly not optimal. But at least on 1.7 inference works, while it fails with a comprehension. I will keep it close to your original implementation

firstlevel = labels(from[1], to[1], 1, leftclosed=breaks[1] != breaks[2], rightclosed=false)
levs = Vector{typeof(firstlevel)}(undef, n-1)
levs[begin] = firstlevel
for i in 2:n-2
    levs[i] = labels(from[i], to[i], i, leftclosed=breaks[i] != breaks[i+1], rightclosed=false)
end
levs[end] = labels(from[end], to[end], n-1,
                   leftclosed=breaks[end-1] != breaks[end],
                   rightclosed=coalesce(extend, false))

It seems crucial to typecast levs explicitly. I tried a map too, but that doesn't work either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. That's weird, but I guess using the type of the first value is OK as we don't expect people to use multiple types anyway.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member

bkamins commented May 21, 2022

Could you please add tests for typically problematic cases like: duplicate numeric labels, or having both -0.0 and 0.0 as labels? Thank you!

@skleinbo
Copy link
Contributor Author

skleinbo commented May 21, 2022

@nalimilan Thank you for the thorough review. I will push changes soon. I think I have tracked down an inference issue and it may just work on Julia 1.6. It had trouble inferring breaks with input arrays that had Missing in their type. I'll have to write more tests first though.

@bkamins Certainly!

@skleinbo
Copy link
Contributor Author

Concerning -0.0, 0.0. This is valid independent of cut

julia> CategoricalArray([-0.0, 0.0, 1.0]; levels=[-0.0, 0.0, 1.0])
3-element CategoricalArray{Float64,1,UInt32}:
 -0.0
 0.0
 1.0

Do you want cut to check for duplicates with == rather than isequal regardless?

@bkamins
Copy link
Member

bkamins commented May 22, 2022

Yes, it is valid in CategoricalArrays.jl and expected. I think the same should be respected in this PR (i.e. labels should be unique w.r.t. isequal). I just want it tested.

The point is that cat uses == for values:

julia> cut([-0.0, 0.0], 2)
ERROR: ArgumentError: could not extend breaks as all values are equal: please specify at least two breaks manually

but of course this is an orthogonal issue (it just prompted me to cover this case in tests)

skleinbo added 3 commits May 22, 2022 16:55
nicer levels gathering
Int64 -> Int for x86
@skleinbo
Copy link
Contributor Author

The point is that cat uses == for values:

But only when used as cut(x, n), because then values pass through quantile I guess?`
Anyway, orthogonal issue like you say.

@nalimilan Tests on Julia 1.0 failed, because of a begin in indexing, and on x86 because of an Int64 instead of Int in a test. Regarding the former, I find it difficult to remember what syntax was valid at which point. Wasn't there a tool to check for backward compatible syntax?

@nalimilan
Copy link
Member

@nalimilan Tests on Julia 1.0 failed, because of a begin in indexing, and on x86 because of an Int64 instead of Int in a test. Regarding the former, I find it difficult to remember what syntax was valid at which point. Wasn't there a tool to check for backward compatible syntax?

I'm not aware of any automated tool. Compat.jl is often useful but it doesn't support all new syntaxes.

src/extras.jl Outdated
@@ -152,10 +160,11 @@ function _cut(x::AbstractArray{T, N}, breaks::AbstractVector,
end
end
if !ismissing(min_x) && breaks[1] > min_x
breaks = [min_x; breaks]
# this typecast is needed on Julia<1.7 for stable inference
breaks = eltype(breaks)[min_x; breaks]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also work? Using eltype(breaks) won't work in all cases, e.g. if breaks are Integer but x contains floats.

Suggested change
breaks = eltype(breaks)[min_x; breaks]
ET = promote_type(nonmissingtype(eltype(x)), eltype(breaks))
breaks = ET[min_x; breaks]

Another solution which would be cleaner if it works would be to just add min_x::nonmissingtype(eltype(x)) as it's probably the source of the problem. Same below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your last proposal seems to work fine. I'm just running tests locally before pushing. Indeed, a Float to Int conversion due to my implementation was the reason for a failing doc test.

src/extras.jl Outdated
Comment on lines 192 to 199
levs = [labels(from[1], to[1], 1,
leftclosed=breaks[1] != breaks[2], rightclosed=false)]
resize!(levs, n-1)
_L = eltype(levs)
for i in 2:n-2
levs[i] = labels(from[i], to[i], i,
leftclosed=breaks[i] != breaks[i+1], rightclosed=false)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. That's weird, but I guess using the type of the first value is OK as we don't expect people to use multiple types anyway.

skleinbo and others added 3 commits May 23, 2022 07:31
@nalimilan nalimilan marked this pull request as ready for review May 23, 2022 19:48
@nalimilan nalimilan merged commit a8e4787 into JuliaData:master May 23, 2022
@nalimilan
Copy link
Member

Thanks!

@nalimilan
Copy link
Member

@skleinbo
Copy link
Contributor Author

Thank you for patient reviews :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants