Allow strings and numbers as labels in cut #393

skleinbo · 2022-05-20T14:18:45Z

Referring to https://discourse.julialang.org/t/creating-data-bins-with-numeric-labels-with-cut/81281/10?u=skleinbo

Inference seems to be fine. Tests pass on Julia 1.7.

Three things:

1. There is a test error on 1.6, but that is due to Base.return_types not inferring correctly. The actual return type is as expected.
   That's not ideal. Should the test be taken out, somehow altered to pass on older Julia versions (don't know how tbh), or executed conditionally?
2. Technically, mixed labels are possible when passed as an AbstractVector{<:SupportedTypes}. E.g. just passing [1,"2",3]::Vector{Any} fails, while Union{Int, String}[1,"2",3] is fine. Should an effort be made to automatically do a conversion? Given the IMHO obscurity of mixed labels, should this even be mentioned in the docs? I say no on both points.
3. A label formatter must always return the same concrete type. That's a disparity to 2. and one could jump through the same hoops to allow for mixed labels, but is it worth it?
4. The current cut implicitly casts labels of any AbstractString to String. While not promised, someone downstream might rely on that conversion.
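
The mixed-label point above comes down to the element type Julia infers for a vector literal. A minimal sketch using only Base, assuming `SupportedTypes` is the alias described in the contract further down this thread (it is redefined here only to keep the snippet self-contained):

```julia
# Assumption: mirrors the SupportedTypes alias described in this thread.
const SupportedTypes = Union{AbstractString, AbstractChar, Number}

untyped = [1, "2", 3]                   # literal with mixed values
typed = Union{Int, String}[1, "2", 3]   # element type given explicitly

# The untyped literal widens to Vector{Any}, which is not a subtype of
# AbstractVector{<:SupportedTypes}.
println(eltype(untyped))
println(untyped isa AbstractVector{<:SupportedTypes})

# The explicitly typed vector keeps a small Union element type and matches.
println(eltype(typed))
println(typed isa AbstractVector{<:SupportedTypes})
```

This is why the caller has to annotate the element type for mixed labels unless `cut` performs a conversion itself.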

Thanks!

bkamins · 2022-05-20T14:36:23Z

Could you please add to the docstring (or comment here) exactly what contract change you propose, i.e. the exact rules for which inputs are accepted and what outputs they should produce. Thank you!

skleinbo · 2022-05-20T16:10:52Z

I'm sorry! Guess I jumped the gun there a bit.

Let SupportedTypes = Union{AbstractString, AbstractChar, Number}.

Currently: cut takes a labels::Union{AbstractVector{<:AbstractString}, Function::AbstractString} kwarg and returns a CategoricalArray with levels of type String or Union{String, Missing} depending on whether the input array has missing values or extend=missing is given.

Proposed: Allow cut to take more general labels::Union{AbstractVector{L}, Function::L} where L<:SupportedTypes and return a CategoricalArray with levels of type L or Union{L, Missing} depending on whether the input array has missing values or extend=missing is given.
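
To make the proposed contract concrete, here is a hedged sketch of how numeric labels might be used. It assumes a CategoricalArrays.jl release that includes this change; the data, breaks, and midpoint labels are made up for illustration.

```julia
using CategoricalArrays

x = [0.2, 1.4, 2.9]        # illustrative data
breaks = [0, 1, 2, 3]      # three intervals: [0, 1), [1, 2), [2, 3)
mids = [0.5, 1.5, 2.5]     # one numeric label per interval

# Under the proposed contract the levels keep the element type of `labels`
# (Float64 here) instead of being converted to String.
c = cut(x, breaks; labels=mids)

# unwrap extracts the underlying label value of each element, so the
# result can be used in ordinary arithmetic.
vals = unwrap.(c)
println(vals .+ 1)
```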

It appears the limitation to String levels was introduced due to stability issues in the return type. As far as I can tell, those do not exist in more recent versions of Julia.

The change would mean closer feature parity with pandas.cut

In light of that, my above comments hopefully make more sense.

There is one thing that could potentially break something downstream. I've amended the original comment to keep bullet points together.

bkamins · 2022-05-20T16:19:13Z

OK - so I understand the use-case is to e.g. assign the floating-point midpoint of the interval as a level, so that the user can later work with it programmatically. Right?

skleinbo · 2022-05-20T16:26:13Z

Exactly. I saw the (linked above) topic on discourse and wondered why that wasn't supported. Maybe you know of a good reason why not.

bkamins · 2022-05-20T17:41:01Z

@nalimilan designed it, but I guess the reason was that cut values are naturally considered to be labels. However, I agree that sometimes it is useful to be able to work with them programmatically later.

nalimilan

Thanks! I think the points you list are OK. I just wonder whether we can preserve inferrability.

nalimilan · 2022-05-20T18:46:06Z

src/extras.jl

+        levs = [labels(from[1], to[1], 1,
+            leftclosed=breaks[1] != breaks[2], rightclosed=false)]
+        resize!(levs, n-1)
+        _L = eltype(levs)
+        for i in 2:n-2
            levs[i] = labels(from[i], to[i], i,
                             leftclosed=breaks[i] != breaks[i+1], rightclosed=false)
        end


This approach is a bit weird. Can you instead try something along the lines of this?

levs = [i <= 2 ? ... : ... for i in 1:n]

It reads very clunkily, and the resizing is certainly not optimal. But at least on 1.7 inference works, while it fails with a comprehension. I will keep it close to your original implementation:

    firstlevel = labels(from[1], to[1], 1,
                        leftclosed=breaks[1] != breaks[2], rightclosed=false)
    levs = Vector{typeof(firstlevel)}(undef, n-1)
    levs[begin] = firstlevel
    for i in 2:n-2
        levs[i] = labels(from[i], to[i], i,
                         leftclosed=breaks[i] != breaks[i+1], rightclosed=false)
    end
    levs[end] = labels(from[end], to[end], n-1,
                       leftclosed=breaks[end-1] != breaks[end],
                       rightclosed=coalesce(extend, false))

It seems crucial to typecast levs explicitly. I tried a map too, but that doesn't work either.

OK. That's weird, but I guess using the type of the first value is OK as we don't expect people to use multiple types anyway.
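
The pattern discussed in this thread (take the element type from the first computed label, then fill a pre-allocated vector) can be sketched outside of `cut`. `midpoint_label` and `collect_levels` are hypothetical names, and the formatter signature is simplified to three positional arguments; this is an illustration, not the code that was merged.

```julia
# Hypothetical formatter: returns the midpoint of the interval as a Float64.
midpoint_label(from, to, i) = (from + to) / 2

# Hypothetical helper: the element type of the result is fixed by the first
# label, so every later label must be convertible to that type.
function collect_levels(labelfun, from, to)
    n = length(from)
    first_level = labelfun(from[1], to[1], 1)
    levs = Vector{typeof(first_level)}(undef, n)
    levs[1] = first_level          # index 1 rather than `begin`, for old Julia
    for i in 2:n
        levs[i] = labelfun(from[i], to[i], i)
    end
    return levs
end

levs = collect_levels(midpoint_label, [0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
println(levs)
println(typeof(levs))
```

The trade-off is the one noted above: a formatter that returns different types for different intervals would fail on assignment unless the later values convert to the type of the first.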

bkamins · 2022-05-21T09:01:32Z

Could you please add tests for typically problematic cases like: duplicate numeric labels, or having both -0.0 and 0.0 as labels? Thank you!

skleinbo · 2022-05-21T16:54:03Z

@nalimilan Thank you for the thorough review. I will push changes soon. I think I have tracked down an inference issue and it may just work on Julia 1.6. It had trouble inferring breaks with input arrays that had Missing in their type. I'll have to write more tests first though.

@bkamins Certainly!

skleinbo · 2022-05-22T11:03:16Z

Concerning -0.0, 0.0. This is valid independent of cut

julia> CategoricalArray([-0.0, 0.0, 1.0]; levels=[-0.0, 0.0, 1.0])
3-element CategoricalArray{Float64,1,UInt32}:
 -0.0
 0.0
 1.0

Do you want cut to check for duplicates with == rather than isequal regardless?

bkamins · 2022-05-22T11:14:10Z

Yes, it is valid in CategoricalArrays.jl and expected. I think the same should be respected in this PR (i.e. labels should be unique w.r.t. isequal). I just want it tested.

The point is that cut uses == for values:

julia> cut([-0.0, 0.0], 2)
ERROR: ArgumentError: could not extend breaks as all values are equal: please specify at least two breaks manually

but of course this is an orthogonal issue (it just prompted me to cover this case in tests)
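
For readers less familiar with the distinction, a small Base-only sketch of how `==` and `isequal` treat the two zeros (variable names are illustrative):

```julia
a, b = -0.0, 0.0

# == follows IEEE 754 comparison and treats the two zeros as equal.
println(a == b)

# isequal distinguishes them, which is why both can coexist as levels.
println(isequal(a, b))

# unique compares with isequal, so both zeros survive as distinct values.
println(unique([a, b]))
```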

skleinbo · 2022-05-22T15:58:34Z

> The point is that cut uses == for values:

But only when used as cut(x, n), because then values pass through quantile I guess?
Anyway, orthogonal issue like you say.

@nalimilan Tests on Julia 1.0 failed, because of a begin in indexing, and on x86 because of an Int64 instead of Int in a test. Regarding the former, I find it difficult to remember what syntax was valid at which point. Wasn't there a tool to check for backward compatible syntax?

nalimilan · 2022-05-22T19:58:03Z

> @nalimilan Tests on Julia 1.0 failed, because of a begin in indexing, and on x86 because of an Int64 instead of Int in a test. Regarding the former, I find it difficult to remember what syntax was valid at which point. Wasn't there a tool to check for backward compatible syntax?

I'm not aware of any automated tool. Compat.jl is often useful but it doesn't support all new syntaxes.

nalimilan · 2022-05-22T15:27:20Z

src/extras.jl

@@ -152,10 +160,11 @@ function _cut(x::AbstractArray{T, N}, breaks::AbstractVector,
            end
        end
        if !ismissing(min_x) && breaks[1] > min_x
-            breaks = [min_x; breaks]
+            # this typecast is needed on Julia<1.7 for stable inference
+            breaks = eltype(breaks)[min_x; breaks]


Does this also work? Using eltype(breaks) won't work in all cases, e.g. if breaks are Integer but x contains floats.

Suggested change

-            breaks = eltype(breaks)[min_x; breaks]
+            ET = promote_type(nonmissingtype(eltype(x)), eltype(breaks))
+            breaks = ET[min_x; breaks]

Another solution which would be cleaner if it works would be to just add min_x::nonmissingtype(eltype(x)) as it's probably the source of the problem. Same below.

Your last proposal seems to work fine. I'm just running tests locally before pushing. Indeed, a Float to Int conversion due to my implementation was the reason for a failing doc test.
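
The float-to-integer problem discussed in this thread can be reproduced with Base alone. A minimal sketch with made-up data; `min_x` is set by hand here rather than computed as in `_cut`:

```julia
x = Union{Float64, Missing}[0.5, missing, 2.5]  # illustrative data
breaks = [1, 2, 3]                               # integer breaks
min_x = 0.5                                      # smallest non-missing value of x

# Forcing the element type of `breaks` cannot hold the float minimum.
int_cast_failed = try
    eltype(breaks)[min_x; breaks]
    false
catch err
    err isa InexactError
end
println(int_cast_failed)

# Promoting with the non-missing element type of `x` widens to Float64.
ET = promote_type(nonmissingtype(eltype(x)), eltype(breaks))
extended = ET[min_x; breaks]
println(ET)
println(extended)
```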

nalimilan · 2022-05-22T19:54:47Z

OK. That's weird, but I guess using the type of the first value is OK as we don't expect people to use multiple types anyway.

nalimilan · 2022-05-23T19:48:35Z

Thanks!

nalimilan · 2022-05-23T20:34:49Z

JuliaRegistries/General#60890

skleinbo · 2022-05-24T05:48:32Z

Thank you for patient reviews :)

skleinbo added 2 commits May 20, 2022 14:59

Allow labels of AbstractVector{<:SupportedTypes} in cut.

ddf065b

tests

cb1448e

nalimilan reviewed May 20, 2022

Apply suggestions from code review part I

a0730b9

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

skleinbo added 3 commits May 22, 2022 16:55

type stabiliity

0217f95

nicer levels gathering

more tests

8e397a7

begin -> 1 for Julia 1.0

419f5ed

Int64 -> Int for x86

nalimilan reviewed May 22, 2022

skleinbo and others added 3 commits May 23, 2022 07:31

Apply suggestions from code review part II

3bf98bb

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

type annotation on min_x/max_x rather than breaks

9fc4503

whitespaces

additional test

890d777

nalimilan marked this pull request as ready for review May 23, 2022 19:48

nalimilan merged commit a8e4787 into JuliaData:master May 23, 2022
