implement a way to deduplicate labels easily #444

Open
gdementen opened this issue Sep 22, 2017 · 2 comments
@gdementen
Contributor

This issue is about appending a suffix (e.g. _1, _2) to duplicate labels, but the end goal is to make aggregating all duplicate labels together as easy as possible.

@gdementen
Contributor Author

Here is the code Bernard is currently using to aggregate all duplicate labels. We need to make this easier (and more generic, obviously).

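def agg_dupl(mat):
    # rename each repeated product code to '<code>_<position>' so that all labels are unique
    l_c = []
    for i, p in enumerate(mat.prods):
        l_c.append((p + '_' + str(i)) if p in l_c else str(p))
    mat = mat.set_labels(mat.prods, l_c)
    # one group per original 5-character code: the code itself plus its renamed duplicates
    agg = tuple(mat.prods.startswith(code) >> code for code in mat.prods.matches('^.....$'))
    return mat.sum(agg)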

Here is a first step.

class Axis(object):
    def deduplicate(self):
        # TODO: handle non-string labels (unsure how!? add a user-defined constant?)
        seen = set()
        labels = []
        for i, label in enumerate(self.labels):
            if label in seen:
                label = '{}_{}'.format(label, i)
            else:
                seen.add(label)
            labels.append(label)
        return Axis(labels, self.name)

def agg_dupl(mat):
    mat = mat.set_axes('prods', mat.prods.deduplicate())
    agg = tuple(mat.prods.startswith(code) >> code for code in mat.prods.matches('^.....$'))
    return mat.sum(agg)
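
To make the renaming rule concrete, here is a minimal sketch of the same loop on a plain Python list, outside of larray. The function name `deduplicate_labels` is hypothetical. It mirrors the `Axis.deduplicate` draft above, where the suffix is the position of the duplicate on the axis rather than an occurrence counter.

```python
def deduplicate_labels(labels):
    """Append '_<position>' to every label that already appeared earlier.

    Hypothetical helper for illustration only, not part of larray.
    """
    seen = set()
    result = []
    for i, label in enumerate(labels):
        if label in seen:
            # duplicate: make it unique using its position on the axis
            label = '{}_{}'.format(label, i)
        else:
            seen.add(label)
        result.append(label)
    return result


print(deduplicate_labels(['A0101', 'A0102', 'A0101', 'A0101']))
# ['A0101', 'A0102', 'A0101_2', 'A0101_3']
```

Like the draft, this sketch does not guard against a generated label colliding with a label that was already present (for example an original 'A0101_2').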

Remaining problems:

  • the aggregation relies on the fact that his codes always have exactly the same length (five characters, per the '^.....$' pattern)
  • we should provide a way to "deduplicate()" all axes at once

We should be able to write something like:

def agg_dupl(mat):
    mat = mat.deduplicate()
    agg = tuple(mat.prods.startingwith(code) >> code for code in mat.prods[~mat.prods.endswith('/DUPE/')])
    return mat.sum(agg)

or even

def agg_dupl(mat):
    mat = mat.deduplicate()
    agg = tuple(mat.prods.startingwith(code) >> code for code in ~mat.prods.endingwith('/DUPE/'))
    return mat.sum(agg)

Remaining problems:

  • ideally, the .deduplicate() step should not be necessary: it should be possible to easily create a position-based aggregation directly.
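
As a rough illustration of what a position-based aggregation could start from, the sketch below collects, for each label, the positions at which it occurs, so that no renaming is needed. It works on a plain Python list and the function name `positions_by_label` is an assumption made for this example, not an larray API.

```python
from collections import defaultdict


def positions_by_label(labels):
    """Map each distinct label to the list of positions where it occurs.

    Hypothetical helper for illustration only, not part of larray.
    """
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    # return a regular dict, in order of first occurrence (Python 3.7+)
    return dict(groups)


print(positions_by_label(['A0101', 'A0102', 'A0101']))
# {'A0101': [0, 2], 'A0102': [1]}
```

Each list of positions could then be turned into one positional group per label, which is what the `.deduplicate()` step emulates through renaming.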

@alixdamman alixdamman added this to the 0.30 milestone Jan 25, 2018
@gdementen
Contributor Author

This particular problem (aggregating all duplicates) can be solved much more elegantly by a groupby.

agg = mat.groupby(mat.prods).sum()
# and/or
agg = mat.sum(mat.prods.groupunique())

The question is then whether de-duplication has value outside of aggregating the duplicates. I can see some cases where this is true, but I think it is lower priority.
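
For readers unfamiliar with the idea, the result expected from a groupby-style sum over duplicate labels can be sketched in plain Python for the one-dimensional case. This is only an illustration of the semantics under that assumption. `sum_by_label` is a hypothetical name, and the `groupby` and `groupunique` calls in this comment are proposals from this thread rather than confirmed API.

```python
def sum_by_label(labels, values):
    """Sum the values that share the same label (1-D illustration).

    Hypothetical helper, not part of larray.
    """
    totals = {}
    for label, value in zip(labels, values):
        totals[label] = totals.get(label, 0) + value
    return totals


print(sum_by_label(['A0101', 'A0102', 'A0101'], [10, 20, 5]))
# {'A0101': 15, 'A0102': 20}
```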

@gdementen gdementen self-assigned this Mar 23, 2018
gdementen added a commit to gdementen/larray that referenced this issue Jul 20, 2018
…-project#444)

implemented LArray.groupby, Grid, _grid_aggregate
implemented a lot of misc stuff to extract !

* bugfix for arr['a.i[0]']
* performance improvements all over the place

some design work and thoughts on how to go forward
gdementen added a commit to gdementen/larray that referenced this issue Jul 27, 2018
…-project#444)

implemented LArray.groupby, Grid, _grid_aggregate
implemented a lot of misc stuff to extract !

* performance improvements all over the place

some design work and thoughts on how to go forward
@gdementen gdementen removed their assignment Aug 23, 2018
@alixdamman alixdamman modified the milestones: 0.30, 0.32 Oct 16, 2018
@alixdamman alixdamman removed this from the 0.32 milestone Aug 1, 2019
@alixdamman alixdamman added this to the nice_to_have milestone Oct 10, 2019