implement a way to deduplicate labels easily #444
Here is the code Bernard is currently using to aggregate all duplicate labels. We need to make this easier (and more generic obviously).

    def agg_dupl(mat):
        l_c = []
        for i, p in enumerate(mat.prods):
            l_c.append((p + '_' + str(i)) if p in l_c else str(p))
        mat = mat.set_labels(x.prods, l_c)
        agg = tuple(mat.prods.startswith(code) >> code for code in mat.prods.matches('^.....$'))
        return mat.sum(agg)

Here is a first step.

    class Axis(object):
        def deduplicate(self):
            # TODO: handle non-string labels (unsure how!? add a user-defined constant?)
            seen = set()
            labels = []
            for i, label in enumerate(self.labels):
                if label in seen:
                    label = '{}_{}'.format(label, i)
                else:
                    seen.add(label)
                labels.append(label)
            return Axis(labels, self.name)

    def agg_dupl(mat):
        mat = mat.set_axes('prods', mat.prods.deduplicate())
        agg = tuple(mat.prods.startswith(code) >> code for code in mat.prods.matches('^.....$'))
        return mat.sum(agg)

Remaining problems:
We should be able to write something like:

    def agg_dupl(mat):
        mat = mat.deduplicate()
        agg = tuple(mat.prods.startingwith(code) >> code for code in mat.prods[~mat.prods.endswith('/DUPE/')])
        return mat.sum(agg)

or even

    def agg_dupl(mat):
        mat = mat.deduplicate()
        agg = tuple(mat.prods.startingwith(code) >> code for code in ~mat.prods.endingwith('/DUPE/'))
        return mat.sum(agg)

Remaining problems:
This particular problem (aggregating all duplicates) can be solved much more elegantly by a groupby.

    agg = mat.groupby(mat.prods).sum()
    # and/or
    agg = mat.sum(mat.prods.groupunique())

The question is then whether de-duplication has value outside of aggregating the duplicates. I can see some cases where this is true, but it is lower priority I think.
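To make the groupby idea concrete, here is a minimal sketch in plain Python of what "sum everything that shares a label" amounts to. It does not use the library's `groupby` or `groupunique`; the function name `sum_duplicates` and the list-of-rows data layout are assumptions made only for illustration.

```python
def sum_duplicates(labels, rows):
    """Sum the rows that share a label, keeping labels in first-seen order.

    `labels` is a list of (hashable) axis labels and `rows` a list of
    equally long numeric lists, one row per label. Both names are
    hypothetical and not part of any library API.
    """
    totals = {}  # dicts preserve insertion order on Python 3.7+
    for label, row in zip(labels, rows):
        if label in totals:
            # Duplicate label: add this row element-wise to the running total.
            totals[label] = [a + b for a, b in zip(totals[label], row)]
        else:
            # First occurrence: start a new total (copy to avoid aliasing).
            totals[label] = list(row)
    return list(totals), list(totals.values())


labels = ['A0001', 'B0001', 'A0001', 'C0001', 'B0001']
rows = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
unique_labels, summed = sum_duplicates(labels, rows)
print(unique_labels)  # ['A0001', 'B0001', 'C0001']
print(summed)         # [[4, 40], [7, 70], [4, 40]]
```

Grouping on the label itself sidesteps the rename-then-match-by-prefix round trip entirely, so no suffix convention is needed for this use case.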
…-project#444) implemented LArray.groupby, Grid, _grid_aggregate implemented a lot of misc stuff to extract ! * bugfix for arr['a.i[0]']) * performance improvements all over the place some design work and thoughts on how to go forward
…-project#444) implemented LArray.groupby, Grid, _grid_aggregate implemented a lot of misc stuff to extract ! * performance improvements all over the place some design work and thoughts on how to go forward
This issue is about appending a suffix (e.g. _1, _2) to duplicate labels, but the end goal is to make "aggregate all duplicate labels together" as easy as possible.
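As a rough sketch of the suffixing itself, the function below works on a plain list of labels rather than on an `Axis`. It is only one possible scheme: the name `deduplicate`, the `sep` parameter and the per-label counter are assumptions, and it skips any candidate that would collide with a label already present (for example an existing `a_1`).

```python
def deduplicate(labels, sep='_'):
    """Return a list of unique labels, suffixing repeats with sep + counter.

    Hypothetical helper for illustration only. Repeated labels are turned
    into strings by formatting, so non-string duplicates change type.
    """
    taken = set(labels)  # reserve every original label up front
    seen = set()         # labels already emitted unchanged
    counts = {}          # last counter used for each repeated label
    result = []
    for label in labels:
        if label not in seen:
            # First occurrence keeps its label as-is.
            seen.add(label)
            result.append(label)
            continue
        # Repeat: bump the counter until the candidate is not in use.
        n = counts.get(label, 0)
        while True:
            n += 1
            candidate = '{}{}{}'.format(label, sep, n)
            if candidate not in taken:
                break
        counts[label] = n
        taken.add(candidate)
        result.append(candidate)
    return result


print(deduplicate(['x', 'y', 'x', 'x']))            # ['x', 'y', 'x_1', 'x_2']
print(deduplicate(['a', 'b', 'a', 'a', 'a_1']))     # ['a', 'b', 'a_2', 'a_3', 'a_1']
```

Unlike a position-based suffix, a per-label counter gives names that do not depend on where the duplicate sits in the axis, at the cost of a little extra bookkeeping.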