better support for duplicate labels (on the same axis) #1126

Open
gdementen opened this issue Dec 11, 2024 · 5 comments

Comments

@gdementen
Contributor

gdementen commented Dec 11, 2024

We cannot load arrays with more than two dimensions when an axis has duplicate labels:

arr = ndtest("a=a0,a1;b=x,x;c=c0,c1")

arr.to_hdf('test.h5', 'arr')
arr = read_hdf('test.h5', 'arr')
ValueError: cannot reshape array of size 8 into shape (2,1,2)

arr.to_csv('test.csv')
arr = read_csv('test.csv')
ValueError: cannot handle a non-unique multi-index!

arr.to_excel('test.xlsx')
arr = read_excel('test.xlsx')
ValueError: cannot handle a non-unique multi-index!

For HDF, this is clearly a limitation in larray's code. In pandas.py/index_to_labels, I used the following code:

return [unique_list(idx.get_level_values(label)) for label in range(idx.nlevels)]

where unique_list returns the unique labels for that index "level", and that obviously breaks in the presence of duplicate labels.
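To illustrate the failure mode, here is a minimal pure-Python sketch (no pandas or larray involved). The `unique_list` below is a stand-in written for this example, not larray's actual helper, and the row layout (the `a` and `b` axes in the index, `c` in the columns) is an assumption about how the 3D array is stored. Taking unique labels per level yields a shape that holds fewer cells than the data, which matches the reshape error above.

```python
from math import prod


def unique_list(values):
    """Return values without duplicates, keeping first-seen order.

    Stand-in for illustration only, not larray's implementation.
    """
    seen = set()
    result = []
    for v in values:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result


# Assumed layout for ndtest("a=a0,a1;b=x,x;c=c0,c1"):
# one row per (a, b) combination, one column per c label.
row_labels = [("a0", "x"), ("a0", "x"), ("a1", "x"), ("a1", "x")]
column_labels = ["c0", "c1"]

# Same idea as the list comprehension quoted above: unique labels per level.
level_labels = [unique_list(row[i] for row in row_labels) for i in range(2)]

inferred_shape = tuple(len(labels) for labels in level_labels) + (len(column_labels),)
data_size = len(row_labels) * len(column_labels)

print(inferred_shape)        # (2, 1, 2)
print(prod(inferred_shape))  # 4 cells according to the labels
print(data_size)             # 8 values actually stored
```

The duplicate `x` collapses the `b` level to a single label, so the inferred shape only accounts for half of the stored values.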

For CSV and Excel, this is not so clear-cut. This seems to be a limitation in pandas' reindex, and I am unsure whether we can do anything about that (except not going via pandas to load data).

@gdementen gdementen added the bug label Dec 11, 2024
@gdementen
Contributor Author

In fact, duplicate labels on <= 2D arrays are kinda hit-or-miss:

>>> arr2d = ndtest("a=a0,a1;b=x,x")
>>> arr2d.to_csv('arr2d.csv')
>>> read_csv('arr2d.csv')
a\b  x  x.1
 a0  0    1
 a1  2    3

>>> arr2d.to_excel('test.xlsx', 'arr2d')
>>> read_excel('test.xlsx', 'arr2d')
a\b  x  x
 a0  0  1
 a1  2  3

>>> arr2d.to_hdf('test.h5', 'arr2d')
ValueError: Columns index has to be unique for fixed format

>>> arr2d_t = arr2d.T
>>> arr2d_t.to_hdf('test.h5', 'arr2d_t')
>>> read_hdf('test.h5', 'arr2d_t')
b\a  a0  a1
  x   0   2
  x   1   3

More testing is necessary (other combinations of duplicates in columns vs rows vs number of dimensions), but the conclusion is that duplicate labels are very poorly supported.

@gdementen gdementen changed the title cannot load >2D array with duplicate labels better support for duplicate labels Dec 11, 2024
@gdementen
Contributor Author

gdementen commented Dec 11, 2024

Full support for duplicate labels would be nice, but is probably hard to achieve. As an intermediate goal, I would be content with supporting only three operations on such arrays: loading data with duplicate labels as an array, changing those labels (to deduplicate them, assuming we do not do it automatically in the loading code), and writing such arrays (some outputs may require duplicates). In that case, the documentation should make clear that this is all we support, and ideally we should emit a warning when any other operation is attempted on such an array.
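As a rough idea of what the "change those labels to deduplicate them" step could look like, here is a pure-Python sketch. The function name `deduplicate_labels` and the `.N` suffix convention are assumptions made for this example (the suffix mirrors the `x.1` seen in the `read_csv` output above); nothing like this exists in larray as far as this thread shows.

```python
def deduplicate_labels(labels, sep="."):
    """Return a list of unique labels, suffixing repeats with sep + counter.

    Hypothetical helper for illustration. Suffixed labels become strings,
    even if the original labels were not.
    """
    used = set()     # labels already emitted
    counts = {}      # last suffix number tried for each original label
    result = []
    for label in labels:
        candidate = label
        n = counts.get(label, 0)
        # Bump the counter until the candidate does not clash with
        # anything already emitted (including pre-existing "x.1" labels).
        while candidate in used:
            n += 1
            candidate = f"{label}{sep}{n}"
        counts[label] = n
        used.add(candidate)
        result.append(candidate)
    return result


print(deduplicate_labels(["x", "x"]))         # ['x', 'x.1']
print(deduplicate_labels(["x", "x.1", "x"]))  # ['x', 'x.1', 'x.2']
```

Whether this should happen automatically at load time or only on request is exactly the open question above; the sketch only shows that the renaming itself is cheap.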

@gdementen
Contributor Author

I consider the lack of warning/documentation about this a bug, and the actual support an enhancement, hence the double classification 😉

@gdementen gdementen changed the title better support for duplicate labels better support for duplicate labels (on the same axis) Apr 7, 2025
@gdementen
Contributor Author

see also #444

@gdementen
Contributor Author

yvda bumped into this problem again, hence the priority bump. Short of better support (via actual loading or automatic deduplication), a clear error message when reading such a file from HDF would already go a long way to improve the user experience. A warning (or even an error?) when writing would also help (it does not make sense to allow writing a file we cannot read back), but we cannot block all exports with duplicates because we need them in some final/formatted output to Excel.
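A possible shape for that check, as a pure-Python sketch: it takes a plain mapping of axis name to labels rather than a real larray object, and the names `find_duplicate_labels`, `check_duplicate_labels` and the `on_duplicate` parameter are all hypothetical. Writers could call it in "warn" mode (so formatted Excel exports still go through), while the HDF reader could call it in "raise" mode to produce a clear message.

```python
import warnings
from collections import Counter


def find_duplicate_labels(axes):
    """Map each axis name to the labels that occur more than once on it."""
    duplicates = {}
    for name, labels in axes.items():
        repeated = [label for label, n in Counter(labels).items() if n > 1]
        if repeated:
            duplicates[name] = repeated
    return duplicates


def check_duplicate_labels(axes, on_duplicate="warn"):
    """Warn or raise if any axis has duplicate labels (hypothetical helper).

    on_duplicate: "warn" emits a UserWarning, "raise" raises ValueError.
    """
    duplicates = find_duplicate_labels(axes)
    if not duplicates:
        return duplicates
    details = "; ".join(f"axis {name!r}: {labels}"
                        for name, labels in duplicates.items())
    message = f"duplicate labels found ({details})"
    if on_duplicate == "raise":
        raise ValueError(message)
    warnings.warn(message, UserWarning, stacklevel=2)
    return duplicates


axes = {"a": ["a0", "a1"], "b": ["x", "x"], "c": ["c0", "c1"]}
print(find_duplicate_labels(axes))  # {'b': ['x']}
```

This keeps the policy decision (warn vs. error) at the call site, so each export format can pick its own behaviour.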

@gdementen gdementen added this to the 0.35 milestone May 13, 2025