better support for duplicate labels (on the same axis) #1126
In fact, duplicate labels on <= 2D arrays are kinda hit-or-miss:

>>> arr2d = ndtest("a=a0,a1;b=x,x")
>>> arr2d.to_csv('arr2d.csv')
>>> read_csv('arr2d.csv')
a\b x x.1
a0 0 1
a1 2 3
>>> arr2d.to_excel('test.xlsx', 'arr2d')
>>> read_excel('test.xlsx', 'arr2d')
a\b x x
a0 0 1
a1 2 3
>>> arr2d.to_hdf('test.h5', 'arr2d')
ValueError: Columns index has to be unique for fixed format
>>> arr2d_t = arr2d.T
>>> arr2d_t.to_hdf('test.h5', 'arr2d_t')
>>> read_hdf('test.h5', 'arr2d_t')
b\a a0 a1
x 0 2
x 1 3

More testing is necessary (other combinations of duplicates in columns vs rows vs number of dimensions), but the conclusion is that duplicate labels are very poorly supported.
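As a first step towards mapping out that support, a helper that merely *detects* duplicate labels on an axis would be useful for both tests and runtime checks. The following is a minimal pure-Python sketch; `duplicate_labels` is a hypothetical name, not part of larray's API:

```python
from collections import Counter

def duplicate_labels(labels):
    """Return the labels that occur more than once, in first-seen order.

    `labels` is any sequence of axis labels, e.g. ['x', 'x'] for the
    b axis of arr2d above.
    """
    counts = Counter(labels)
    return [label for label, n in counts.items() if n > 1]

print(duplicate_labels(['x', 'x']))    # ['x']
print(duplicate_labels(['a0', 'a1']))  # []
```

With such a check available, each export/import path could decide explicitly how to handle the duplicates instead of failing in format-specific ways.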
Full support for duplicate labels would be nice, but is probably hard to achieve. As an intermediate goal, I would be content with only supporting: loading data with duplicate labels as an array, changing those labels to deduplicate them (assuming we do not do it automatically in the loading code), and writing such arrays (some output may require it). In that case, we should make it clear in the documentation that this is all we support, and ideally add a warning when trying to do any other operation on such an array.
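The "deduplicate on load" option could rename duplicates with numeric suffixes, the way pandas' read_csv already does ('x' becoming 'x.1'). A minimal sketch of such a renaming pass (a hypothetical helper, not larray code):

```python
def deduplicate(labels):
    """Rename duplicate labels with numeric suffixes: ['x', 'x'] -> ['x', 'x.1'].

    Note: this simple scheme can itself collide if a label named 'x.1'
    already exists; real code would need to guard against that.
    """
    seen = {}
    result = []
    for label in labels:
        n = seen.get(label, 0)
        result.append(label if n == 0 else f"{label}.{n}")
        seen[label] = n + 1
    return result

print(deduplicate(['x', 'x', 'x']))  # ['x', 'x.1', 'x.2']
```

Doing this automatically on load would at least make every file readable; whether to do it silently or behind an explicit option is a separate design decision.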
I consider the lack of warning/documentation about this a bug, and the actual support an enhancement, hence the double classification 😉
see also #444
yvda bumped into this problem again, hence the priority bump. Short of better support (via actual loading or automatic deduplication), a clear error message when reading such a file from HDF would already go a long way towards improving the user experience. A warning (or even an error?) when writing would also help (it does not make sense to allow writing a file we cannot read back), but we cannot block all exports with duplicates because we need them in some final/formatted output to Excel.
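The write-time check could look something like the sketch below. Both `check_unique_labels` and its `action` parameter are hypothetical names for illustration; the point is raising a clear, actionable message for HDF while only warning for Excel/CSV exports, where duplicates must remain allowed:

```python
import warnings

def check_unique_labels(axes, action='raise'):
    """Check that no axis has duplicate labels.

    axes: mapping of axis name -> sequence of labels.
    action: 'raise' for formats that cannot round-trip duplicates (HDF),
            'warn' for formats where duplicates are allowed but lossy.
    """
    for name, labels in axes.items():
        labels = list(labels)
        dupes = sorted({lbl for lbl in labels if labels.count(lbl) > 1})
        if dupes:
            msg = (f"axis {name!r} has duplicate label(s) {dupes}; "
                   f"this cannot be read back -- rename the labels "
                   f"before exporting")
            if action == 'raise':
                raise ValueError(msg)
            warnings.warn(msg)

check_unique_labels({'a': ['a0', 'a1']})  # unique labels: no error
try:
    check_unique_labels({'b': ['x', 'x']})
except ValueError as e:
    print(e)  # clear message instead of an obscure pytables error
```

Calling such a check at the top of to_hdf (with action='raise') and of to_excel/to_csv (with action='warn') would cover both sides of the trade-off described above.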
We cannot load >2D arrays with duplicate labels:
For HDF, this is clearly a limitation in larray's code. In pandas.py/index_to_labels, I used the following code:
where unique_list returns the unique labels for that index "level", and that obviously breaks in the presence of duplicate labels.
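To make the failure mode concrete, here is a pure-Python reconstruction of the problem (a simplified stand-in for the unique_list-based code path, not larray's actual implementation): once an index level like ['x', 'x'] is reduced to its unique labels, the positional information collapses and the two columns can no longer be told apart.

```python
def index_to_labels_broken(level_values):
    # reduce the level to its unique labels (insertion order preserved),
    # then map each original value back to a position in that unique list
    unique = list(dict.fromkeys(level_values))
    positions = [unique.index(v) for v in level_values]
    return unique, positions

unique, positions = index_to_labels_broken(['x', 'x'])
print(unique)     # ['x']   -- two columns collapsed into a single label
print(positions)  # [0, 0]  -- both columns now map to the same position
```

Any fix on the HDF path would need to keep the per-position labels (duplicates included) rather than the deduplicated set.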
For CSV and Excel, this is not so clear-cut. It seems to be a limitation in Pandas reindex, and I am unsure we can do anything about it (short of not going via Pandas to load the data).