implement nsplit argument in Array.split_axes with default="auto" (raise if variable) #1078

When the number of separators is not the same in all labels, labels with **more** separators have their last part silently dropped, and any label that becomes a duplicate as a result is silently dropped too (the last duplicate seems to win).

```python
>>> arr = la.ndtest("a_b=a0_b0,a0_b1_1,a0_b1_2")
>>> arr
a_b  a0_b0  a0_b1_1  a0_b1_2
         0        1        2
>>> arr.split_axes()
a\b   b0   b1
 a0  0.0  2.0
```

Also, when a label has **fewer** separators than expected, it raises a very confusing error.

```python
>>> arr = la.ndtest("a_b=a0_b0,a0b1,a0_b2")
>>> arr
a_b  a0_b0  a0b1  a0_b2
         0     1      2
>>> arr.split_axes()
ValueError: Value {a_b} axis is not present in target subset {a*}. A value can only have the same axes or fewer axes than the subset being targeted
```

Both issues arise because the `zip(*sequences)` idiom that we use silently truncates all sequences to the length of the shortest one.
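The truncation can be reproduced in plain Python, without larray. This is only a minimal sketch: it uses `str.split` in place of `np.char.split` and reuses the labels from the two examples above.

```python
# Labels from the first example: two labels have an extra separator.
extra = [label.split("_") for label in ["a0_b0", "a0_b1_1", "a0_b1_2"]]
# zip stops at the shortest sequence (2 parts), so the trailing
# '1' and '2' parts never reach the output.
extra_columns = list(zip(*extra))
print(extra_columns)  # [('a0', 'a0', 'a0'), ('b0', 'b1', 'b1')]

# Labels from the second example: 'a0b1' has no separator at all.
fewer = [label.split("_") for label in ["a0_b0", "a0b1", "a0_b2"]]
# The shortest sequence now has a single part, so only one column survives.
fewer_columns = list(zip(*fewer))
print(fewer_columns)  # [('a0', 'a0b1', 'a0')]
```

In the first case the distinct labels `a0_b1_1` and `a0_b1_2` both collapse to `('a0', 'b1')`, which matches the silent duplicate seen in the output of `split_axes()`.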

The code of Axis.split is roughly like this:

```python
>>> split_labels = np.char.split(arr.a_b.labels, '_')
>>> indexing_labels = list(zip(*split_labels))
>>> indexing_labels
[('a0', 'a0', 'a0'), ('b0', 'b1', 'b1')]
>>> split_axes = [Axis(unique_list(ax_labels), name) for ax_labels, name in zip(indexing_labels, names)]
```

Our options are:
1. raise an error if too few or too many separators are detected. Safest (zero chance of bad labels passing unnoticed) but not practical (it leaves the problem to the user).
1. use maxsplit=len(names) - 1 in np.char.split. This would solve the **extra** separators issue but not the **fewer** separators issue. It could also mask a problem if some labels contain extra separators that the user does not know about. I can live with the latter problem though.
1. or a configurable maxsplit, defaulting to the above. More flexible, but with the same problems as the previous option.
1. or force splitting using the maximum number of separators (i.e. complete the sequences with as many '' as necessary). When some labels have too many separators, this will probably not be the result the user wants in most cases.
1. force splitting using len(names) - 1 (i.e. use maxsplit but then complete the sequences that are too short).
1. configurable force splitting (e.g. `nsplit` argument), using len(names) - 1 as default.
1. configurable force splitting, but with a default which raises if not all sequences have the same length. I hesitate with this option. **The selected option**
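To make the selected option more concrete, here is a rough sketch in plain Python. It is not larray's actual implementation: the helper name `split_labels`, its signature, and the exact meaning of `nsplit` (maximum number of splits, with short results padded with `''`) are assumptions made for illustration only.

```python
def split_labels(labels, sep="_", nsplit="auto"):
    """Hypothetical helper sketching the selected option (not larray code).

    nsplit="auto": split on every separator, but raise if labels do not
    all contain the same number of separators.
    nsplit=<int>: split each label at most nsplit times and pad short
    results with '' so that every label yields nsplit + 1 parts.
    """
    if nsplit == "auto":
        counts = {label.count(sep) for label in labels}
        if len(counts) > 1:
            raise ValueError(
                f"labels have a variable number of {sep!r} separators "
                f"({sorted(counts)}); pass nsplit explicitly"
            )
        return [tuple(label.split(sep)) for label in labels]

    result = []
    for label in labels:
        # extra separators stay inside the last part
        parts = label.split(sep, nsplit)
        # too few separators: complete with empty strings
        parts += [""] * (nsplit + 1 - len(parts))
        result.append(tuple(parts))
    return result


consistent = split_labels(["a0_b0", "a0_b1"])
forced = split_labels(["a0_b0", "a0_b1_1", "a0b2"], nsplit=1)
print(consistent)  # [('a0', 'b0'), ('a0', 'b1')]
print(forced)      # [('a0', 'b0'), ('a0', 'b1_1'), ('a0b2', '')]
```

With this shape, the default never lets inconsistent labels pass unnoticed, while an explicit `nsplit` gives the forced behaviour of the other options to users who ask for it.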

