Skip to content

ENH: Introduce new functions mean_std and mean_var. #23889

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from

Conversation

RonaldAJ
Copy link
Contributor

@RonaldAJ RonaldAJ commented Jun 6, 2023

ENH: Introduce two new functions that return:

  • both the standard deviation and the mean: mean_std
  • both the variance and the mean: mean_var.

Often when the standard deviation is needed the mean is also needed; the same holds for the variance and the mean. With the current code the mean is then calculated twice, this can be prevented if the functions calculating the variance or the standard deviation also return the obtained mean. See ticket #23741. These changes can be largely independent from those for ticket #13199, which suggests reducing memory usage by avoiding storing data-mean in a separate intermediate array.

Note:
fromnumeric.py is littered with constructions like:

    if type(a) is not mu.ndarray:
        try:
            var = a.var
        except AttributeError:
            pass
        else:
            return var(axis=axis, dtype=dtype, out=out, ddof=ddof, **kwargs)

This looks like untested defensive coding, and because the data object is not passed to the method it is probably dysfunctional. I copied it, but I think it can go. Which would be my preferred way of improving the code coverage here.

In fact this is where the alternative implementation for MaskedArrays gets called.

RonaldAJ added 2 commits June 6, 2023 14:40
…and the mean: mean_std and the variance and the mean: mean_var.

IN the current setup the mean is often calculated twice, this can be prevented if the functions calculating the variance of the standard deviation also return the mean,
…and the mean: mean_std and the variance and the mean: mean_var.

IN the current setup the mean is often calculated twice, this can be prevented if the functions calculating the variance of the standard deviation also return the mean,

Fix typo's, minor problems and add tests to increase coverage.

Note:
fromnumeric.py is littered with constructions like:

```Python
    if type(a) is not mu.ndarray:
        try:
            var = a.var
        except AttributeError:
            pass
        else:
            return var(axis=axis, dtype=dtype, out=out, ddof=ddof, **kwargs)
```
This looks like untested defensive coding, and because the data object is not passed to the method it is disfunctional. I copied it, but I think it can go. Which would be my preferred way of improving the code coverage here.
@RonaldAJ
Copy link
Contributor Author

RonaldAJ commented Jun 8, 2023

I don't know how to address the azure-pipeline problems.

Chocolatey has deprecated the `--side-by-side` option and suggest
either `--force` or `--allow-downgrade` instead. IIRC, we started
with `--force` and it failed.

(cherry picked from commit c055ee0)
@mattip
Copy link
Member

mattip commented Jun 11, 2023

See also the discussion on the mailing list and on the issue #23741.

@mattip
Copy link
Member

mattip commented Jun 11, 2023

There was also a suggestion for an implementation using ufunc loops in a comment to issue 13199

@RonaldAJ
Copy link
Contributor Author

The discussion on ufuncs is mostly independent. I moved most code from the _var function to _mean_var. The ufuncs discussion was focusing on rewriting _var, now that would have to become a rewrite of _mean_var and all the benefits of a rewrite would also benefit mean_var, var, std, mean_std and std as they all call _mean_var.

@InessaPawson InessaPawson added the triage review Issue/PR to be discussed at the next triage meeting label Jun 15, 2023
@RonaldAJ
Copy link
Contributor Author

Short note after today's triage review:

  • Using a keyword to the existing var and std functions to pass in a precalculated mean is preferred over calculating the mean inside the function and getting it out. This would serve slightly more use cases and not introduce new functions in the API.
  • For the first implementation getting the mean in the correct shape for broadcasting against the data could be regarded the responsibility of the calling code.

Also it would be good to see whether this has been discussed at the array API site: https://data-apis.org/array-api/latest/

@seberg seberg removed the triage review Issue/PR to be discussed at the next triage meeting label Jul 3, 2023
@RonaldAJ
Copy link
Contributor Author

RonaldAJ commented Jul 3, 2023

Also it would be good to see whether this has been discussed at the array API site: https://data-apis.org/array-api/latest/

The discussion on standardizing the statistical functions doesn't mention mean_std or mean_var or variants thereof. Also there is no mention of passing the mean into the calculation of var or std. It is mentioned that keywords can be somewhat freely added. Passing in the mean is consistent with that.

@seberg Would it make sense to create an issue for this?

@seberg
Copy link
Member

seberg commented Jul 3, 2023

@seberg Would it make sense to create an issue for this?

No. First, it seems like its super low priority: I doubt there are a host of libraries itching to use it (or needing it). Second, nobody has it in practice, so there is no prior art for them to go on.

@RonaldAJ
Copy link
Contributor Author

RonaldAJ commented Jul 4, 2023

For the solution as discussed at the triage review it makes sense (to me at least) to close this PR and start a new one referencing this one.

To implement that solution based on what is included in this PR a lot of work needs to be undone, while making those changes on top of the main branch requires in my view minor changes both in code and documentation.

@RonaldAJ RonaldAJ closed this Jul 5, 2023
@RonaldAJ
Copy link
Contributor Author

RonaldAJ commented Jul 5, 2023

Instead of this solution a new PR will be opened to add the mean keyword to var and std.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

4 participants