From a425211d46d99ed60c9b7ffaacc9dbbd5b988fc5 Mon Sep 17 00:00:00 2001 From: dcherian Date: Wed, 14 Jun 2023 11:31:45 -0600 Subject: [PATCH 01/38] Update and reorganize apply_ufunc material --- .pre-commit-config.yaml | 4 - _toc.yml | 2 + .../automatic-vectorizing-numpy.ipynb | 499 ++++++++++++++++++ .../apply_ufunc/complex-output-numpy.ipynb | 167 ++++++ .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 16 +- .../simple_numpy_apply_ufunc.ipynb | 88 ++- advanced/apply_ufunc/vectorize_1d.ipynb | 22 +- 7 files changed, 758 insertions(+), 40 deletions(-) create mode 100644 advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb create mode 100644 advanced/apply_ufunc/complex-output-numpy.ipynb diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 42fcf502..a6b1f814 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -28,10 +28,6 @@ repos: hooks: - id: flake8 - - repo: https://github.com/asottile/seed-isort-config - rev: v2.2.0 - hooks: - - id: seed-isort-config - repo: https://github.com/PyCQA/isort rev: 5.12.0 hooks: diff --git a/_toc.yml b/_toc.yml index 81c44f23..3773f2ed 100644 --- a/_toc.yml +++ b/_toc.yml @@ -47,6 +47,8 @@ parts: - file: advanced/apply_ufunc/apply_ufunc.md sections: - file: advanced/apply_ufunc/simple_numpy_apply_ufunc + - file: advanced/apply_ufunc/complex-output-numpy + - file: advanced/apply_ufunc/automatic-vectorizing-numpy - file: advanced/apply_ufunc/simple_dask_apply_ufunc - file: advanced/apply_ufunc/vectorize_1d - file: advanced/map_blocks/map_blocks.md diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb new file mode 100644 index 00000000..e1ab1783 --- /dev/null +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -0,0 +1,499 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6849dcdc-3484-4f41-8b23-51613d36812f", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "# Automatic Vectorization" 
+ ] + }, + { + "cell_type": "markdown", + "id": "afc56d28-6e55-4967-b27d-28e2cc539cc7", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "Previously we looked at applying functions on numpy arrays, and the concept of core dimensions.\n", + "We learned that functions commonly support specifying \"core dimensions\" through the `axis` keyword\n", + "argument. However, many functions have implicit core dimensions but do not provide an `axis` keyword\n", + "argument. Applying such functions to an nD array usually involves one or more loops.\n", + "Our goal here is to learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize`\n", + "keyword argument.\n", + "\n", + "\n", + "## Introduction\n", + "\n", + "A good example is numpy's 1D interpolate function :py:func:`numpy.interp`:\n", + "\n", + "```\n", + " Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n", + " Docstring:\n", + " One-dimensional linear interpolation.\n", + "\n", + " Returns the one-dimensional piecewise linear interpolant to a function\n", + " with given discrete data points (`xp`, `fp`), evaluated at `x`.\n", + "```\n", + "\n", + "This function expects 1D arrays as input, so there is one core dimension, and we cannot easily apply \n", + "it to an nD array since there is no `axis` keyword argument. \n", + "\n", + "\n", + "### Core dimensions and looping\n", + "A common pattern to solve this problem is to loop over all the other dimensions. Say we want to\n", + "interpolate an array with two dimensions (`space`, `time`) along the `time` dimension. We might:\n", + "1. Loop over the `space` dimension, \n", + "2. Subset the array to a 1D array at that `space` location, \n", + "3. Interpolate the 1D array to the new `time` vector, and\n", + "4. 
Assign that new interpolated 1D array to the appropriate location of a 2D output array\n", + "\n", + "In pseudo-code this might look like\n", + "\n", + "```python\n", + "for index in range(size_of_space_axis):\n", + " out[index, :] = np.interp(..., array[index, :], ...)\n", + "```\n", + "\n", + "\n", + "#### Exercise\n", + "\n", + "Consider the example problem of interpolating a 2D array with dimensions `space` and `time` along the `time` dimension.\n", + "Which dimension is the core dimension, and which is the \"loop dimension\"?\n", + "\n", + "\n", + "## Vectorization\n", + "\n", + "The pattern of looping over any number of \"loop dimensions\" and applying a function along \"core dimensions\" \n", + "is so common that numpy provides wrappers that automate these steps: \n", + "1. numpy.apply_along_axis\n", + "1. numpy.apply_over_axes\n", + "1. numpy.vectorize\n", + "\n", + "\n", + "`apply_ufunc` provides an easy interface to `numpy.vectorize` through the keyword argument `vectorize`. Here we see how to use\n", + "that to automatically apply `np.interp` along a single axis of an nD array.\n", + "\n", + "## Load data\n", + "\n", + "First let's load an example dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76aa13b8-5ced-4468-a72e-6b0a29172d6d", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import xarray as xr\n", + "import numpy as np\n", + "\n", + "air = (\n", + " xr.tutorial.load_dataset(\"air_temperature\")\n", + " .air.sortby(\"lat\") # np.interp needs coordinate in ascending order\n", + " .isel(time=slice(4), lon=slice(3)) # choose a small subset for convenience\n", + ")\n", + "air" + ] + }, + { + "cell_type": "markdown", + "id": "ef161fb5-15d7-4d13-831e-6407d5293bc1", + "metadata": {}, + "source": [ + "The function we will apply is `np.interp`, which expects 1D numpy arrays. This functionality is already implemented in xarray, so we use that capability to make sure we are not making mistakes."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf37f81b-f6b5-42d4-b20e-0d22e3d124d6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "air.interp(lat=newlat)" + ] + }, + { + "cell_type": "markdown", + "id": "426f5211-031e-4096-8bdc-78a0cb4c1766", + "metadata": {}, + "source": [ + "Let's define a function that works with one vector of data along `lat` at a time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb286fa0-deba-4929-b18a-79af5acb0b5b", + "metadata": {}, + "outputs": [], + "source": [ + "def interp1d_np(data, x, xi):\n", + " return np.interp(xi, x, data)\n", + "\n", + "\n", + "interped = interp1d_np(air.isel(time=0, lon=0), air.lat, newlat)\n", + "expected = air.interp(lat=newlat)\n", + "\n", + "# no errors are raised if values are equal to within floating point precision\n", + "np.testing.assert_allclose(expected.isel(time=0, lon=0).values, interped)" + ] + }, + { + "cell_type": "markdown", + "id": "454a9887-948a-45f0-9b7c-dc07bb542726", + "metadata": {}, + "source": [ + "## `exclude_dims`" + ] + }, + { + "cell_type": "markdown", + "id": "42dad673-7987-4dff-870b-c2bdff440993", + "metadata": {}, + "source": [ + "\n", + "```\n", + "exclude_dims : set, optional\n", + " Core dimensions on the inputs to exclude from alignment and\n", + " broadcasting entirely. Any input coordinates along these dimensions\n", + " will be dropped. Each excluded dimension must also appear in\n", + " ``input_core_dims`` for at least one argument. 
Only dimensions listed\n", + " here are allowed to change size between input and output objects.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0ce03d38-db86-49c2-a60d-46d312ad60f8", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " interp1d_np, # first the function\n", + " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1d_np'\n", + " air.lat,\n", + " newlat,\n", + " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be a set!\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "79db343e-efcc-4fa5-a2ec-1da91bc93225", + "metadata": {}, + "source": [ + "## Core dimensions\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "e7c81b20-83a1-4575-8e9f-bf929ecc66ba", + "metadata": {}, + "source": [ + "Core dimensions are central to using `apply_ufunc`. In our case, our function expects to receive a 1D vector along `lat` — this is the dimension that is \"core\" to the function's functionality. Multiple core dimensions are possible. `apply_ufunc` needs to know which dimensions of each variable are core dimensions.\n", + "\n", + " input_core_dims : Sequence[Sequence], optional\n", + " List of the same length as ``args`` giving the list of core dimensions\n", + " on each input argument that should not be broadcast. 
By default, we\n", + " assume there are no core dimensions on any input arguments.\n", + "\n", + " For example, ``input_core_dims=[[], ['time']]`` indicates that all\n", + " dimensions on the first argument and all dimensions other than 'time'\n", + " on the second argument should be broadcast.\n", + "\n", + " Core dimensions are automatically moved to the last axes of input\n", + " variables before applying ``func``, which facilitates using NumPy style\n", + " generalized ufuncs [2]_.\n", + " \n", + " output_core_dims : List[tuple], optional\n", + " List of the same length as the number of output arguments from\n", + " ``func``, giving the list of core dimensions on each output that were\n", + " not broadcast on the inputs. By default, we assume that ``func``\n", + " outputs exactly one array, with axes corresponding to each broadcast\n", + " dimension.\n", + "\n", + " Core dimensions are assumed to appear as the last dimensions of each\n", + " output in the provided order.\n", + " \n", + "Next we specify `\"lat\"` as `input_core_dims` on both `air` and `air.lat`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ced88c6-84ca-4294-9ed6-479ee353b3c8", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " interp1d_np, # first the function\n", + " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", + " air.lat,\n", + " newlat,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], []],\n", + " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cb20872d-06ae-400d-ada7-443ce6fa02a4", + "metadata": {}, + "source": [ + "xarray is telling us that it expected to receive back a numpy array with 0 dimensions but instead received an array with 1 dimension corresponding to `newlat`. 
We can fix this by specifying `output_core_dims`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3957ac5-c622-41f9-b774-5527693d370e", + "metadata": {}, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " interp1d_np, # first the function\n", + " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1d_np'\n", + " air.lat,\n", + " newlat,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be a set!\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "60cc2c23-bdf9-41f1-a09a-a6df8531fd21", + "metadata": {}, + "source": [ + "Finally we get some output! Let's check that this is right.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6778fe6-5433-4f8d-bb40-2fb41b78b574", + "metadata": {}, + "outputs": [], + "source": [ + "interped = xr.apply_ufunc(\n", + " interp1d_np, # first the function\n", + " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1d_np'\n", + " air.lat,\n", + " newlat,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be a set!\n", + ")\n", + "interped[\"lat\"] = newlat # need to add this manually\n", + "xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)" + ] + }, + { + "cell_type": "markdown", + "id": "3c7ea6dd-260e-4315-9d6d-3dbe2af2538e", + "metadata": {}, + "source": [ + "No errors are raised, so it is right!"
+ ] + }, + { + "cell_type": "markdown", + "id": "3441f26d-8076-47a4-82c1-108c4c870cb9", + "metadata": {}, + "source": [ + "## Vectorization with `np.vectorize`" + ] + }, + { + "cell_type": "markdown", + "id": "121aeb07-d444-4394-a4ef-9199dc2725a5", + "metadata": {}, + "source": [ + "Our function currently works on only one vector of data, which is not so useful given our 3D dataset.\n", + "Let's try passing the whole dataset. We add a `print` statement so we can see what our function receives." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9097c318-5114-4f00-a8e8-f85cdae39bf3", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "def interp1d_np(data, x, xi):\n", + " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", + " return np.interp(xi, x, data)\n", + "\n", + "\n", + "interped = xr.apply_ufunc(\n", + " interp1d_np, # first the function\n", + " air.isel(lon=slice(3), time=slice(4)), # now arguments in the order expected by 'interp1d_np'\n", + " air.lat,\n", + " newlat,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be a set!\n", + ")\n", + "interped[\"lat\"] = newlat # need to add this manually\n", + "xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)" + ] + }, + { + "cell_type": "markdown", + "id": "f4873ac9-7000-4a0a-9d8f-e425e2bf2671", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "That's a hard-to-interpret error, but our `print` call helpfully printed the shapes of the input data: \n", + "\n", + " data: (10, 53, 25) | x: (25,) | xi: (100,)\n", + "\n", + "We see that `air` has been passed as a 3D numpy array, which is not what `np.interp` expects. 
Instead we want to loop over all combinations of `lon` and `time`, and apply our function to each corresponding vector of data along `lat`.\n", + "`apply_ufunc` makes this easy by specifying `vectorize=True`:\n", + "\n", + " vectorize : bool, optional\n", + " If True, then assume ``func`` only takes arrays defined over core\n", + " dimensions as input and vectorize it automatically with\n", + " :py:func:`numpy.vectorize`. This option exists for convenience, but is\n", + " almost always slower than supplying a pre-vectorized function.\n", + " Using this option requires NumPy version 1.12 or newer.\n", + " \n", + "Also see the documentation for `np.vectorize`. Most importantly:\n", + "\n", + " The vectorize function is provided primarily for convenience, not for performance. \n", + " The implementation is essentially a for loop." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f8d8f74-3282-403d-8cce-7434780b29a3", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "def interp1d_np(data, x, xi):\n", + " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", + " return np.interp(xi, x, data)\n", + "\n", + "\n", + "interped = xr.apply_ufunc(\n", + " interp1d_np, # first the function\n", + " air, # now arguments in the order expected by 'interp1d_np'\n", + " air.lat, # as above\n", + " newlat, # as above\n", + " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", + " output_core_dims=[[\"lat\"]], # returned data has one dimension\n", + " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be a set!\n", + " vectorize=True, # loop over non-core dims\n", + ")\n", + "interped[\"lat\"] = newlat # need to add this manually\n", + "xr.testing.assert_allclose(expected, interped)" + ] + }, + { + "cell_type": "markdown", + "id": "a0e1347f-3473-4fb6-9506-de46e314ba8c", + "metadata": {}, + "source": [ + "This unfortunately is another cryptic error from numpy. 
\n", + "\n", + "Notice that `newlat` is not an xarray object. Let's add a dimension name `new_lat` and modify the call. Note this cannot be `lat` because xarray expects dimensions to be the same size (or broadcastable) among all inputs. `output_core_dims` needs to be modified appropriately. We'll manually rename `new_lat` back to `lat` for easy checking." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "add52a3e-6c10-4761-ad39-101a58cda921", + "metadata": {}, + "outputs": [], + "source": [ + "def interp1d_np(data, x, xi):\n", + " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", + " return np.interp(xi, x, data)\n", + "\n", + "\n", + "interped = xr.apply_ufunc(\n", + " interp1d_np, # first the function\n", + " air, # now arguments in the order expected by 'interp1_np'\n", + " air.lat, # as above\n", + " newlat, # as above\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"new_lat\"]], # list with one entry per arg\n", + " output_core_dims=[[\"new_lat\"]], # returned data has one dimension\n", + " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be a set!\n", + " vectorize=True, # loop over non-core dims\n", + ")\n", + "interped = interped.rename({\"new_lat\": \"lat\"})\n", + "interped[\"lat\"] = newlat # need to add this manually\n", + "xr.testing.assert_allclose(\n", + " expected.transpose(*interped.dims), interped # order of dims is different\n", + ")\n", + "interped" + ] + }, + { + "cell_type": "markdown", + "id": "d81f399e-1649-4d4b-ad28-81cba8403210", + "metadata": {}, + "source": [ + "Notice that the printed input shapes are all 1D and correspond to one vector along the `lat` dimension.\n", + "\n", + "The result is now an xarray object with coordinate values copied over from `data`. 
This is why `apply_ufunc` is so convenient; it takes care of a lot of boilerplate necessary to apply functions that consume and produce numpy arrays to xarray objects.\n", + "\n", + "One final point: `lat` is now the *last* dimension in `interped`. This is a \"property\" of core dimensions: they are moved to the end before being sent to `interp1d_np` as was noted in the docstring for `input_core_dims`\n", + "\n", + " Core dimensions are automatically moved to the last axes of input\n", + " variables before applying ``func``, which facilitates using NumPy style\n", + " generalized ufuncs [2]_." + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb new file mode 100644 index 00000000..4bc77417 --- /dev/null +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -0,0 +1,167 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0203c86c-2e42-4226-acd1-1e9bcffc6708", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "# Handling complex output\n", + "\n", + "\n", + "## Introduction\n", + "\n", + "A good example is numpy's 1D interpolate function :py:func:`numpy.interp`:\n", + "\n", + "```\n", + " Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n", + " Docstring:\n", + " One-dimensional linear interpolation.\n", + "\n", + " Returns the one-dimensional piecewise linear interpolant to a function\n", + " with given discrete data points (`xp`, `fp`), evaluated at `x`.\n", + "```\n", + "\n", + "This function expects a 1D array as input, and returns a 1D array as output.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"18135586-b2d9-4f1d-abe2-d7afa036e2bc", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import xarray as xr\n", + "import numpy as np\n", + "\n", + "air = (\n", + " xr.tutorial.load_dataset(\"air_temperature\")\n", + " .air.sortby(\"lat\") # np.interp needs coordinate in ascending order\n", + " .isel(time=-0, lon=0) # choose a 1D subset\n", + ")\n", + "air" + ] + }, + { + "cell_type": "markdown", + "id": "db79d3e3-2a1a-4644-82a0-d58da74e0e36", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "## Adding a new dimension" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5e9b4b4-712d-4ef7-8e13-9bbbad59728a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat_interp\"]],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "7767a63d-20c5-4c2d-8bf0-3b26bc2b336f", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "## Dimensions that change size" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fc8f8f5-0560-46f2-be65-8daabea2d837", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "%xmode minimal\n", + "\n", + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61a39f36-9386-407b-b1c0-b49953da9d0c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " 
newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a3c45b16-795c-4549-95ee-8e9c0d7dd517", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "## TODO: Returning multiple variables" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 9acc9ef2..3d4c6254 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -5,22 +5,14 @@ "id": "7a8ff28a-4be3-469f-8cf4-9297e71cc4ca", "metadata": {}, "source": [ - "\n", - "\n", "# Handling dask arrays\n", "\n", - "`apply_ufunc` is a more advanced wrapper that is designed to apply functions\n", - "that expect and return NumPy (or other arrays). For example, this would include\n", - "all of SciPy's API. Since `apply_ufunc` operates on lower-level NumPy or Dask\n", - "objects, it skips the overhead of using Xarray objects making it a good choice\n", - "for performance-critical functions.\n", - "\n", - "`apply_ufunc` can be a little tricky to get right since it operates at a lower\n", - "level than `map_blocks`. 
On the other hand, Xarray uses `apply_ufunc` internally\n", - "to implement much of its API, meaning that it is quite powerful!\n", + "We have previously worked through applying functions to NumPy arrays contained in Xarray objects.\n", + "`apply_ufunc` also lets you easily perform many of the steps involved in applying \n", + "functions that expect and return Dask arrays.\n", "\n", "Learning goals:\n", - "- Learn that `apply_ufunc` automates aspects of applying computation functions that are designed for pure arrays (like numpy arrays) on xarray objects\n", + "- Learn that `apply_ufunc` can automate aspects of applying computation functions on dask arrays\n", "\n", "## Setup" ] diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index a5d9e32e..37d783ca 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -3,21 +3,29 @@ { "cell_type": "markdown", "id": "7a8ff28a-4be3-469f-8cf4-9297e71cc4ca", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ - "\n", - "\n", "# A gentle introduction\n", "\n", - "`apply_ufunc` is a more advanced wrapper that is designed to apply functions\n", - "that expect and return NumPy (or other arrays). For example, this would include\n", - "all of SciPy's API. Since `apply_ufunc` operates on lower-level NumPy or Dask\n", - "objects, it skips the overhead of using Xarray objects making it a good choice\n", - "for performance-critical functions. Xarray uses `apply_ufunc` internally\n", - "to implement much of its API, meaning that it is quite powerful!\n", + "Many, but not all, useful array methods are wrapped by Xarray and accessible\n", + "as methods on Xarray objects. For example, `DataArray.mean` calls `numpy.mean`.\n", + "A very common use-case is to apply functions that expect and return NumPy \n", + "(or other array types) on Xarray objects. 
For example, this would include all of SciPy's API. \n", + "Applying many of these functions to Xarray objects involves a series of repeated steps.\n", + "`apply_ufunc` provides a convenient wrapper function that generalizes the steps\n", + "involved in applying such functions to Xarray objects.\n", + "\n", + "```{tip}\n", + "Xarray uses `apply_ufunc` internally to implement much of its API, meaning that it is quite powerful!\n", + "```\n", + "\n", "Learning goals:\n", "- Learn that `apply_ufunc` automates aspects of applying computation functions that are designed for pure arrays (like numpy arrays) on xarray objects\n", + "- Learn how to identify \"core dimensions\" for the function you're applying.\n", + "- Learn that \"core dimensions\" are automatically moved or transposed to the end of the array.\n", "\n", "## Setup" ] @@ -204,11 +212,14 @@ { "cell_type": "markdown", "id": "79f2c642-b814-48e0-a84c-7f1fb0858c53", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ - "apply_ufunc easily handles both dataarrays and datasets. \n", + "`apply_ufunc` easily handles both DataArrays and Datasets. \n", "\n", - "When passed a Dataset, apply-ufunc will loop over the data variables and sequentially pass those to `squared_error`. So `squared_error` always receives a numpy array" + "When passed a Dataset, `apply_ufunc` will loop over the data variables and sequentially pass those to `squared_error`. 
So `squared_error` always receives a numpy array" ] }, { @@ -231,16 +242,34 @@ "xr.apply_ufunc(squared_error, ds, 1)" ] }, + { + "cell_type": "markdown", + "id": "f8359efc-ffde-4357-a4fb-702dc064c85a", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "## TODO: Passing args, kwargs" + ] + }, { "cell_type": "markdown", "id": "ac4f4511-2535-4dd8-af2e-085a9f453e9f", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "## Reductions and core dimensions\n", "\n", - "`squared_error` operated on a per-element basis. How about a reduction like `np.mean`?\n", + "`squared_error` operated on a per-element basis. How about a reduction like `np.mean`? \n", "\n", - "Such functions involve the concept of \"core dimensions\". One way to think about core dimensions is to consider the smallest dimensionality of data necessary to apply the function.\n", + "Such functions involve the concept of \"core dimensions\". \n", + "\n", + "```{important}\n", + "One way to think about core dimensions is to consider the smallest dimensionality of data necessary to apply the function.\n", + "```\n", "\n", "For using more complex operations that consider some array values collectively,\n", "it’s important to understand the idea of **core dimensions**. \n", @@ -314,10 +343,15 @@ { "cell_type": "markdown", "id": "f7be1720-b529-444d-852e-a6f63563ac3f", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "The error here\n", - "> applied function returned data with unexpected number of dimensions. Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')\n", + "```\n", + "applied function returned data with unexpected number of dimensions. Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')\n", + "```\n", "\n", "means that while `np.mean` did indeed reduce one dimension, we did not tell `apply_ufunc` that this would happen. 
That is, we need to specify the core dimensions on the input." ] @@ -346,11 +380,17 @@ { "cell_type": "markdown", "id": "534b12aa-8a69-457b-a901-70a2358b0b71", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "This next error is a little confusing.\n", "\n", - "> size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.\n", + "```\n", + "size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. \n", + "Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.\n", + "```\n", "\n", "\n", "A good trick here is to pass a little wrapper function to `apply_ufunc` instead and inspect the shapes of data received by the wrapper.\n" @@ -410,9 +450,15 @@ { "cell_type": "markdown", "id": "81fe9fe1-83ff-4d60-bbee-fc11ecd4f73a", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ - "This illustrates an important concept: **arrays are transposed so that core dimensions are at the end**. 
\n", + "This illustrates an important concept: \n", + "```{important}\n", + "Arrays are transposed so that core dimensions are at the end.\n", + "```\n", "\n", "With `apply_ufunc`, core dimensions are recognized by name, and then moved to\n", "the last dimension of any input arguments before applying the given function.\n", diff --git a/advanced/apply_ufunc/vectorize_1d.ipynb b/advanced/apply_ufunc/vectorize_1d.ipynb index f4398ce1..3932a5e8 100644 --- a/advanced/apply_ufunc/vectorize_1d.ipynb +++ b/advanced/apply_ufunc/vectorize_1d.ipynb @@ -2,16 +2,22 @@ "cells": [ { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ - "# Applying unvectorized functions\n", + "# Interpolation : An end-to-end example\n", "\n", "**Author** [Deepak Cherian (NCAR)](https://cherian.net)" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "This example will illustrate how to conveniently apply an unvectorized function `func` to xarray objects using `apply_ufunc`. `func` expects 1D numpy arrays and returns a 1D numpy array. Our goal is to coveniently apply this function along a dimension of xarray objects that may or may not wrap dask arrays with a signature.\n", "\n", @@ -27,6 +33,16 @@ "and write an `xr_interp` function with signature\n", "\n", " xr_interp(xarray_object, dimension_name, new_coordinate_to_interpolate_to)\n", + " \n", + " \n", + "## Learning goals \n", + "\n", + "This example will illustrate \n", + "1. Specifying core dimensions with `input_core_dims`\n", + "1. Handling core dimensions of the output with `output_core_dims`\n", + "1. Handling core dimensions that change size using `exclude_dims`\n", + "1. Automatic vectorizing or looping over dimensions that are not core dimensions using `vectorize=True`\n", + "1. 
Automatic parallelization with dask arrays using `dask=\"parallelized\"`\n",
     "\n",
     "## Load data\n",
     "\n",

From 1291a5c2fdb6e068a506347e575b559853f4dd33 Mon Sep 17 00:00:00 2001
From: dcherian
Date: Fri, 16 Jun 2023 13:55:04 -0600
Subject: [PATCH 02/38] More complex output

---
 .../apply_ufunc/complex-output-numpy.ipynb    | 207 +++++++++++++++++-
 .../simple_numpy_apply_ufunc.ipynb            |   8 +-
 2 files changed, 206 insertions(+), 9 deletions(-)

diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb
index 4bc77417..fc26045b 100644
--- a/advanced/apply_ufunc/complex-output-numpy.ipynb
+++ b/advanced/apply_ufunc/complex-output-numpy.ipynb
@@ -10,10 +10,16 @@
    "source": [
     "# Handling complex output\n",
     "\n",
+    "We've seen how to use `apply_ufunc` to handle relatively simple functions that transform every element, or reduce along a single dimension.\n",
+    "\n",
+    "This lesson will show you how to handle cases where the output is more complex in two ways:\n",
+    "1. Handle adding a new dimension by speicfying `output_core_dims`\n",
+    "1. 
Handling the change in size of an existing dimension by specifying `exclude_dims`\n", + "\n", "\n", "## Introduction\n", "\n", - "A good example is numpy's 1D interpolate function :py:func:`numpy.interp`:\n", + "A good example of a function that returns relatively complex output is numpy's 1D interpolate function `numpy.interp`:\n", "\n", "```\n", " Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n", @@ -24,7 +30,17 @@ " with given discrete data points (`xp`, `fp`), evaluated at `x`.\n", "```\n", "\n", - "This function expects a 1D array as input, and returns a 1D array as output.\n" + "This function expects a 1D array as input, and returns a 1D array as output.\n", + "\n", + "\n", + "```{tip}Exercise\n", + "How many core dimensions does `numpy.interp` handle?\n", + "```\n", + "```{tip}Solution\n", + ":class:dropdown\n", + "\n", + "One.\n", + "```\n" ] }, { @@ -55,7 +71,29 @@ "user_expressions": [] }, "source": [ - "## Adding a new dimension" + "## Adding a new dimension\n", + "\n", + "1D interpolation transforms the size of the input along a single dimension.\n", + "\n", + "Logically, we can think of this as removing the old dimension and adding a new dimension.\n", + "\n", + "We provide this information to `apply_ufunc` using the `output_core_dims` keyword argument\n", + "\n", + "```\n", + " output_core_dims : List[tuple], optional\n", + " List of the same length as the number of output arguments from\n", + " ``func``, giving the list of core dimensions on each output that were\n", + " not broadcast on the inputs. 
By default, we assume that ``func``\n", + " outputs exactly one array, with axes corresponding to each broadcast\n", + " dimension.\n", + "\n", + " Core dimensions are assumed to appear as the last dimensions of each\n", + " output in the provided order.\n", + "```\n", + "\n", + "For `interp` we expect one returned output with one new core dimension that we will call `\"lat_interp\"`.\n", + "\n", + "Specify this using `output_core_dims=[[\"lat_interp\"]]`" ] }, { @@ -79,6 +117,40 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "c0a5b8d4-729e-4d0e-b284-4751c5edc37c", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "```{tip}Exercise\n", + "\n", + "\n", + "Apply the following function using `apply_ufunc`. It adds a new dimension to the input array, let's call it `newdim`. Specify the new dimension using `output_core_dims`\n", + "\n", + "```python\n", + "def add_new_dim(array):\n", + " return np.expand_dims(array, axis=0)\n", + "```\n", + "```{tip}Solution\n", + ":class: dropdown\n", + "\n", + "```{code-cell} python\n", + "def add_new_dim(array):\n", + " return np.expand_dims(array, axis=0)\n", + "\n", + "\n", + "xr.apply_ufunc(\n", + " add_new_dim,\n", + " air,\n", + " input_core_dims=[[\"lat\"]],\n", + " output_core_dims=[[\"newdim\", \"lat\"]],\n", + ")\n", + "```" + ] + }, { "cell_type": "markdown", "id": "7767a63d-20c5-4c2d-8bf0-3b26bc2b336f", @@ -87,7 +159,11 @@ "user_expressions": [] }, "source": [ - "## Dimensions that change size" + "## Dimensions that change size\n", + "\n", + "Imagine that you want the output to have the same dimension name `\"lat\"` i.e. 
applying `np.interp` changes the size of the `\"lat\"` dimension.\n",
+    "\n",
+    "We get an error if we specify `\"lat\"` in `output_core_dims`"
    ]
   },
@@ -101,7 +177,8 @@
    },
    "outputs": [],
    "source": [
-    "%xmode minimal\n",
+    "# minimize error message\n",
+    "%xmode Minimal\n",
     "\n",
     "newlat = np.linspace(15, 75, 100)\n",
     "\n",
@@ -112,7 +189,34 @@
     " air,\n",
     " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
     " output_core_dims=[[\"lat\"]],\n",
-    ")"
+    ")\n",
+    "\n",
+    "%xmode Context"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5276692d-0e1d-498a-8d60-e08a4d8b9d3a",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "As the error message points out,\n",
+    "```\n",
+    "Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.\n",
+    "```\n",
+    "\n",
+    "Looking at the docstring we need to specify `exclude_dims` as a \"set\":\n",
+    "\n",
+    "```\n",
+    "exclude_dims : set, optional\n",
+    "    Core dimensions on the inputs to exclude from alignment and\n",
+    "    broadcasting entirely. Any input coordinates along these dimensions\n",
+    "    will be dropped. Each excluded dimension must also appear in\n",
+    "    ``input_core_dims`` for at least one argument. 
Only dimensions listed\n", + " here are allowed to change size between input and output objects.\n", + "```\n" ] }, { @@ -145,7 +249,96 @@ "user_expressions": [] }, "source": [ - "## TODO: Returning multiple variables" + "## Returning multiple variables\n", + "\n", + "Another common, but more complex, case is to handle multiple outputs returned by the function.\n", + "\n", + "As an example we will write a function that returns the minimum and maximum value along the last axis of the array.\n", + "\n", + "We will work with a 2D array, and apply the function `minmax` along the `\"lat\"` dimension:\n", + "```python\n", + "def minmax(array):\n", + " return array.min(axis=-1), array.max(axis=-1)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c7accd69-fece-46be-852b-0cb7c432197a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def minmax(array):\n", + " return array.min(axis=-1), array.max(axis=-1)\n", + "\n", + "\n", + "air2d = xr.tutorial.load_dataset(\"air_temperature\").air.isel(time=0)\n", + "air2d" + ] + }, + { + "cell_type": "markdown", + "id": "c99f6a5e-f977-4828-9418-202d93d0acda", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "By default, Xarray assumes one array is returned by the applied function.\n", + "\n", + "Here we have two returned arrays, and the input core dimension `\"lat\"` is removed (or reduced over).\n", + "\n", + "So we provide `output_core_dims=[[], []]` i.e. an empty list of core dimensions for each of the two returned arrays." 
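The multiple-output pattern just described can be checked with a small self-contained sketch — synthetic data stands in for the tutorial's air temperature field here, so the array and its dimension sizes are illustrative:

```python
import numpy as np
import xarray as xr


def minmax(array):
    # reduce over the last axis, returning two arrays
    return array.min(axis=-1), array.max(axis=-1)


# small stand-in for the tutorial dataset: 3 points along "lon", 4 along "lat"
arr = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("lon", "lat"))

minda, maxda = xr.apply_ufunc(
    minmax,
    arr,
    input_core_dims=[["lat"]],   # "lat" is reduced away by the function
    output_core_dims=[[], []],   # two outputs, each with no core dimensions
)
```

Both `minda` and `maxda` come back as DataArrays carrying only the `lon` dimension, matching the explanation above.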
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "204c649e-18d4-403a-9366-c46caaaefb52", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "minda, maxda = xr.apply_ufunc(\n", + " minmax,\n", + " air2d,\n", + " input_core_dims=[[\"lat\"]],\n", + " output_core_dims=[[], []],\n", + ")\n", + "minda" + ] + }, + { + "cell_type": "markdown", + "id": "b9b023e8-5ca4-436a-bdfb-3ce35f6ea712", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "```{tip}Exercise\n", + "\n", + "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the following use of `apply_ufunc` generalizes to an array with more than one dimension? Try it with `air3d = xr.tutorial.load_dataset(\"air_temperature\").air.isel(time=0)`\n", + "\n", + "```{code-cell} python\n", + "minda, maxda = xr.apply_ufunc(\n", + " minmax,\n", + " air2d,\n", + " input_core_dims=[[\"lat\"]],\n", + " output_core_dims=[[],[]],\n", + ")\n", + "```\n", + "\n", + "```{tip}Solution\n", + ":class:dropdown\n", + "\n", + "We want to use `minmax` to compute the minimum and maximum along the \"lat\" dimension always, regardless of how many dimensions are on the input. So we specify `input_core_dims=[[\"lat\"]]`. The output does not contain the \"lat\" dimension, but we expect two returned variables. So we pass an empty list `[]` for each returned array, so `output_core_dims=[[], []]`\n", + "\n", + "```" ] } ], diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index 37d783ca..7a1b5db9 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -353,7 +353,9 @@ "applied function returned data with unexpected number of dimensions. 
Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')\n",
     "```\n",
     "\n",
-    "means that while `np.mean` did indeed reduce one dimension, we did not tell `apply_ufunc` that this would happen. That is, we need to specify the core dimensions on the input."
+    "means that while `np.mean` did indeed reduce one dimension, we did not tell `apply_ufunc` that this would happen. That is, we need to specify the core dimensions on the input.\n",
+    "\n",
+    "Do that by passing a list of dimension names for each input object. For this function we have one input, `ds`, with a single core dimension `\"time\"`, so we have `input_core_dims=[[\"time\"]]`"
    ]
   },
@@ -372,7 +374,9 @@
     " ds,\n",
     " # specify core dimensions as a list of lists\n",
     " # here 'time' is the core dimension on `ds`\n",
-    " input_core_dims=[[\"time\"]],\n",
+    " input_core_dims=[\n",
+    " [\"time\"], # core dimension for ds\n",
+    " ],\n",
     " kwargs={\"axis\": 0},\n",
     ")"
    ]

From 799c94ac0d6dc4f5e60bbee2c7cdb11cdb02fb61 Mon Sep 17 00:00:00 2001
From: dcherian
Date: Fri, 16 Jun 2023 13:59:57 -0600
Subject: [PATCH 03/38] Fix syntax

---
 advanced/apply_ufunc/complex-output-numpy.ipynb | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb
index fc26045b..24f90470 100644
--- a/advanced/apply_ufunc/complex-output-numpy.ipynb
+++ b/advanced/apply_ufunc/complex-output-numpy.ipynb
@@ -33,10 +33,10 @@
     "This function expects a 1D array as input, and returns a 1D array as output.\n",
     "\n",
     "\n",
-    "```{tip}Exercise\n",
+    "```{tip} Exercise\n",
     "How many core dimensions does `numpy.interp` handle?\n",
     "```\n",
-    "```{tip}Solution\n",
+    "```{tip} Solution\n",
     ":class:dropdown\n",
     "\n",
     "One.\n",
     "```\n"
    ]
   },
@@ -125,8 +125,7 @@
    "user_expressions": []
   },
   "source": [
-    "```{tip}Exercise\n",
-    "\n",
+    "```{tip} Exercise\n",
     "\n",
     "Apply the following function using `apply_ufunc`. 
It adds a new dimension to the input array, let's call it `newdim`. Specify the new dimension using `output_core_dims`\n", "\n", @@ -134,7 +133,7 @@ "def add_new_dim(array):\n", " return np.expand_dims(array, axis=0)\n", "```\n", - "```{tip}Solution\n", + "```{tip} Solution\n", ":class: dropdown\n", "\n", "```{code-cell} python\n", @@ -320,7 +319,7 @@ "user_expressions": [] }, "source": [ - "```{tip}Exercise\n", + "```{tip} Exercise\n", "\n", "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the following use of `apply_ufunc` generalizes to an array with more than one dimension? Try it with `air3d = xr.tutorial.load_dataset(\"air_temperature\").air.isel(time=0)`\n", "\n", @@ -333,7 +332,7 @@ ")\n", "```\n", "\n", - "```{tip}Solution\n", + "```{tip} Solution\n", ":class:dropdown\n", "\n", "We want to use `minmax` to compute the minimum and maximum along the \"lat\" dimension always, regardless of how many dimensions are on the input. So we specify `input_core_dims=[[\"lat\"]]`. The output does not contain the \"lat\" dimension, but we expect two returned variables. So we pass an empty list `[]` for each returned array, so `output_core_dims=[[], []]`\n", From fccc15fd6a613ace1ee1cd543adf638a291d6e49 Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 16 Jun 2023 14:06:10 -0600 Subject: [PATCH 04/38] fix syntax? 
--- advanced/apply_ufunc/complex-output-numpy.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index 24f90470..b4750295 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -37,7 +37,7 @@ "How many core dimensions does `numpy.interp` handle?\n", "```\n", "```{tip} Solution\n", - ":class:dropdown\n", + ":class: dropdown\n", "\n", "One.\n", "```\n" @@ -333,7 +333,7 @@ "```\n", "\n", "```{tip} Solution\n", - ":class:dropdown\n", + ":class: dropdown\n", "\n", "We want to use `minmax` to compute the minimum and maximum along the \"lat\" dimension always, regardless of how many dimensions are on the input. So we specify `input_core_dims=[[\"lat\"]]`. The output does not contain the \"lat\" dimension, but we expect two returned variables. So we pass an empty list `[]` for each returned array, so `output_core_dims=[[], []]`\n", "\n", From 627fcce66af445a48eacbff8688d03d1303c527c Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 16 Jun 2023 14:09:18 -0600 Subject: [PATCH 05/38] Add more solutions --- .../simple_numpy_apply_ufunc.ipynb | 29 ++++++++----------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index 7a1b5db9..d3d9b693 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -499,29 +499,24 @@ { "cell_type": "markdown", "id": "f62f8f73-cd7f-45fe-b379-6e9b675d38a0", - "metadata": {}, - "source": [ - "## Exercise\n", - "\n", - "Use `apply_ufunc` to apply `sp.integrate.trapz` along the `time` axis.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7e85abfe-51c3-45be-b47b-957f03a3306b", "metadata": { - "tags": [ - "hide-output", - "hide-input" - ] + "tags": [], + 
"user_expressions": [] }, - "outputs": [], "source": [ + "```{tip} Exercise\n", + "\n", + "Use `apply_ufunc` to apply `scipy.integrate.trapz` along the `time` axis.\n", + "```\n", + "```{tip} Solution\n", + ":class: dropdown\n", + "\n", + "```{code-cell} python\n", "import scipy as sp\n", "import scipy.integrate\n", "\n", - "xr.apply_ufunc(scipy.integrate.trapz, ds, input_core_dims=[[\"time\"]], kwargs={\"axis\": -1})" + "xr.apply_ufunc(scipy.integrate.trapz, ds, input_core_dims=[[\"time\"]], kwargs={\"axis\": -1})\n", + "```" ] } ], From 20c4537e339879d96f3a29666077c7d6579df7c3 Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 16 Jun 2023 14:11:30 -0600 Subject: [PATCH 06/38] Remove nested code-cell --- advanced/apply_ufunc/complex-output-numpy.ipynb | 4 ++-- advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index b4750295..4dadebb2 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -136,7 +136,7 @@ "```{tip} Solution\n", ":class: dropdown\n", "\n", - "```{code-cell} python\n", + "``` python\n", "def add_new_dim(array):\n", " return np.expand_dims(array, axis=0)\n", "\n", @@ -323,7 +323,7 @@ "\n", "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the following use of `apply_ufunc` generalizes to an array with more than one dimension? 
Try it with `air3d = xr.tutorial.load_dataset(\"air_temperature\").air.isel(time=0)`\n",
     "\n",
-    "```{code-cell} python\n",
+    "```python\n",
     "minda, maxda = xr.apply_ufunc(\n",
     "    minmax,\n",
     "    air2d,\n",
diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb
index d3d9b693..28e3a2a4 100644
--- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb
+++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb
@@ -511,7 +511,7 @@
     "```{tip} Solution\n",
     ":class: dropdown\n",
     "\n",
-    "```{code-cell} python\n",
+    "```python\n",
     "import scipy as sp\n",
     "import scipy.integrate\n",
     "\n",

From 3fe71520bc2c47928f0c57f67e81b5d8a395e460 Mon Sep 17 00:00:00 2001
From: dcherian
Date: Wed, 21 Jun 2023 16:22:19 -0600
Subject: [PATCH 07/38] Updates

---
 .../automatic-vectorizing-numpy.ipynb         |  68 +++-
 .../apply_ufunc/complex-output-numpy.ipynb    |  22 +-
 ...ectorize_1d.ipynb => example-interp.ipynb} | 105 ++++--
 .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 321 +++++++++++++++---
 .../simple_numpy_apply_ufunc.ipynb            |   2 +-
 5 files changed, 425 insertions(+), 93 deletions(-)
 rename advanced/apply_ufunc/{vectorize_1d.ipynb => example-interp.ipynb} (94%)

diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb
index e1ab1783..949bd1b6 100644
--- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb
+++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb
@@ -80,7 +80,11 @@
     "\n",
     "## Load data\n",
     "\n",
-    "First lets load an example dataset"
+    "First let's load an example dataset\n",
+    "\n",
+    "```{tip}\n",
+    "We'll reduce the length of error messages using `%xmode minimal`. See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n",
+    "```"
    ]
   },
@@ -92,6 +96,8 @@
    },
    "outputs": [],
    "source": [
+    "%xmode minimal\n",
+    "\n",
     "import xarray as xr\n",
     "import numpy as np\n",
@@ -106,7 +112,9 
@@ { "cell_type": "markdown", "id": "ef161fb5-15d7-4d13-831e-6407d5293bc1", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "The function we will apply is `np.interp` which expects 1D numpy arrays. This functionality is already implemented in xarray so we use that capability to make sure we are not making mistakes." ] @@ -127,7 +135,9 @@ { "cell_type": "markdown", "id": "426f5211-031e-4096-8bdc-78a0cb4c1766", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Let's define a function that works with one vector of data along `lat` at a time." ] @@ -153,7 +163,9 @@ { "cell_type": "markdown", "id": "454a9887-948a-45f0-9b7c-dc07bb542726", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "## `exclude_dims`" ] @@ -161,7 +173,9 @@ { "cell_type": "markdown", "id": "42dad673-7987-4dff-870b-c2bdff440993", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "\n", "```\n", @@ -197,7 +211,9 @@ { "cell_type": "markdown", "id": "79db343e-efcc-4fa5-a2ec-1da91bc93225", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "## Core dimensions\n", "\n" @@ -206,7 +222,9 @@ { "cell_type": "markdown", "id": "e7c81b20-83a1-4575-8e9f-bf929ecc66ba", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Core dimensions are central to using `apply_ufunc`. In our case, our function expects to receive a 1D vector along `lat` — this is the dimension that is \"core\" to the function's functionality. Multiple core dimensions are possible. 
`apply_ufunc` needs to know which dimensions of each variable are core dimensions.\n", "\n", @@ -260,7 +278,9 @@ { "cell_type": "markdown", "id": "cb20872d-06ae-400d-ada7-443ce6fa02a4", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "xarray is telling us that it expected to receive back a numpy array with 0 dimensions but instead received an array with 1 dimension corresponding to `newlat`. We can fix this by specifying `output_core_dims`" ] @@ -286,7 +306,9 @@ { "cell_type": "markdown", "id": "60cc2c23-bdf9-41f1-a09a-a6df8531fd21", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Finally we get some output! Let's check that this is right\n", "\n" @@ -315,7 +337,9 @@ { "cell_type": "markdown", "id": "3c7ea6dd-260e-4315-9d6d-3dbe2af2538e", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "No errors are raised so it is right!" ] @@ -323,7 +347,9 @@ { "cell_type": "markdown", "id": "3441f26d-8076-47a4-82c1-108c4c870cb9", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "## Vectorization with `np.vectorize`" ] @@ -331,7 +357,9 @@ { "cell_type": "markdown", "id": "121aeb07-d444-4394-a4ef-9199dc2725a5", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Now our function currently only works on one vector of data which is not so useful given our 3D dataset.\n", "Let's try passing the whole dataset. We add a `print` statement so we can see what our function receives." @@ -388,10 +416,13 @@ " almost always slower than supplying a pre-vectorized function.\n", " Using this option requires NumPy version 1.12 or newer.\n", " \n", - "Also see the documentation for `np.vectorize`: . Most importantly\n", + "\n", + "```{warning}\n", + "Also see the numpy documentation for [vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html). 
Most importantly\n", "\n", " The vectorize function is provided primarily for convenience, not for performance. \n", - " The implementation is essentially a for loop." + " The implementation is essentially a for loop.\n", + "```" ] }, { @@ -427,7 +458,9 @@ { "cell_type": "markdown", "id": "a0e1347f-3473-4fb6-9506-de46e314ba8c", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "This unfortunately is another cryptic error from numpy. \n", "\n", @@ -467,7 +500,10 @@ { "cell_type": "markdown", "id": "d81f399e-1649-4d4b-ad28-81cba8403210", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "Notice that the printed input shapes are all 1D and correspond to one vector along the `lat` dimension.\n", "\n", diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index 4dadebb2..66b7126b 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -13,8 +13,8 @@ "We've seen how to use `apply_ufunc` to handle relatively simple functions that transform every element, or reduce along a single dimension.\n", "\n", "This lesson will show you how to handle cases where the output is more complex in two ways:\n", - "1. Handle adding a new dimension by speicfying `output_core_dims`\n", - "1. Handling the change in size of an existing dimension by specifying `exclude_dims`\n", + "1. Handle adding a new dimension by specifying `output_core_dims`\n", + "1. Handling the change in size of an existing dimension by specifying `exclude_dims` in addition to `output_core_dims`\n", "\n", "\n", "## Introduction\n", @@ -321,22 +321,22 @@ "source": [ "```{tip} Exercise\n", "\n", - "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the following use of `apply_ufunc` generalizes to an array with more than one dimension? 
Try it with `air3d = xr.tutorial.load_dataset(\"air_temperature\").air.isel(time=0)`\n", + "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the following use of `apply_ufunc` generalizes to an array with more than one dimension? Try it with `air3d = xr.tutorial.load_dataset(\"air_temperature\").air`\n", + "```\n", + "\n", + "```{tip} Solution\n", + ":class: dropdown\n", + "\n", + "We want to use `minmax` to compute the minimum and maximum along the \"lat\" dimension always, regardless of how many dimensions are on the input. So we specify `input_core_dims=[[\"lat\"]]`. The output does not contain the \"lat\" dimension, but we expect two returned variables. So we pass an empty list `[]` for each returned array, so `output_core_dims=[[], []]` just as before.\n", + "\n", "\n", "```python\n", "minda, maxda = xr.apply_ufunc(\n", " minmax,\n", - " air2d,\n", + " air3d,\n", " input_core_dims=[[\"lat\"]],\n", " output_core_dims=[[],[]],\n", ")\n", - "```\n", - "\n", - "```{tip} Solution\n", - ":class: dropdown\n", - "\n", - "We want to use `minmax` to compute the minimum and maximum along the \"lat\" dimension always, regardless of how many dimensions are on the input. So we specify `input_core_dims=[[\"lat\"]]`. The output does not contain the \"lat\" dimension, but we expect two returned variables. So we pass an empty list `[]` for each returned array, so `output_core_dims=[[], []]`\n", - "\n", "```" ] } diff --git a/advanced/apply_ufunc/vectorize_1d.ipynb b/advanced/apply_ufunc/example-interp.ipynb similarity index 94% rename from advanced/apply_ufunc/vectorize_1d.ipynb rename to advanced/apply_ufunc/example-interp.ipynb index 3932a5e8..1fd9cec6 100644 --- a/advanced/apply_ufunc/vectorize_1d.ipynb +++ b/advanced/apply_ufunc/example-interp.ipynb @@ -44,6 +44,12 @@ "1. Automatic vectorizing or looping over dimensions that are not core dimensions using `vectorize=True`\n", "1. 
Automatic parallelization with dask arrays using `dask=\"parallelized\"`\n",
     "\n",
+    "\n",
+    "\n",
+    "```{tip}\n",
+    "We'll reduce the length of error messages in this tutorial using `%xmode minimal`. See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n",
+    "```\n",
+    "\n",
     "## Load data\n",
     "\n",
     "First lets load an example dataset"
    ]
@@ -55,6 +61,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "%xmode minimal\n",
+    "\n",
     "import xarray as xr\n",
     "import numpy as np\n",
@@ -70,7 +78,9 @@
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
   "source": [
    "The function we will apply is `np.interp` which expects 1D numpy arrays. This functionality is already implemented in xarray so we use that capability to make sure we are not making mistakes."
   ]
@@ -87,7 +97,9 @@
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
   "source": [
    "Let's define a function that works with one vector of data along `lat` at a time."
   ]
@@ -111,7 +123,9 @@
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
   "source": [
    "No errors are raised so our interpolation is working.\n",
    "\n",
@@ -148,21 +162,27 @@
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
   "source": [
    "`apply_ufunc` needs to know a lot of information about what our function does so that it can reconstruct the outputs. In this case, the size of dimension lat has changed and we need to explicitly specify that this will happen. xarray helpfully tells us that we need to specify the kwarg `exclude_dims`."
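As a preview, a self-contained sketch of the fix: the same `np.interp` call with `exclude_dims` supplied. A synthetic 1D array stands in for the tutorial data here, so the values and sizes are illustrative:

```python
import numpy as np
import xarray as xr

# synthetic stand-in for a single vector of the tutorial data along "lat"
air = xr.DataArray(
    np.linspace(250.0, 300.0, 25),
    dims="lat",
    coords={"lat": np.linspace(15, 75, 25)},
)
newlat = np.linspace(15, 75, 100)

interped = xr.apply_ufunc(
    np.interp,  # np.interp(x, xp, fp)
    newlat,     # x: the new coordinates
    air.lat,    # xp: the original coordinates
    air,        # fp: the data values
    input_core_dims=[["lat"], ["lat"], ["lat"]],
    output_core_dims=[["lat"]],
    exclude_dims={"lat"},  # "lat" may change size; its coordinates are dropped
)
```

Without `exclude_dims={"lat"}` this call raises the size-change error discussed above; with it, the result has a `lat` dimension of size 100 and no `lat` coordinate.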
] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "## `exclude_dims`" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "\n", "```\n", @@ -196,7 +216,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "## Core dimensions\n", "\n" @@ -204,7 +226,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Core dimensions are central to using `apply_ufunc`. In our case, our function expects to receive a 1D vector along `lat` — this is the dimension that is \"core\" to the function's functionality. Multiple core dimensions are possible. `apply_ufunc` needs to know which dimensions of each variable are core dimensions.\n", "\n", @@ -256,7 +280,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "xarray is telling us that it expected to receive back a numpy array with 0 dimensions but instead received an array with 1 dimension corresponding to `newlat`. We can fix this by specifying `output_core_dims`" ] @@ -280,7 +306,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Finally we get some output! Let's check that this is right\n", "\n" @@ -307,21 +335,27 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "No errors are raised so it is right!" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "## Vectorization with `np.vectorize`" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Now our function currently only works on one vector of data which is not so useful given our 3D dataset.\n", "Let's try passing the whole dataset. 
We add a `print` statement so we can see what our function receives." @@ -357,7 +391,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", "\n", @@ -410,7 +446,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "This unfortunately is another cryptic error from numpy. \n", "\n", @@ -448,7 +486,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Notice that the printed input shapes are all 1D and correspond to one vector along the `lat` dimension.\n", "\n", @@ -463,7 +503,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "## Parallelization with dask\n", "\n" @@ -471,7 +513,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "So far our function can only handle numpy arrays. A real benefit of `apply_ufunc` is the ability to easily parallelize over dask chunks _when needed_. \n", "\n", @@ -511,7 +555,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Yay! our function is receiving 1D vectors, so we've successfully parallelized applying a 1D function over a block. If you have a distributed dashboard up, you should see computes happening as equality is checked.\n", "\n" @@ -519,7 +565,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "## High performance vectorization: gufuncs, numba & guvectorize\n", "\n" @@ -527,7 +575,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "`np.vectorize` is a very convenient function but is unfortunately slow. 
It is only marginally faster than writing a for loop in Python and looping. A common way to get around this is to write a base interpolation function that can handle nD arrays in a compiled language like Fortran and then pass that to `apply_ufunc`.\n", "\n", @@ -565,7 +615,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "The warnings are about object-mode compilation relating to the `print` statement. This means we don't get much speed up: https://numba.pydata.org/numba-doc/latest/user/performance-tips.html#no-python-mode-vs-object-mode. We'll keep the `print` statement temporarily to make sure that `guvectorize` acts like we want it to." ] @@ -594,7 +646,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Yay! Our function is receiving 1D vectors and is working automatically with dask arrays. Finally let's comment out the print line and wrap everything up in a nice reusable function" ] @@ -643,7 +697,10 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "This technique is generalizable to any 1D function." 
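For reference, the reusable wrapper described in this section can also be written with `vectorize=True` (plain `np.vectorize`) in place of a compiled `numba.guvectorize` kernel — slower, but dependency-free. The names `xr_interp`, `interp1d_np`, and `__newdim__` follow the notebook's conventions, but treat the exact signature as an assumption:

```python
import numpy as np
import xarray as xr


def interp1d_np(data, x, xi):
    # 1D kernel: data sampled at x, interpolated onto the new points xi
    return np.interp(xi, x, data)


def xr_interp(data, dim, newcoord):
    result = xr.apply_ufunc(
        interp1d_np,
        data,
        data[dim],
        newcoord,
        # the new points get their own placeholder dimension name so the
        # vectorized function can infer the output size from them
        input_core_dims=[[dim], [dim], ["__newdim__"]],
        output_core_dims=[["__newdim__"]],
        exclude_dims={dim},    # `dim` changes size between input and output
        vectorize=True,        # loop over all non-core dimensions
        dask="parallelized",   # no-op for numpy inputs, parallel for dask
        output_dtypes=[data.dtype],
    )
    # restore the dimension name and attach the new coordinate
    return result.rename({"__newdim__": dim}).assign_coords({dim: newcoord})
```

Swapping the kernel for a `guvectorize`-compiled equivalent changes only the first argument to `apply_ufunc`; the core-dimension bookkeeping stays the same.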
]
diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb
index 3d4c6254..7b653328 100644
--- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb
+++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb
@@ -3,7 +3,10 @@
 {
    "cell_type": "markdown",
    "id": "7a8ff28a-4be3-469f-8cf4-9297e71cc4ca",
-   "metadata": {},
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
    "source": [
     "# Handling dask arrays\n",
     "\n",
@@ -14,6 +17,10 @@
     "Learning goals:\n",
     "- Learn that `apply_ufunc` can automate aspects of applying computation functions on dask arrays\n",
     "\n",
+    "```{tip}\n",
+    "We'll reduce the length of error messages using `%xmode minimal`. See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n",
+    "```\n",
+    "\n",
     "## Setup"
    ]
   },
@@ -24,6 +31,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "%xmode minimal\n",
+    "\n",
     "import dask\n",
     "import numpy as np\n",
     "import xarray as xr"
@@ -32,7 +41,9 @@
 {
    "cell_type": "markdown",
    "id": "e511efdf-1f39-4dcf-a111-660eeca2eb8c",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
    "source": [
     "First lets set up a `LocalCluster` using [dask.distributed](https://distributed.dask.org/).\n",
     "\n",
@@ -57,7 +68,9 @@
 {
    "cell_type": "markdown",
    "id": "9aa9663b-028a-4639-be90-5576f88d1bfa",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
    "source": [
     "

👆

Click the Dashboard link above. Or click the \"Search\" button in the dashboard.\n", "\n", @@ -79,7 +92,9 @@ { "cell_type": "markdown", "id": "e6abae80-004a-481f-9b1a-c476de951ef0", - "metadata": {}, + "metadata": { + "user_expressions": [] + }, "source": [ "Let's open a dataset. We specify `chunks` so that we create a dask arrays for the DataArrays" ] @@ -98,13 +113,41 @@ { "cell_type": "markdown", "id": "a6b21b09-3ebb-4efb-9c57-44bf587ba92d", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "## A simple example\n", "\n", "All the concepts from applying numpy functions carry over.\n", "\n", - "However the handling of dask arrays needs to be explicitly activated." + "However the handling of dask arrays needs to be explicitly activated.\n", + "\n", + "There are three options for the `dask` kwarg.\n", + "\n", + "```\n", + " dask : {\"forbidden\", \"allowed\", \"parallelized\"}, default: \"forbidden\"\n", + " How to handle applying to objects containing lazy data in the form of\n", + " dask arrays:\n", + "\n", + " - 'forbidden' (default): raise an error if a dask array is encountered.\n", + " - 'allowed': pass dask arrays directly on to ``func``. Prefer this option if\n", + " ``func`` natively supports dask arrays.\n", + " - 'parallelized': automatically parallelize ``func`` if any of the\n", + " inputs are a dask array by using :py:func:`dask.array.apply_gufunc`. Multiple output\n", + " arguments are supported. Only use this option if ``func`` does not natively\n", + " support dask arrays (e.g. converts them to numpy arrays).\n", + "```\n", + "\n", + "We will work through the following two:\n", + "\n", + "1. `dask=\"allowed\"` Dask arrays are passed to the user function. This is a good\n", + " choice if your function can handle dask arrays and won't compute the result unless \n", + " explicitly requested.\n", + "2. `dask=\"parallelized\"`. 
This applies the user function over blocks of the dask\n",
+    "   array using `dask.array.apply_gufunc`. This is useful when your function cannot\n",
+    "   handle dask arrays natively (e.g. scipy API)."
   ]
  },
 {
@@ -128,20 +171,41 @@
  },
  {
    "cell_type": "markdown",
-   "id": "28f4f80d-37cf-416a-a32d-9186c8df5f2d",
-   "metadata": {},
+   "id": "5e2c1b9f-f436-414c-944b-fa134837ee32",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    " \n",
+    "A good thing to check is whether the applied function (here `squared_error`) can handle pure dask arrays. \n",
+    "To do this call `squared_error(ds.air.data, 1)` and make sure of the following:\n",
+    "1. That you don't see any activity on the dask dashboard\n",
+    "2. That the returned result is a dask array."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b2b6ffab-5b27-4169-b4c3-ce605707ba9d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "squared_error(ds.air.data, 1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a4c17b34-7fd9-45ef-9f71-041fc8b16fdf",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
    "source": [
-    "There are two options for the `dask` kwarg.\n",
-    "\n",
-    "1. `dask=\"allowed\"` Dask arrays are passed to the user function. This is a good\n",
-    "   choice if your function can handle dask arrays and won't call compute\n",
-    "   explicitly.\n",
-    "2. `dask=\"parallelized\"`. This applies the user function over blocks of the dask\n",
-    "   array using `dask.array.blockwise`. This is useful when your function cannot\n",
-    "   handle dask arrays natively (e.g. scipy API).\n",
-    "\n",
     "Since `squared_error` can handle dask arrays without computing them, we specify\n",
-    "`dask=\"allowed\"`.\n"
+    "Since `squared_error` can handle dask arrays without computing them, we specify\n",
+    "`dask=\"allowed\"`."
]
  },
@@ -163,11 +227,16 @@
  {
   "cell_type": "markdown",
   "id": "c1418c06-e1f8-4db8-bdf5-a4e23f6524e1",
-   "metadata": {},
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
   "source": [
+    "### Understanding what's happening\n",
+    "\n",
    "Let's again use the wrapper trick to understand what `squared_error` receives.\n",
    "\n",
-    "We see that it receives a dask array."
+    "We see that it receives a dask array (analogous to the numpy array in the previous example)."
   ]
  },
@@ -189,7 +258,10 @@
 {
   "cell_type": "markdown",
   "id": "ce1b0d1c-8bf3-4ddb-aa5f-e2ca70415ad6",
-   "metadata": {},
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
   "source": [
    "## Reductions and core dimensions\n",
    "\n",
@@ -209,17 +281,15 @@
    "This means that for functions that accept an `axis` argument, you usually need\n",
    "to set `axis=-1`\n",
    "\n",
-    "Let's use `dask.array.mean` as an example of a function that can handle dask\n",
-    "arrays and uses an `axis` kwarg\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "447f79cd-8ee2-416f-9ebe-b1b4317583d0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "```{tip} Exercise\n",
+    "\n",
+    "Use `dask.array.mean` as an example of a function that can handle dask\n",
+    "arrays and uses an `axis` kwarg. 
\n",
+    "```\n",
+    "\n",
+    "````{tip} Solution\n",
+    ":class: dropdown\n",
+    "```python\n",
 "def time_mean(da):\n",
 "    return xr.apply_ufunc(\n",
 "        dask.array.mean,\n",
 "        da,\n",
@@ -230,15 +300,19 @@
 "    )\n",
 "\n",
 "\n",
-    "time_mean(ds.air)"
+    "time_mean(ds.air)\n",
+    "```\n````"
   ]
  },
 {
   "cell_type": "markdown",
   "id": "7952ceb1-8402-41d1-8813-31228c3e9ae6",
-   "metadata": {},
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
   "source": [
-    "Again, this identical to the built-in `mean`"
+    "Again, this is identical to the built-in `mean`"
   ]
  },
@@ -248,19 +322,34 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "def time_mean(da):\n",
+    "    return xr.apply_ufunc(\n",
+    "        dask.array.mean,\n",
+    "        da,\n",
+    "        input_core_dims=[[\"time\"]],\n",
+    "        dask=\"allowed\",\n",
+    "        kwargs={\"axis\": -1},  # core dimensions are moved to the end\n",
+    "    )\n",
+    "\n",
+    "\n",
    "ds.air.mean(\"time\").identical(time_mean(ds.air))"
   ]
  },
 {
   "cell_type": "markdown",
   "id": "8abf7aa7-e1e6-4935-9d59-58d4f587135c",
-   "metadata": {},
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
   "source": [
    "## Automatically parallelizing dask-unaware functions\n",
    "\n",
    "A very useful `apply_ufunc` feature is the ability to apply arbitrary functions\n",
    "in parallel to each block. This ability can be activated using\n",
-    "`dask=\"parallelized\"`. Again xarray needs a lot of extra metadata, so depending\n",
+    "`dask=\"parallelized\"`. 
\n", + "\n", + "Again xarray needs a lot of extra metadata, so depending\n", "on the function, extra arguments such as `output_dtypes` and `output_sizes` may\n", "be necessary.\n", "\n", @@ -280,7 +369,20 @@ "import scipy as sp\n", "import scipy.integrate\n", "\n", - "sp.integrate.trapz(ds.air.data, axis=ds.air.get_axis_num(\"lon\")) # does NOT return a dask array" + "sp.integrate.trapz(\n", + " ds.air.data, axis=ds.air.get_axis_num(\"lon\")\n", + ") # does NOT return a dask array, you should see activity on the dashboard" + ] + }, + { + "cell_type": "markdown", + "id": "f518099f-7778-4f0a-9d40-950968c651d5", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "Let's activate automatic parallelization with `dask=\"parallelized\"`" ] }, { @@ -290,13 +392,148 @@ "metadata": {}, "outputs": [], "source": [ - "xr.apply_ufunc(\n", + "integrated = xr.apply_ufunc(\n", " sp.integrate.trapz,\n", " ds,\n", " input_core_dims=[[\"lon\"]],\n", " kwargs={\"axis\": -1},\n", " dask=\"parallelized\",\n", - ")" + ")\n", + "integrated" + ] + }, + { + "cell_type": "markdown", + "id": "3ec8e13e-77e0-4e46-a32d-dafa5da6fa6a", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "And make sure the returned data is a dask array" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "95c9e6e4-5011-4968-84f7-befe25c049bb", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "integrated.air.data" + ] + }, + { + "cell_type": "markdown", + "id": "09e87266-2ff7-452d-b96d-d551d55c5e4e", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "We'll define a function `integrate_lon` for later use:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "689909bf-884e-4dbd-9e28-ed2cb68f80b8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def integrate_lon(ds):\n", + " return xr.apply_ufunc(\n", + " sp.integrate.trapz,\n", + " ds,\n", + " 
input_core_dims=[[\"lon\"]],\n",
+    "        kwargs={\"axis\": -1},\n",
+    "        dask=\"parallelized\",\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a348a4c4-b36f-4ad8-8549-fce35ef429dd",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "### Understanding what's happening\n",
+    "\n",
+    "\n",
+    "It is very important to understand what `dask=\"parallelized\"` does. It is quite convenient and works well for \"blockwise\" or \"embarrassingly parallel\" operations.\n",
+    "\n",
+    "These are operations where one block or chunk of the output array corresponds to one block or chunk of the input array. Specifically, what matters is the chunking along the _core dimension_.\n",
+    "\n",
+    "Let's look at the dask repr for `ds` and note that the chunk sizes are (100, 25, 53) for an array with shape (2920, 25, 53). This means that each block or chunk of the array contains all `lat`, `lon` points and a subset of `time` points."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "45d737df-8a6b-4244-bef7-c3464a1b65da",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "ds.air.data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "82b66790-4c7c-4902-aa3b-89f92c5641b5",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "The core dimension for `trapz` is `lon`, and there is only one chunk along `lon`. This means that integrating along `lon` is a \"blockwise\" or \"embarrassingly parallel\" operation and `dask=\"parallelized\"` works quite well. 
\n",
+    "\n",
+    "```{tip} Question\n",
+    "Do you understand why `integrate_lon(ds)` when `ds` has a single chunk along `lon` is an \"embarrassingly parallel\" operation?\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4b4707d4-f596-47a8-8a3b-6a4a157f0759",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "```{tip} Exercise\n",
+    "Apply the `integrate_lon` function to `ds` after rechunking to have a different chunksize along `lon` using `ds.chunk(lon=4)` (for example). What happens?\n",
+    "```\n",
+    "```{tip} Solution\n",
+    ":class: dropdown\n",
+    "\n",
+    "`apply_ufunc` complains that it cannot automatically parallelize because the dataset `ds` is now chunked along the core dimension `lon`. You should see the following error:\n",
+    "\n",
+    "    ValueError: dimension lon on 0th function argument to apply_ufunc with dask='parallelized' \n",
+    "    consists of multiple chunks, but is also a core dimension. To fix, either rechunk \n",
+    "    into a single array chunk along this dimension, i.e., \n",
+    "    ``.chunk(dict(lon=-1))``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` \n",
+    "    but beware that this may significantly increase memory usage.\n",
+    "\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c3618181-7432-409e-95e7-d7bc4cc567ce",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "## Clean up the cluster"
   ]
  },
@@ -312,7 +549,9 @@
 {
   "cell_type": "markdown",
   "id": "71492531-4eda-4c47-86e5-dad033c22751",
-  "metadata": {},
+  "metadata": {
+   "user_expressions": []
+  },
   "source": [
    "## More\n",
    "\n",
diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb
index 28e3a2a4..93d1f44d 100644
--- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb
+++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb
@@ -11,7 +11,7 @@
    "# A gentle introduction\n",
    "\n",
    "Many, but not all, useful array methods are wrapped by Xarray and accessible\n",
-   "as 
methods on Xarray objects. For example `DataArray.mean` calls `numpy.mean`.\n", + "as methods on Xarray objects. For example `DataArray.mean` calls `numpy.nanmean`.\n", "A very common use-case is to apply functions that expect and return NumPy \n", "(or other array types) on Xarray objects. For example, this would include all of SciPy's API. \n", "Applying many of these functions to Xarray object involves a series of repeated steps.\n", From 5f7bf8e9efc6599ec24eda2eec44be79102f655a Mon Sep 17 00:00:00 2001 From: dcherian Date: Wed, 21 Jun 2023 16:28:47 -0600 Subject: [PATCH 08/38] fix toc --- _toc.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_toc.yml b/_toc.yml index 3773f2ed..982389b7 100644 --- a/_toc.yml +++ b/_toc.yml @@ -50,7 +50,7 @@ parts: - file: advanced/apply_ufunc/complex-output-numpy - file: advanced/apply_ufunc/automatic-vectorizing-numpy - file: advanced/apply_ufunc/simple_dask_apply_ufunc - - file: advanced/apply_ufunc/vectorize_1d + - file: advanced/apply_ufunc/example-interp - file: advanced/map_blocks/map_blocks.md sections: - file: advanced/map_blocks/simple_map_blocks From a31526a1ae51f066907f811b500c0100a6b0f1a3 Mon Sep 17 00:00:00 2001 From: dcherian Date: Thu, 22 Jun 2023 13:30:36 -0600 Subject: [PATCH 09/38] WIP --- .../automatic-vectorizing-numpy.ipynb | 3 +- .../apply_ufunc/complex-output-numpy.ipynb | 15 ++-- .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 3 +- .../simple_numpy_apply_ufunc.ipynb | 68 +++++++++++++++---- 4 files changed, 62 insertions(+), 27 deletions(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index 949bd1b6..67f6c113 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -398,8 +398,7 @@ "cell_type": "markdown", "id": "f4873ac9-7000-4a0a-9d8f-e425e2bf2671", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ 
"That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index 66b7126b..3d6d8426 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -121,8 +121,7 @@ "cell_type": "markdown", "id": "c0a5b8d4-729e-4d0e-b284-4751c5edc37c", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "```{tip} Exercise\n", @@ -133,7 +132,7 @@ "def add_new_dim(array):\n", " return np.expand_dims(array, axis=0)\n", "```\n", - "```{tip} Solution\n", + "````{tip} Solution\n", ":class: dropdown\n", "\n", "``` python\n", @@ -147,7 +146,7 @@ " input_core_dims=[[\"lat\"]],\n", " output_core_dims=[[\"newdim\", \"lat\"]],\n", ")\n", - "```" + "````" ] }, { @@ -315,8 +314,7 @@ "cell_type": "markdown", "id": "b9b023e8-5ca4-436a-bdfb-3ce35f6ea712", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "```{tip} Exercise\n", @@ -324,7 +322,7 @@ "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the following use of `apply_ufunc` generalizes to an array with more than one dimension? Try it with `air3d = xr.tutorial.load_dataset(\"air_temperature\").air`\n", "```\n", "\n", - "```{tip} Solution\n", + "````{tip} Solution\n", ":class: dropdown\n", "\n", "We want to use `minmax` to compute the minimum and maximum along the \"lat\" dimension always, regardless of how many dimensions are on the input. So we specify `input_core_dims=[[\"lat\"]]`. The output does not contain the \"lat\" dimension, but we expect two returned variables. 
So we pass an empty list `[]` for each returned array, so `output_core_dims=[[], []]` just as before.\n", @@ -337,7 +335,8 @@ " input_core_dims=[[\"lat\"]],\n", " output_core_dims=[[],[]],\n", ")\n", - "```" + "```\n", + "````" ] } ], diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 7b653328..5a7c2561 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -504,8 +504,7 @@ "cell_type": "markdown", "id": "4b4707d4-f596-47a8-8a3b-6a4a157f0759", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "```{tip} Exercise\n", diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index 93d1f44d..287ea77b 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -65,9 +65,12 @@ { "cell_type": "markdown", "id": "a6b21b09-3ebb-4efb-9c57-44bf587ba92d", - "metadata": {}, + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ - "## A simple example\n", + "## A simple example: pure numpy\n", "\n", "Simple functions that act independently on each value should work without any\n", "additional arguments. \n", @@ -147,10 +150,24 @@ }, { "cell_type": "markdown", - "id": "193213d5-5644-4225-8f32-c73a20fa412d", - "metadata": {}, + "id": "b93a8aba-0427-4223-a4a3-7beb6ea4dc48", + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ - "Using `DataArray.copy` works for such simple cases but doesn't generalize that well. For example, consider a function that removed one dimension and added a new dimension.\n", + "Using `DataArray.copy` works for such simple cases but doesn't generalize that well. For example, consider a function that removed one dimension and added a new dimension." 
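Editor's note: a minimal numpy-only sketch of the kind of shape-changing function described above — one that removes a dimension and adds a new one, so its output cannot be described by copying the input. The function name `mean_with_newdim` is made up for illustration.

```python
import numpy as np


def mean_with_newdim(array):
    reduced = array.mean(axis=-1)           # removes the last dimension
    return np.expand_dims(reduced, axis=0)  # adds a new leading dimension


out = mean_with_newdim(np.ones((4, 3)))
print(out.shape)  # (1, 4)
```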
+ ] + }, + { + "cell_type": "markdown", + "id": "b5e335ab-b1c7-4b44-a373-7c9476427826", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "## apply_ufunc\n", "\n", "`apply_ufunc` can handle such cases. Here's how to use it with `squared_error`" ] @@ -167,16 +184,27 @@ }, { "cell_type": "markdown", - "id": "007c45dd-b532-41d0-8a28-c35c7448b6ba", - "metadata": {}, + "id": "d2a667ad-81ad-42e1-a46f-f991cbbf36d0", + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "## How does apply_ufunc work?\n", "\n", + "\n", + "This line\n", + "```python\n", + "xr.apply_ufunc(squared_error, ds.air, 1)\n", + "```\n", + "is equivalent to `squared_error(ds.air, 1)` with automatic propagation of xarray metadata like dimension names, coordinate values etc.\n", + "\n", + "\n", "To illustrate how `apply_ufunc` works, let us write a small wrapper function. This will let us examine what data is received and returned from the applied function. \n", "\n", "```{tip}\n", "This trick is very useful for debugging\n", - "```\n" + "```" ] }, { @@ -197,8 +225,11 @@ }, { "cell_type": "markdown", - "id": "db9bc205-7c0d-4992-bda5-f16486d29361", - "metadata": {}, + "id": "cb1abf7b-ba81-420d-9b59-dc0439bcfa10", + "metadata": { + "tags": [], + "user_expressions": [] + }, "source": [ "We see that `wrapper` receives the underlying numpy array (`ds.air.data`), and the integer `1`. \n", "\n", @@ -206,20 +237,26 @@ "1. extracts the underlying array data, \n", "2. passes it to the user function, \n", "3. receives the returned values, and \n", - "4. then wraps that back up as an array" + "4. 
then wraps that back up as an array\n", + "\n", + "```{tip}\n", + "`apply_ufunc` always takes in at least one DataArray or Dataset and returns one DataArray or Dataset\n", + "```" ] }, { "cell_type": "markdown", - "id": "79f2c642-b814-48e0-a84c-7f1fb0858c53", + "id": "f31b53ae-b6bf-4259-8c2a-67a231c9de73", "metadata": { "tags": [], "user_expressions": [] }, "source": [ + "## Handling datasets\n", + "\n", "`apply_ufunc` easily handles both DataArrays and Datasets. \n", "\n", - "When passed a Dataset, `apply_ufunc` will loop over the data variables and sequentially pass those to `squared_error`. So `squared_error` always receives a numpy array" + "When passed a Dataset, `apply_ufunc` will loop over the data variables and sequentially pass those to `squared_error`. So `squared_error` always receives a _single_ numpy array." ] }, { @@ -508,7 +545,7 @@ "\n", "Use `apply_ufunc` to apply `scipy.integrate.trapz` along the `time` axis.\n", "```\n", - "```{tip} Solution\n", + "````{tip} Solution\n", ":class: dropdown\n", "\n", "```python\n", @@ -516,7 +553,8 @@ "import scipy.integrate\n", "\n", "xr.apply_ufunc(scipy.integrate.trapz, ds, input_core_dims=[[\"time\"]], kwargs={\"axis\": -1})\n", - "```" + "```\n", + "````" ] } ], From ad40947573985a6763166175015d4fd534787b9a Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 23 Jun 2023 11:50:48 -0600 Subject: [PATCH 10/38] small edits --- .../apply_ufunc/complex-output-numpy.ipynb | 12 ++++---- advanced/apply_ufunc/example-interp.ipynb | 2 +- .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 4 ++- .../simple_numpy_apply_ufunc.ipynb | 29 +++++++++++++++++-- 4 files changed, 37 insertions(+), 10 deletions(-) diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index 3d6d8426..b259e74b 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -52,9 +52,13 @@ }, "outputs": [], "source": [ + "%xmode minimal\n", + 
"\n", "import xarray as xr\n", "import numpy as np\n", "\n", + "xr.set_options(display_expand_data=False)\n", + "\n", "air = (\n", " xr.tutorial.load_dataset(\"air_temperature\")\n", " .air.sortby(\"lat\") # np.interp needs coordinate in ascending order\n", @@ -157,6 +161,7 @@ "user_expressions": [] }, "source": [ + "(complex-output-change-size)=\n", "## Dimensions that change size\n", "\n", "Imagine that you want the output to have the same dimension name `\"lat\"` i.e. applying`np.interp` changes the size of the `\"lat\"` dimension.\n", @@ -175,9 +180,6 @@ }, "outputs": [], "source": [ - "# minimize error message\n", - "%xmode Minimal\n", - "\n", "newlat = np.linspace(15, 75, 100)\n", "\n", "xr.apply_ufunc(\n", @@ -187,9 +189,7 @@ " air,\n", " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", " output_core_dims=[[\"lat\"]],\n", - ")\n", - "\n", - "%xmode Context" + ")" ] }, { diff --git a/advanced/apply_ufunc/example-interp.ipynb b/advanced/apply_ufunc/example-interp.ipynb index 1fd9cec6..878aabcf 100644 --- a/advanced/apply_ufunc/example-interp.ipynb +++ b/advanced/apply_ufunc/example-interp.ipynb @@ -66,7 +66,7 @@ "import xarray as xr\n", "import numpy as np\n", "\n", - "xr.set_options(display_style=\"html\") # fancy HTML repr\n", + "xr.set_options(display_expand_data=False) # fancy HTML repr\n", "\n", "air = (\n", " xr.tutorial.load_dataset(\"air_temperature\")\n", diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 5a7c2561..8dce84b7 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -35,7 +35,9 @@ "\n", "import dask\n", "import numpy as np\n", - "import xarray as xr" + "import xarray as xr\n", + "\n", + "xr.set_options(display_expand_data=False)" ] }, { diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index 287ea77b..a30982eb 100644 --- 
a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -305,7 +305,7 @@ "Such functions involve the concept of \"core dimensions\". \n", "\n", "```{important}\n", - "One way to think about core dimensions is to consider the smallest dimensionality of data necessary to apply the function.\n", + "One way to think about core dimensions is to consider the smallest dimensionality of data that the function acts on.\n", "```\n", "\n", "For using more complex operations that consider some array values collectively,\n", @@ -315,7 +315,9 @@ "that core dimensions are needed is the presence of an `axis` argument on the\n", "corresponding NumPy function.\n", "\n", - "Let's write a function that computes the mean along `time` for a provided xarray object. This function requires one core dimension `time`. For `ds.air` note that `time` is the 0th axis." + "Let's write a function that computes the mean along `time` for a provided xarray object. \n", + "\n", + "This function requires one core dimension `time`. For `ds.air` note that `time` is the 0th axis." ] }, { @@ -328,6 +330,29 @@ "ds.air.dims" ] }, + { + "cell_type": "markdown", + "id": "f8edcfd8-3307-4417-bc38-19bda89df8e1", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "`get_axis_num` is a useful method." 
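Editor's note: conceptually, `get_axis_num` maps a dimension name to its axis position in the underlying array. A toy pure-Python stand-in (not xarray's actual implementation) makes the mapping explicit:

```python
# ds.air.dims for this dataset; a plain tuple of dimension names
dims = ("time", "lat", "lon")


def get_axis_num(dims, dim):
    # hypothetical stand-in: the axis number is just the position of the name
    return dims.index(dim)


print(get_axis_num(dims, "time"))  # 0
```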
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8abba2b4-a011-4dd6-846a-c174fb674233", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.air.get_axis_num(\"time\")" + ] + }, { "cell_type": "code", "execution_count": null, From 7884868cef39a28c2f02ebb95b9e7cf5560182dd Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 23 Jun 2023 11:51:01 -0600 Subject: [PATCH 11/38] finish automatic vectorizing --- .../automatic-vectorizing-numpy.ipynb | 385 +++++------------- 1 file changed, 112 insertions(+), 273 deletions(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index 67f6c113..5ce61891 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -21,15 +21,14 @@ "source": [ "Previously we looked at applying functions on numpy arrays, and the concept of core dimensions.\n", "We learned that functions commonly support specifying \"core dimensions\" through the `axis` keyword\n", - "argument. However many functions exist, that implicitly have core dimensions, but do not provide an `axis` keyword\n", - "argument. Applying such functions to a nD array usually involves one or multiple loops.\n", - "Our goal here is to learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize`\n", - "keyword argument.\n", + "argument. \n", "\n", + "However many functions exist, that implicitly have core dimensions, but do not provide an `axis` keyword\n", + "argument. 
Applying such functions to an nD array usually involves one or multiple loops over the other dimensions\n",
+    "--- termed \"loop dimensions\" or \"broadcast dimensions\".\n",
     "\n",
     "\n",
-    "## Introduction\n",
     "\n",
-    "A good example is numpy's 1D interpolate function :py:func:`numpy.interp`:\n",
+    "A good example is numpy's 1D interpolate function `numpy.interp`:\n",
     "\n",
     "```\n",
     "    Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n",
@@ -44,13 +43,19 @@
     "it to a nD array since there is no `axis` keyword argument. \n",
     "\n",
     "\n",
+    "Our goal here is to \n",
+    "1. Understand the difference between core dimensions and loop dimensions\n",
+    "1. Understand vectorization\n",
+    "1. Learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize` keyword argument.\n",
+    "\n",
     "### Core dimensions and looping\n",
-    "A common pattern to solve this problem is to loop over all the other dimensions. Let's say we want to\n",
+    "\n",
+    "Let's say we want to\n",
     "interpolate an array with two dimensions (`space`, `time`) over the `time` dimension, we might \n",
-    "1. loop over the space dimension, \n",
-    "2. subset the array to a 1D array at that `space` loction, \n",
-    "3. Interpolate the 1D arrays to the new time vector, and\n",
-    "4. Assign that new interpolated 1D array to the appropriate location of a 2D output array\n",
+    "1. loop over the `space` dimension, \n",
+    "1. subset the array to a 1D array at that `space` location, \n",
+    "1. Interpolate the 1D arrays to the new `time` vector, and\n",
+    "1. 
Assign that new interpolated 1D array to the appropriate location of a 2D output array\n", "\n", "In pseudo-code this might look like\n", "\n", @@ -60,11 +65,16 @@ "```\n", "\n", "\n", - "#### Exercise\n", + "```{tip} Exercise\n", "\n", "Consider the example problem of interpolating a 2D array with dimensions `space` and `time` along the `time` dimension.\n", "Which dimension is the core dimension, and which is the \"loop dimension\"?\n", + "```\n", + "```{tip} Solution\n", + ":class: dropdown\n", "\n", + "`time` is the core dimension, and `space` is the loop dimension.\n", + "```\n", "\n", "## Vectorization\n", "\n", @@ -101,6 +111,8 @@ "import xarray as xr\n", "import numpy as np\n", "\n", + "xr.set_options(display_expand_data=False)\n", + "\n", "air = (\n", " xr.tutorial.load_dataset(\"air_temperature\")\n", " .air.sortby(\"lat\") # np.interp needs coordinate in ascending order\n", @@ -111,153 +123,132 @@ }, { "cell_type": "markdown", - "id": "ef161fb5-15d7-4d13-831e-6407d5293bc1", + "id": "81356724-6c1a-4d4a-9a32-bb906a9419b2", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "The function we will apply is `np.interp` which expects 1D numpy arrays. This functionality is already implemented in xarray so we use that capability to make sure we are not making mistakes." + "## Review\n", + "\n", + "\n", + "We'll work with the `apply_ufunc` call from the section on [handling dimensions that change size](complex-output-change-size). See the \"Handling Complex Output\" section for how to get here.\n", + "\n", + "This version only works with 1D vectors. We will expand that to work with inputs of any number of dimensions." 
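Editor's note: the loop-dimensions-vs-core-dimension idea above can be sketched with plain numpy before involving xarray at all. `np.vectorize` with a gufunc-style `signature` automates the loops over the loop dimensions — this is essentially what `apply_ufunc(..., vectorize=True)` uses under the hood. The array values below are made up for illustration.

```python
import numpy as np

# data has dims (time, lon, lat); "lat" is the core dimension, the rest are loop dims
data = np.arange(24, dtype=float).reshape(2, 3, 4)
x = np.linspace(0, 1, 4)    # original coordinate (ascending, as np.interp requires)
xi = np.linspace(0, 1, 10)  # new coordinate to interpolate to

# 1. Explicit loops over the loop dimensions, as in the pseudo-code above
looped = np.empty((2, 3, 10))
for i in range(2):
    for j in range(3):
        looped[i, j, :] = np.interp(xi, x, data[i, j, :])

# 2. np.vectorize with a signature automates those loops:
#    xi has core dim (m), x and data have core dim (n); output has core dim (m)
vectorized_interp = np.vectorize(np.interp, signature="(m),(n),(n)->(m)")
auto = vectorized_interp(xi, x, data)

np.testing.assert_allclose(looped, auto)
print(auto.shape)  # (2, 3, 10)
```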
] }, { "cell_type": "code", "execution_count": null, - "id": "bf37f81b-f6b5-42d4-b20e-0d22e3d124d6", + "id": "cb286fa0-deba-4929-b18a-79af5acb0b5b", "metadata": { "tags": [] }, "outputs": [], "source": [ "newlat = np.linspace(15, 75, 100)\n", - "air.interp(lat=newlat)" + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air.isel(lon=0, time=0), # this version only works with 1D vectors\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")" ] }, { "cell_type": "markdown", - "id": "426f5211-031e-4096-8bdc-78a0cb4c1766", + "id": "e3382658-14e1-4842-a618-ce7a27948c31", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "Let's define a function that works with one vector of data along `lat` at a time." + "## Try nD input\n", + "\n", + "If we blindly try passing `air` (a 3D DataArray), we get a hard-to-understand error" ] }, { "cell_type": "code", "execution_count": null, - "id": "cb286fa0-deba-4929-b18a-79af5acb0b5b", - "metadata": {}, - "outputs": [], - "source": [ - "def interp1d_np(data, x, xi):\n", - " return np.interp(xi, x, data)\n", - "\n", - "\n", - "interped = interp1d_np(air.isel(time=0, lon=0), air.lat, newlat)\n", - "expected = air.interp(lat=newlat)\n", - "\n", - "# no errors are raised if values are equal to within floating point precision\n", - "np.testing.assert_allclose(expected.isel(time=0, lon=0).values, interped)" - ] - }, - { - "cell_type": "markdown", - "id": "454a9887-948a-45f0-9b7c-dc07bb542726", + "id": "1476bcce-cc7b-4252-90dd-f45502dffb09", "metadata": { - "user_expressions": [] + "tags": [] }, + "outputs": [], "source": [ - "## `exclude_dims`" + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " 
output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")" ] }, { "cell_type": "markdown", - "id": "42dad673-7987-4dff-870b-c2bdff440993", + "id": "1d1da9c2-a634-4920-890c-74d9bec9eab9", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "\n", - "```\n", - "exclude_dims : set, optional\n", - " Core dimensions on the inputs to exclude from alignment and\n", - " broadcasting entirely. Any input coordinates along these dimensions\n", - " will be dropped. Each excluded dimension must also appear in\n", - " ``input_core_dims`` for at least one argument. Only dimensions listed\n", - " here are allowed to change size between input and output objects.\n", - "```" + "We will use a \"wrapper\" function `interp1d_np` to examine what gets passed to `numpy.interp`" ] }, { "cell_type": "code", "execution_count": null, - "id": "0ce03d38-db86-49c2-a60d-46d312ad60f8", + "id": "fa306dcf-eec3-425c-b278-42d15bbc0e4f", "metadata": { - "tags": [ - "raises-exception" - ] + "tags": [] }, "outputs": [], "source": [ - "xr.apply_ufunc(\n", + "def interp1d_np(xi, x, data):\n", + " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", + " return np.interp(xi, x, data)\n", + "\n", + "\n", + "interped = xr.apply_ufunc(\n", " interp1d_np, # first the function\n", - " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", - " air.lat,\n", " newlat,\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"}, # dimensions allowed to change size. 
Must be set!\n", ")" ] }, { "cell_type": "markdown", - "id": "79db343e-efcc-4fa5-a2ec-1da91bc93225", - "metadata": { - "user_expressions": [] - }, - "source": [ - "## Core dimensions\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "e7c81b20-83a1-4575-8e9f-bf929ecc66ba", + "id": "6f5c928b-f8cb-4016-9d6d-39743f9c2976", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "Core dimensions are central to using `apply_ufunc`. In our case, our function expects to receive a 1D vector along `lat` — this is the dimension that is \"core\" to the function's functionality. Multiple core dimensions are possible. `apply_ufunc` needs to know which dimensions of each variable are core dimensions.\n", - "\n", - " input_core_dims : Sequence[Sequence], optional\n", - " List of the same length as ``args`` giving the list of core dimensions\n", - " on each input argument that should not be broadcast. By default, we\n", - " assume there are no core dimensions on any input arguments.\n", + "That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", "\n", - " For example, ``input_core_dims=[[], ['time']]`` indicates that all\n", - " dimensions on the first argument and all dimensions other than 'time'\n", - " on the second argument should be broadcast.\n", + " data: (4, 3, 25) | x: (25,) | xi: (100,)\n", "\n", - " Core dimensions are automatically moved to the last axes of input\n", - " variables before applying ``func``, which facilitates using NumPy style\n", - " generalized ufuncs [2]_.\n", - " \n", - " output_core_dims : List[tuple], optional\n", - " List of the same length as the number of output arguments from\n", - " ``func``, giving the list of core dimensions on each output that were\n", - " not broadcast on the inputs. 
By default, we assume that ``func``\n", - " outputs exactly one array, with axes corresponding to each broadcast\n", - " dimension.\n", - "\n", - " Core dimensions are assumed to appear as the last dimensions of each\n", - " output in the provided order.\n", - " \n", - "Next we specify `\"lat\"` as `input_core_dims` on both `air` and `air.lat`" + "We see that `apply_ufunc` passes the full 3D array to `interp1d_np` which in turn passes that on to `numpy.interp`. This cannot work." ] }, { "cell_type": "code", "execution_count": null, - "id": "2ced88c6-84ca-4294-9ed6-479ee353b3c8", + "id": "0ce03d38-db86-49c2-a60d-46d312ad60f8", "metadata": { "tags": [ "raises-exception" @@ -270,142 +261,30 @@ " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", " air.lat,\n", " newlat,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []],\n", " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", ")" ] }, { "cell_type": "markdown", - "id": "cb20872d-06ae-400d-ada7-443ce6fa02a4", - "metadata": { - "user_expressions": [] - }, - "source": [ - "xarray is telling us that it expected to receive back a numpy array with 0 dimensions but instead received an array with 1 dimension corresponding to `newlat`. We can fix this by specifying `output_core_dims`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b3957ac5-c622-41f9-b774-5527693d370e", - "metadata": {}, - "outputs": [], - "source": [ - "xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", - " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", - " air.lat,\n", - " newlat,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", - " output_core_dims=[[\"lat\"]],\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. 
Must be set!\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "60cc2c23-bdf9-41f1-a09a-a6df8531fd21", - "metadata": { - "user_expressions": [] - }, - "source": [ - "Finally we get some output! Let's check that this is right\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d6778fe6-5433-4f8d-bb40-2fb41b78b574", - "metadata": {}, - "outputs": [], - "source": [ - "interped = xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", - " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", - " air.lat,\n", - " newlat,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", - " output_core_dims=[[\"lat\"]],\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)" - ] - }, - { - "cell_type": "markdown", - "id": "3c7ea6dd-260e-4315-9d6d-3dbe2af2538e", - "metadata": { - "user_expressions": [] - }, - "source": [ - "No errors are raised so it is right!" - ] - }, - { - "cell_type": "markdown", - "id": "3441f26d-8076-47a4-82c1-108c4c870cb9", + "id": "737cc6b4-522f-488c-9124-524cc42ebef3", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "## Vectorization with `np.vectorize`" + "## Vectorization with `np.vectorize`\n" ] }, { "cell_type": "markdown", - "id": "121aeb07-d444-4394-a4ef-9199dc2725a5", + "id": "b6dac8da-8420-4fc4-9aeb-29b8999d4b37", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "Now our function currently only works on one vector of data which is not so useful given our 3D dataset.\n", - "Let's try passing the whole dataset. We add a `print` statement so we can see what our function receives." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9097c318-5114-4f00-a8e8-f85cdae39bf3", - "metadata": { - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "def interp1d_np(data, x, xi):\n", - " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", - " return np.interp(xi, x, data)\n", - "\n", - "\n", - "interped = xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", - " air.isel(lon=slice(3), time=slice(4)), # now arguments in the order expected by 'interp1_np'\n", - " air.lat,\n", - " newlat,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", - " output_core_dims=[[\"lat\"]],\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)" - ] - }, - { - "cell_type": "markdown", - "id": "f4873ac9-7000-4a0a-9d8f-e425e2bf2671", - "metadata": { - "tags": [] - }, - "source": [ - "That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", - "\n", - " data: (10, 53, 25) | x: (25,) | xi: (100,)\n", - "\n", - "We see that `air` has been passed as a 3D numpy array which is not what `np.interp` expects. 
Instead we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`.\n", + "Instead of passing the full 3D array we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`.\n", "`apply_ufunc` makes this easy by specifying `vectorize=True`:\n", "\n", " vectorize : bool, optional\n", @@ -427,71 +306,22 @@ { "cell_type": "code", "execution_count": null, - "id": "7f8d8f74-3282-403d-8cce-7434780b29a3", - "metadata": { - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "def interp1d_np(data, x, xi):\n", - " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", - " return np.interp(xi, x, data)\n", - "\n", - "\n", - "interped = xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", - " air, # now arguments in the order expected by 'interp1_np'\n", - " air.lat, # as above\n", - " newlat, # as above\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", - " output_core_dims=[[\"lat\"]], # returned data has one dimension\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - " vectorize=True, # loop over non-core dims\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected, interped)" - ] - }, - { - "cell_type": "markdown", - "id": "a0e1347f-3473-4fb6-9506-de46e314ba8c", + "id": "d72fdd8c-44d2-4f6e-9fc4-7084e0e49986", "metadata": { + "tags": [], "user_expressions": [] }, - "source": [ - "This unfortunately is another cryptic error from numpy. \n", - "\n", - "Notice that `newlat` is not an xarray object. Let's add a dimension name `new_lat` and modify the call. Note this cannot be `lat` because xarray expects dimensions to be the same size (or broadcastable) among all inputs. `output_core_dims` needs to be modified appropriately. 
We'll manually rename `new_lat` back to `lat` for easy checking." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "add52a3e-6c10-4761-ad39-101a58cda921", - "metadata": {}, "outputs": [], "source": [ - "def interp1d_np(data, x, xi):\n", - " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", - " return np.interp(xi, x, data)\n", - "\n", - "\n", "interped = xr.apply_ufunc(\n", " interp1d_np, # first the function\n", - " air, # now arguments in the order expected by 'interp1_np'\n", - " air.lat, # as above\n", - " newlat, # as above\n", - " input_core_dims=[[\"lat\"], [\"lat\"], [\"new_lat\"]], # list with one entry per arg\n", - " output_core_dims=[[\"new_lat\"]], # returned data has one dimension\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be a set!\n", - " vectorize=True, # loop over non-core dims\n", - ")\n", - "interped = interped.rename({\"new_lat\": \"lat\"})\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(\n", - " expected.transpose(*interped.dims), interped # order of dims is different\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n", + " vectorize=True,\n", ")\n", "interped" ] @@ -504,15 +334,24 @@ "user_expressions": [] }, "source": [ - "Notice that the printed input shapes are all 1D and correspond to one vector along the `lat` dimension.\n", + "Wow that worked!\n", + "\n", + "Notice that \n", + "1. the printed input shapes are all 1D and correspond to one vector of size 25 along the `lat` dimension.\n", + "2. `interp1d_np` was called 4x3 = 12 times which is the total number `lat` vectors since the size along `time` is 4, and the size along `lon` is 3.\n", + "3. The result `interped` is now an xarray object with coordinate values copied over from `data`. 
\n", "\n", - "The result is now an xarray object with coordinate values copied over from `data`. This is why `apply_ufunc` is so convenient; it takes care of a lot of boilerplate necessary to apply functions that consume and produce numpy arrays to xarray objects.\n", "\n", - "One final point: `lat` is now the *last* dimension in `interped`. This is a \"property\" of core dimensions: they are moved to the end before being sent to `interp1d_np` as was noted in the docstring for `input_core_dims`\n", + "```{note}\n", + "`lat` is now the *last* dimension in `interped`. This is a \"property\" of core dimensions: they are moved to the end before being sent to `interp1d_np` as noted in the docstring for `input_core_dims`\n", "\n", " Core dimensions are automatically moved to the last axes of input\n", " variables before applying ``func``, which facilitates using NumPy style\n", - " generalized ufuncs [2]_." + " generalized ufuncs [2]_.\n", + "```\n", + "\n", + "## Conclusion\n", + "This is why `apply_ufunc` is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.\n" ] } ], From 2e100c480f63ef714384aec6df6f6beef8b52ffc Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 23 Jun 2023 11:52:25 -0600 Subject: [PATCH 12/38] spellcheck --- advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 8dce84b7..62e7f31c 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -238,7 +238,7 @@ "\n", "Let's again use the wrapper trick to understand what `squared_error` receives.\n", "\n", - "We see that it receives a dask array (analagous to the numpy array in the previous example)." 
+ "We see that it receives a dask array (analogous to the numpy array in the previous example)." ] }, { @@ -465,10 +465,10 @@ "user_expressions": [] }, "source": [ - "### Understanding whats happening\n", + "### Understanding what is happening\n", "\n", "\n", - "It is very important to understand what `dask=\"parallelized\"` does. It is quite convenient and works well for \"blockwise\" or \"emabrassingly parallel\" operations.\n", + "It is very important to understand what `dask=\"parallelized\"` does. It is quite convenient and works well for \"blockwise\" or \"embarrassingly parallel\" operations.\n", "\n", "These are operations where one block or chunk of the output array corresponds to one block or chunk of the input array. Specifically, the blocks or chunks of the _core dimension_ is what matters.\n", "\n", @@ -495,7 +495,7 @@ "user_expressions": [] }, "source": [ - "The core dimension for `trapz` is `lon`, and there is only one chunk along `lon`. This means that integrating along `lon` is a \"blockwise\" or \"embarassingly parallel\" operation and `dask=\"parallelized\"` works quite well. \n", + "The core dimension for `trapz` is `lon`, and there is only one chunk along `lon`. This means that integrating along `lon` is a \"blockwise\" or \"embarrassingly parallel\" operation and `dask=\"parallelized\"` works quite well. 
\n", "\n", "```{tip} Question\n", "Do you understand why `integrate(ds)` when `ds` has a single chunk along `lon` is a \"embarassingly parallel\" operation?\n", From 62ddb3b664f25ee15bfecfed1502cb2ca5b4cfae Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 23 Jun 2023 12:27:05 -0600 Subject: [PATCH 13/38] Expect exception raising --- .../automatic-vectorizing-numpy.ipynb | 25 +++++++++---------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index 5ce61891..33cb9c25 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -4,8 +4,7 @@ "cell_type": "markdown", "id": "6849dcdc-3484-4f41-8b23-51613d36812f", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "# Automatic Vectorization" @@ -15,8 +14,7 @@ "cell_type": "markdown", "id": "afc56d28-6e55-4967-b27d-28e2cc539cc7", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "Previously we looked at applying functions on numpy arrays, and the concept of core dimensions.\n", @@ -48,7 +46,7 @@ "1. Understand vectorization\n", "1. 
Learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize` keyword argument.\n", "\n", - "### Core dimensions and looping\n", + "## Core dimensions and looping\n", "\n", "Let's say we want to\n", "interpolate an array with two dimensions (`space`, `time`) over the `time` dimension, we might \n", @@ -125,8 +123,7 @@ "cell_type": "markdown", "id": "81356724-6c1a-4d4a-9a32-bb906a9419b2", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Review\n", @@ -163,8 +160,7 @@ "cell_type": "markdown", "id": "e3382658-14e1-4842-a618-ce7a27948c31", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Try nD input\n", @@ -177,7 +173,9 @@ "execution_count": null, "id": "1476bcce-cc7b-4252-90dd-f45502dffb09", "metadata": { - "tags": [] + "tags": [ + "raises-exception" + ] }, "outputs": [], "source": [ @@ -198,8 +196,7 @@ "cell_type": "markdown", "id": "1d1da9c2-a634-4920-890c-74d9bec9eab9", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "We will use a \"wrapper\" function `interp1d_np` to examine what gets passed to `numpy.interp`" @@ -210,7 +207,9 @@ "execution_count": null, "id": "fa306dcf-eec3-425c-b278-42d15bbc0e4f", "metadata": { - "tags": [] + "tags": [ + "raises-exception" + ] }, "outputs": [], "source": [ From e7a82a063249a42450250fe7e832f58e84d621f1 Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 23 Jun 2023 12:38:09 -0600 Subject: [PATCH 14/38] fixes --- advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb | 3 ++- advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb | 5 +++-- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index 33cb9c25..63f914a1 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -14,7 +14,8 @@ "cell_type": 
"markdown", "id": "afc56d28-6e55-4967-b27d-28e2cc539cc7", "metadata": { - "tags": [] + "tags": [], + "user_expressions": [] }, "source": [ "Previously we looked at applying functions on numpy arrays, and the concept of core dimensions.\n", diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 62e7f31c..2ca72e44 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -498,7 +498,7 @@ "The core dimension for `trapz` is `lon`, and there is only one chunk along `lon`. This means that integrating along `lon` is a \"blockwise\" or \"embarrassingly parallel\" operation and `dask=\"parallelized\"` works quite well. \n", "\n", "```{tip} Question\n", - "Do you understand why `integrate(ds)` when `ds` has a single chunk along `lon` is a \"embarassingly parallel\" operation?\n", + "Do you understand why `integrate(ds)` when `ds` has a single chunk along `lon` is a \"embarrassingly parallel\" operation?\n", "```" ] }, @@ -506,7 +506,8 @@ "cell_type": "markdown", "id": "4b4707d4-f596-47a8-8a3b-6a4a157f0759", "metadata": { - "tags": [] + "tags": [], + "user_expressions": [] }, "source": [ "```{tip} Exercise\n", From d076f2196a0f145d70abe7a2a3b7d132d330ea3b Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 23 Jun 2023 14:30:08 -0600 Subject: [PATCH 15/38] cleanup --- .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 2ca72e44..07abb04c 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -497,7 +497,7 @@ "source": [ "The core dimension for `trapz` is `lon`, and there is only one chunk along `lon`. 
This means that integrating along `lon` is a \"blockwise\" or \"embarrassingly parallel\" operation and `dask=\"parallelized\"` works quite well. \n", "\n", - "```{tip} Question\n", + "```{caution} Question\n", "Do you understand why `integrate(ds)` when `ds` has a single chunk along `lon` is a \"embarrassingly parallel\" operation?\n", "```" ] @@ -547,19 +547,6 @@ "source": [ "client.close()" ] - }, - { - "cell_type": "markdown", - "id": "71492531-4eda-4c47-86e5-dad033c22751", - "metadata": { - "user_expressions": [] - }, - "source": [ - "## More\n", - "\n", - "1. https://docs.xarray.dev/en/stable/examples/apply_ufunc_vectorize_1d.html\n", - "2. https://docs.dask.org/en/latest/array-best-practices.html\n" - ] } ], "metadata": { From d9a093d5773af25899bde83f0911d1b0d522fef2 Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 11:42:22 -0600 Subject: [PATCH 16/38] small updates --- .../automatic-vectorizing-numpy.ipynb | 4 +- advanced/apply_ufunc/example-interp.ipynb | 180 ++++++++---------- .../simple_numpy_apply_ufunc.ipynb | 118 ++++++++---- 3 files changed, 162 insertions(+), 140 deletions(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index 63f914a1..8a3de279 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -351,7 +351,9 @@ "```\n", "\n", "## Conclusion\n", - "This is why `apply_ufunc` is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.\n" + "This is why `apply_ufunc` is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.\n", + "\n", + "The `vectorize` keyword argument, when set to True, will use `numpy.vectorize` to apply the function by looping over the \"loop dimensions\" --- dimensions that are not the core dimensions for the 
applied function." ] } ], diff --git a/advanced/apply_ufunc/example-interp.ipynb b/advanced/apply_ufunc/example-interp.ipynb index 878aabcf..410db120 100644 --- a/advanced/apply_ufunc/example-interp.ipynb +++ b/advanced/apply_ufunc/example-interp.ipynb @@ -3,11 +3,10 @@ { "cell_type": "markdown", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ - "# Interpolation : An end-to-end example\n", + "# np.interp : An end-to-end example\n", "\n", "**Author** [Deepak Cherian (NCAR)](https://cherian.net)" ] @@ -15,13 +14,12 @@ { "cell_type": "markdown", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "This example will illustrate how to conveniently apply an unvectorized function `func` to xarray objects using `apply_ufunc`. `func` expects 1D numpy arrays and returns a 1D numpy array. Our goal is to coveniently apply this function along a dimension of xarray objects that may or may not wrap dask arrays with a signature.\n", "\n", - "We will illustrate this using `np.interp`: \n", + "We will illustrate this using [`np.interp`](https://numpy.org/doc/stable/reference/generated/numpy.interp.html): \n", "\n", " Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n", " Docstring:\n", @@ -43,7 +41,7 @@ "1. Handling core dimensions that change size using `exclude_dims`\n", "1. Automatic vectorizing or looping over dimensions that are not core dimensions using `vectorize=True`\n", "1. Automatically parallelization with dask arrays using `dask=\"parallelized\"`\n", - "\n", + "1. 
High-performance vectorization with numba and `vectorize=False`.\n", "\n", "\n", "```{tip}\n", @@ -58,7 +56,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "%xmode minimal\n", @@ -79,7 +79,7 @@ { "cell_type": "markdown", "metadata": { - "user_expressions": [] + "tags": [] }, "source": [ "The function we will apply is `np.interp` which expects 1D numpy arrays. This functionality is already implemented in xarray so we use that capability to make sure we are not making mistakes." @@ -88,7 +88,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "newlat = np.linspace(15, 75, 100)\n", @@ -97,9 +99,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "Let's define a function that works with one vector of data along `lat` at a time." ] @@ -107,7 +107,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "def interp1d_np(data, x, xi):\n", @@ -123,9 +125,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "No errors are raised so our interpolation is working.\n", "\n", @@ -162,9 +162,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "`apply_ufunc` needs to know a lot of information about what our function does so that it can reconstruct the outputs. In this case, the size of dimension lat has changed and we need to explicitly specify that this will happen. xarray helpfully tells us that we need to specify the kwarg `exclude_dims`." 
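A minimal sketch of the `exclude_dims` machinery on a tiny made-up dataset (assuming `xarray` and `numpy` are importable; the sizes and values are illustrative, not from the tutorial data): because `lat` changes size between input and output, it must be listed in `exclude_dims` in addition to the core dimensions.

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the tutorial's data: values are linear in lat.
lat = np.array([10.0, 20.0, 30.0, 40.0])
da = xr.DataArray(2.0 * lat, dims="lat", coords={"lat": lat})
newlat = np.array([15.0, 25.0, 35.0])

# "lat" is a core dimension whose size changes (4 -> 3), so it must
# appear in exclude_dims as well as input_core_dims/output_core_dims.
interped = xr.apply_ufunc(
    np.interp,  # signature: np.interp(x, xp, fp)
    newlat,
    da.lat,
    da,
    input_core_dims=[["lat"], ["lat"], ["lat"]],
    output_core_dims=[["lat"]],
    exclude_dims={"lat"},
)
print(interped.values)  # [30. 50. 70.]
```

Note that coordinates along excluded dimensions are dropped, so the result has a size-3 `lat` dimension with no `lat` coordinate until one is assigned manually.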
] @@ -172,18 +170,11 @@ { "cell_type": "markdown", "metadata": { - "user_expressions": [] - }, - "source": [ - "## `exclude_dims`" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "user_expressions": [] + "tags": [] }, "source": [ + "## `exclude_dims`\n", + "\n", "\n", "```\n", "exclude_dims : set, optional\n", @@ -216,9 +207,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "## Core dimensions\n", "\n" @@ -226,9 +215,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "Core dimensions are central to using `apply_ufunc`. In our case, our function expects to receive a 1D vector along `lat` — this is the dimension that is \"core\" to the function's functionality. Multiple core dimensions are possible. `apply_ufunc` needs to know which dimensions of each variable are core dimensions.\n", "\n", @@ -280,9 +267,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "xarray is telling us that it expected to receive back a numpy array with 0 dimensions but instead received an array with 1 dimension corresponding to `newlat`. We can fix this by specifying `output_core_dims`" ] @@ -290,7 +275,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "xr.apply_ufunc(\n", @@ -306,9 +293,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "Finally we get some output! Let's check that this is right\n", "\n" @@ -317,7 +302,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "interped = xr.apply_ufunc(\n", @@ -335,9 +322,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "No errors are raised so it is right!" 
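Under the hood, `vectorize=True` relies on `numpy.vectorize` with a gufunc-style `signature` built from the core dimensions. A standalone sketch of that mechanism, independent of xarray (the array shapes here are made up):

```python
import numpy as np


def interp1d_np(data, x, xi):
    return np.interp(xi, x, data)


# Gufunc-style signature: `data` and `x` share core dim (n); `xi` and the
# output share core dim (m). All remaining dimensions are looped over.
vectorized = np.vectorize(interp1d_np, signature="(n),(n),(m)->(m)")

x = np.linspace(0.0, 1.0, 5)
data = np.broadcast_to(3.0 * x, (2, 4, 5))  # loop dims (2, 4), core dim n=5
xi = np.linspace(0.0, 1.0, 7)               # core dim m=7

out = vectorized(data, x, xi)
print(out.shape)  # (2, 4, 7): loop dims preserved, core dim replaced
```

This is the looping that `apply_ufunc` sets up for you from `input_core_dims` and `output_core_dims`.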
] @@ -345,17 +330,15 @@ { "cell_type": "markdown", "metadata": { - "user_expressions": [] + "tags": [] }, "source": [ - "## Vectorization with `np.vectorize`" + "## Automatic vectorization with `np.vectorize`" ] }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "Now our function currently only works on one vector of data which is not so useful given our 3D dataset.\n", "Let's try passing the whole dataset. We add a `print` statement so we can see what our function receives." @@ -384,15 +367,13 @@ " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", " output_core_dims=[[\"lat\"]],\n", " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)" + ")" ] }, { "cell_type": "markdown", "metadata": { - "user_expressions": [] + "tags": [] }, "source": [ "That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", @@ -400,6 +381,8 @@ " data: (10, 53, 25) | x: (25,) | xi: (100,)\n", "\n", "We see that `air` has been passed as a 3D numpy array which is not what `np.interp` expects. Instead we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`.\n", + "\n", + "\n", "`apply_ufunc` makes this easy by specifying `vectorize=True`:\n", "\n", " vectorize : bool, optional\n", @@ -409,10 +392,10 @@ " almost always slower than supplying a pre-vectorized function.\n", " Using this option requires NumPy version 1.12 or newer.\n", " \n", - "Also see the documentation for `np.vectorize`: . Most importantly\n", - "\n", - " The vectorize function is provided primarily for convenience, not for performance. \n", - " The implementation is essentially a for loop." 
+ "```{caution}\n", + "The documentation for [`np.vectorize`](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) points out that\n", + "\"The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.\"\n", + "```" ] }, { @@ -439,16 +422,12 @@ " output_core_dims=[[\"lat\"]], # returned data has one dimension\n", " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", " vectorize=True, # loop over non-core dims\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected, interped)" + ")" ] }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "This unfortunately is another cryptic error from numpy. \n", "\n", @@ -458,7 +437,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "def interp1d_np(data, x, xi):\n", @@ -479,16 +460,14 @@ "interped = interped.rename({\"new_lat\": \"lat\"})\n", "interped[\"lat\"] = newlat # need to add this manually\n", "xr.testing.assert_allclose(\n", - " expected.transpose(*interped.dims), interped # order of dims is different\n", - ")\n", + " expected.transpose(*interped.dims), interped\n", + ") # order of dims is different\n", "interped" ] }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "Notice that the printed input shapes are all 1D and correspond to one vector along the `lat` dimension.\n", "\n", @@ -503,9 +482,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "## Parallelization with dask\n", "\n" @@ -514,7 +491,7 @@ { "cell_type": "markdown", "metadata": { - "user_expressions": [] + "tags": [] }, "source": [ "So far our function can only handle numpy arrays. 
A real benefit of `apply_ufunc` is the ability to easily parallelize over dask chunks _when needed_. \n", @@ -523,13 +500,17 @@ " 1. `output_dtypes`: dtypes of all returned objects, and \n", " 2. `output_sizes`: lengths of any new dimensions. \n", " \n", - "Here we need to specify `output_dtypes` since `apply_ufunc` can infer the size of the new dimension `new_lat` from the argument corresponding to the third element in `input_core_dims`. Here I choose the chunk sizes to illustrate that `np.vectorize` is still applied so that our function receives 1D vectors even though the blocks are 3D." + "Here we need to specify `output_dtypes` since `apply_ufunc` can infer the size of the new dimension `new_lat` from the argument corresponding to the third element in `input_core_dims`. \n", + "\n", + "Here I choose the chunk sizes to illustrate that `np.vectorize` is still applied so that our function receives 1D vectors even though the blocks are 3D." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "def interp1d_np(data, x, xi):\n", @@ -555,9 +536,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "Yay! our function is receiving 1D vectors, so we've successfully parallelized applying a 1D function over a block. If you have a distributed dashboard up, you should see computes happening as equality is checked.\n", "\n" @@ -565,9 +544,7 @@ }, { "cell_type": "markdown", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "## High performance vectorization: gufuncs, numba & guvectorize\n", "\n" @@ -576,27 +553,34 @@ { "cell_type": "markdown", "metadata": { - "user_expressions": [] + "tags": [] }, "source": [ "`np.vectorize` is a very convenient function but is unfortunately slow. It is only marginally faster than writing a for loop in Python and looping. 
A common way to get around this is to write a base interpolation function that can handle nD arrays in a compiled language like Fortran and then pass that to `apply_ufunc`.\n", "\n", - "Another option is to use the numba package which provides a very convenient `guvectorize` decorator: https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator\n", + "Another option is to use the numba package which provides a very [convenient `guvectorize` decorator](https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator). Any decorated function gets compiled and will loop over any non-core dimension in parallel when necessary. \n", "\n", - "Any decorated function gets compiled and will loop over any non-core dimension in parallel when necessary. We need to specify some extra information:\n", + "We need to specify some extra information:\n", "\n", " 1. Our function cannot return a variable any more. Instead it must receive a variable (the last argument) whose contents the function will modify. So we change from `def interp1d_np(data, x, xi)` to `def interp1d_np_gufunc(data, x, xi, out)`. Our computed results must be assigned to `out`. All values of `out` must be assigned explicitly.\n", " \n", " 2. `guvectorize` needs to know the dtypes of the input and output. This is specified in string form as the first argument. Each element of the tuple corresponds to each argument of the function. In this case, we specify `float64` for all inputs and outputs: `\"(float64[:], float64[:], float64[:], float64[:])\"` corresponding to `data, x, xi, out`\n", " \n", - " 3. Now we need to tell numba the size of the dimensions the function takes as inputs and returns as output i.e. core dimensions. This is done in symbolic form i.e. `data` and `x` are vectors of the same length, say `n`; `xi` and the output `out` have a different length, say `m`. 
So the second argument is (again as a string)\n", - " `\"(n), (n), (m) -> (m).\"` corresponding again to `data, x, xi, out`\n" + " 3. Now we need to tell numba the size of the dimensions the function takes as inputs and returns as output i.e. _core dimensions_. This is done in symbolic form i.e. `data` and `x` are vectors of the same length, say `n`; `xi` and the output `out` have a different length, say `m`. So the second argument is (again as a string)\n", + " `\"(n), (n), (m) -> (m).\"` corresponding again to `data, x, xi, out`\n", + " \n", + "```{seealso}\n", + "\n", + "Read the [numba documentation](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) for more details.\n", + "```\n" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "from numba import float64, guvectorize\n", @@ -605,7 +589,6 @@ "@guvectorize(\"(float64[:], float64[:], float64[:], float64[:])\", \"(n), (n), (m) -> (m)\")\n", "def interp1d_np_gufunc(data, x, xi, out):\n", " # numba doesn't really like this.\n", - " # seem to support fstrings so do it the old way\n", " print(\"data: \" + str(data.shape) + \" | x:\" + str(x.shape) + \" | xi: \" + str(xi.shape))\n", " out[:] = np.interp(xi, x, data)\n", " # gufuncs don't return data\n", @@ -616,10 +599,10 @@ { "cell_type": "markdown", "metadata": { - "user_expressions": [] + "tags": [] }, "source": [ - "The warnings are about object-mode compilation relating to the `print` statement. This means we don't get much speed up: https://numba.pydata.org/numba-doc/latest/user/performance-tips.html#no-python-mode-vs-object-mode. We'll keep the `print` statement temporarily to make sure that `guvectorize` acts like we want it to." + "The warnings are about [object-mode compilation](https://numba.pydata.org/numba-doc/latest/user/performance-tips.html#no-python-mode-vs-object-mode) relating to the `print` statement. 
This means we don't get much speed up. We'll keep the `print` statement temporarily to make sure that `guvectorize` acts like we want it to." ] }, { @@ -647,10 +630,12 @@ { "cell_type": "markdown", "metadata": { - "user_expressions": [] + "tags": [] }, "source": [ - "Yay! Our function is receiving 1D vectors and is working automatically with dask arrays. Finally let's comment out the print line and wrap everything up in a nice reusable function" + "Yay! Our function is receiving 1D vectors and is working automatically with dask arrays. \n", + "\n", + "Finally let's comment out the print line and wrap everything up in a nice reusable function" ] }, { @@ -698,11 +683,10 @@ { "cell_type": "markdown", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ - "This technique is generalizable to any 1D function." + "This technique is generalizable to any 1D function that [can be compiled](https://numba.readthedocs.io/en/stable/reference/pysupported.html#pysupported) by Numba." 
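To make the machinery above concrete, here is a numpy-only sketch of the explicit looping that `np.vectorize` and `guvectorize` automate for us: apply `np.interp` along one core dimension while looping by hand over the remaining loop dimensions. The shapes and variable names below are synthetic stand-ins for illustration, not the tutorial's air-temperature dataset.

```python
import numpy as np

# Sketch of what np.vectorize / guvectorize do for a 1D core function:
# loop np.interp over the two "loop dimensions" (here called time and lon)
# and apply it along the core dimension (here called lat).
ntime, nlat, nlon = 4, 25, 3
data = np.random.rand(ntime, nlat, nlon)
x = np.linspace(25, 75, nlat)   # original coordinate along the core dimension
xi = np.linspace(25, 75, 100)   # target coordinate; the core dimension changes size

out = np.empty((ntime, len(xi), nlon))
for t in range(ntime):
    for l in range(nlon):
        # the core function only ever sees 1D vectors
        out[t, :, l] = np.interp(xi, x, data[t, :, l])

print(out.shape)  # (4, 100, 3)
```

Compiled gufuncs perform this same loop, just without the Python-level overhead of each iteration.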
] } ], diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index a30982eb..5691d729 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -4,8 +4,7 @@ "cell_type": "markdown", "id": "7a8ff28a-4be3-469f-8cf4-9297e71cc4ca", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "# A gentle introduction\n", @@ -40,7 +39,9 @@ "import numpy as np\n", "import xarray as xr\n", "\n", - "xr.set_options(display_expand_data=False)" + "# limit the amount of information printed to screen\n", + "xr.set_options(display_expand_data=False)\n", + "np.set_printoptions(threshold=10, edgeitems=2)" ] }, { @@ -66,8 +67,7 @@ "cell_type": "markdown", "id": "a6b21b09-3ebb-4efb-9c57-44bf587ba92d", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## A simple example: pure numpy\n", @@ -93,12 +93,29 @@ " return (x - y) ** 2" ] }, + { + "cell_type": "markdown", + "id": "f7566ce1-5276-4f80-a33b-e1cc149f6a90", + "metadata": {}, + "source": [ + "````{tip}\n", + "\n", + "This function uses only arithmetic operations. For such simple functions, you can pass Xarray objects directly and receive Xarray objects back.\n", + "Try\n", + "```python\n", + "squared_error(ds.air, 1)\n", + "```\n", + "\n", + "We use it here as a very simple example\n", + "````" + ] + }, { "cell_type": "markdown", "id": "7bee21ef-be9e-4393-82cf-e0fc6fe8d97e", "metadata": {}, "source": [ - "We can apply this manually by extracting the underlying numpy array" + "We can apply `squared_error` manually by extracting the underlying numpy array" ] }, { @@ -152,24 +169,26 @@ "cell_type": "markdown", "id": "b93a8aba-0427-4223-a4a3-7beb6ea4dc48", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ - "Using `DataArray.copy` works for such simple cases but doesn't generalize that well. 
For example, consider a function that removed one dimension and added a new dimension." + "```{caution}\n", + "Using `DataArray.copy` works for such simple cases but doesn't generalize that well. \n", + "\n", + "For example, consider a function that removed one dimension and added a new dimension.\n", + "```" ] }, { "cell_type": "markdown", "id": "b5e335ab-b1c7-4b44-a373-7c9476427826", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## apply_ufunc\n", "\n", - "`apply_ufunc` can handle such cases. Here's how to use it with `squared_error`" + "`apply_ufunc` can handle more complicated functions. Here's how to use it with `squared_error`" ] }, { @@ -186,8 +205,7 @@ "cell_type": "markdown", "id": "d2a667ad-81ad-42e1-a46f-f991cbbf36d0", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## How does apply_ufunc work?\n", @@ -197,7 +215,7 @@ "```python\n", "xr.apply_ufunc(squared_error, ds.air, 1)\n", "```\n", - "is equivalent to `squared_error(ds.air, 1)` with automatic propagation of xarray metadata like dimension names, coordinate values etc.\n", + "is equivalent to `squared_error(ds.air.data, 1)` with automatic propagation of xarray metadata like dimension names, coordinate values etc.\n", "\n", "\n", "To illustrate how `apply_ufunc` works, let us write a small wrapper function. This will let us examine what data is received and returned from the applied function. \n", @@ -227,17 +245,16 @@ "cell_type": "markdown", "id": "cb1abf7b-ba81-420d-9b59-dc0439bcfa10", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "We see that `wrapper` receives the underlying numpy array (`ds.air.data`), and the integer `1`. \n", "\n", "Essentially, `apply_ufunc` does the following:\n", - "1. extracts the underlying array data, \n", + "1. extracts the underlying array data (`.data`), \n", "2. passes it to the user function, \n", "3. receives the returned values, and \n", - "4. 
then wraps that back up as an array\n", + "4. then wraps that back up as a DataArray\n", "\n", "```{tip}\n", "`apply_ufunc` always takes in at least one DataArray or Dataset and returns one DataArray or Dataset\n", @@ -248,15 +265,16 @@ "cell_type": "markdown", "id": "f31b53ae-b6bf-4259-8c2a-67a231c9de73", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Handling datasets\n", "\n", "`apply_ufunc` easily handles both DataArrays and Datasets. \n", "\n", - "When passed a Dataset, `apply_ufunc` will loop over the data variables and sequentially pass those to `squared_error`. So `squared_error` always receives a _single_ numpy array." + "When passed a Dataset, `apply_ufunc` will loop over the data variables and sequentially pass those to `squared_error`.\n", + "\n", + "So `squared_error` always receives a _single_ numpy array." ] }, { @@ -283,19 +301,40 @@ "cell_type": "markdown", "id": "f8359efc-ffde-4357-a4fb-702dc064c85a", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] + }, + "source": [ + "## Passing positional and keyword arguments\n", + "\n", + "```{seealso}\n", + "See the Python tutorial on [defining functions](https://docs.python.org/3/tutorial/controlflow.html#defining-functions) for more on positional and keyword arguments.\n", + "```\n", + "\n", + "`squared_error` takes two argumens named `x` and `y`.\n", + "\n", + "In `xr.apply_ufunc(squared_error, ds.air, 1)`, the value of `1` for `y` was passed positionally. \n", + "\n", + "to use the keyword argument form, pass it using the `kwargs` keyword argument to `apply_ufunc`\n", + "> kwargs (dict, optional) – Optional keyword arguments passed directly on to call func." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "664a7004-6327-4edc-afe1-994c42d8181a", + "metadata": { + "tags": [] }, + "outputs": [], "source": [ - "## TODO: Passing args, kwargs" + "xr.apply_ufunc(squared_error, ds.air, kwargs={\"y\": 1})" ] }, { "cell_type": "markdown", "id": "ac4f4511-2535-4dd8-af2e-085a9f453e9f", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Reductions and core dimensions\n", @@ -334,8 +373,7 @@ "cell_type": "markdown", "id": "f8edcfd8-3307-4417-bc38-19bda89df8e1", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "`get_axis_num` is a useful method." @@ -406,13 +444,13 @@ "cell_type": "markdown", "id": "f7be1720-b529-444d-852e-a6f63563ac3f", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "The error here\n", "```\n", - "applied function returned data with unexpected number of dimensions. Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')\n", + "applied function returned data with unexpected number of dimensions. \n", + "Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')\n", "```\n", "\n", "means that while `np.mean` did indeed reduce one dimension, we did not tell `apply_ufunc` that this would happen. That is, we need to specify the core dimensions on the input.\n", @@ -447,8 +485,7 @@ "cell_type": "markdown", "id": "534b12aa-8a69-457b-a901-70a2358b0b71", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "This next error is a little confusing.\n", @@ -517,13 +554,11 @@ "cell_type": "markdown", "id": "81fe9fe1-83ff-4d60-bbee-fc11ecd4f73a", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ - "This illustrates an important concept: \n", "```{important}\n", - "Arrays are transposed so that core dimensions are at the end.\n", + "This illustrates an important concept. 
Arrays are transposed so that core dimensions are at the end.\n", "```\n", "\n", "With `apply_ufunc`, core dimensions are recognized by name, and then moved to\n", @@ -560,16 +595,17 @@ }, { "cell_type": "markdown", - "id": "f62f8f73-cd7f-45fe-b379-6e9b675d38a0", + "id": "600b8380-dab6-4caf-b2b9-91aee8447847", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "```{tip} Exercise\n", + ":label: trapz\n", "\n", "Use `apply_ufunc` to apply `scipy.integrate.trapz` along the `time` axis.\n", "```\n", + "\n", "````{tip} Solution\n", ":class: dropdown\n", "\n", From 7b908be13251df83d91e3257945a752803440293 Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 11:43:19 -0600 Subject: [PATCH 17/38] Add numba vectorize notebook --- _toc.yml | 1 + .../apply_ufunc/numba-vectorization.ipynb | 277 ++++++++++++++++++ 2 files changed, 278 insertions(+) create mode 100644 advanced/apply_ufunc/numba-vectorization.ipynb diff --git a/_toc.yml b/_toc.yml index 982389b7..12187904 100644 --- a/_toc.yml +++ b/_toc.yml @@ -50,6 +50,7 @@ parts: - file: advanced/apply_ufunc/complex-output-numpy - file: advanced/apply_ufunc/automatic-vectorizing-numpy - file: advanced/apply_ufunc/simple_dask_apply_ufunc + - file: advanced/apply_ufunc/numba-vectorization - file: advanced/apply_ufunc/example-interp - file: advanced/map_blocks/map_blocks.md sections: diff --git a/advanced/apply_ufunc/numba-vectorization.ipynb b/advanced/apply_ufunc/numba-vectorization.ipynb new file mode 100644 index 00000000..32150b9b --- /dev/null +++ b/advanced/apply_ufunc/numba-vectorization.ipynb @@ -0,0 +1,277 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "93a37164-1923-48bd-9393-2acb7aec1a56", + "metadata": { + "tags": [], + "user_expressions": [] + }, + "source": [ + "# Fast vectorization with Numba" + ] + }, + { + "cell_type": "markdown", + "id": "d22ea5e8-ac54-4c51-ba02-c8027140c1b1", + "metadata": { + "tags": [] + }, + "source": [ + "`np.vectorize` is a very 
convenient function but is unfortunately slow. It is only marginally faster than writing a plain for loop in Python. \n", + "\n", + "A common way to get around this is to write a base interpolation function that can handle nD arrays in a compiled language like C or Fortran and then pass that to `apply_ufunc`.\n", + "\n", + "Another option is to use the [numba package](https://numba.pydata.org/) which provides two very convenient decorators to build [numpy universal functions or ufuncs](https://numba.readthedocs.io/en/stable/user/vectorize.html):\n", + "1. [`vectorize`](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-vectorize-decorator) for functions that act on scalars, and \n", + "2. [`guvectorize`](https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator) for functions that operate on subsets of the array along core dimensions. Any decorated function gets compiled and will loop over the loop dimensions in parallel when necessary. \n", + "\n", + "For `apply_ufunc`, the key point is to leave `vectorize=False` (the default) when using Numba-vectorized functions. \n", + "Numba handles the vectorization (or looping) and `apply_ufunc` handles converting Xarray objects to bare arrays and handling metadata."
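NumPy itself ships functions built on exactly these two contracts, so inspecting them is a quick, numba-free way to ground the distinction: `np.add` is a plain element-wise ufunc, while `np.matmul` is a generalized ufunc that declares its core dimensions in a signature string. A small aside; the arrays here are arbitrary examples.

```python
import numpy as np

# A plain ufunc has no core dimensions: its signature attribute is None.
print(np.add.signature)  # None

# A generalized ufunc declares core dimensions in a signature string,
# e.g. something like '(n?,k),(k,m?)->(n?,m?)' for matmul. The remaining
# (loop) dimensions broadcast -- the same contract guvectorize lets you
# write for your own functions.
print(np.matmul.signature)

a = np.ones((5, 2, 3))  # a stack of five 2x3 matrices
b = np.ones((3, 4))     # broadcast against all five
print(np.matmul(a, b).shape)  # (5, 2, 4)
```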
+ ] + }, + { + "cell_type": "markdown", + "id": "c87db0b2-bb9f-4100-aa6f-5e73ed807182", + "metadata": {}, + "source": [ + "## Load data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff0aebbd-eca0-4d4d-bb4f-42ee646acad8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%xmode minimal\n", + "\n", + "import numpy as np\n", + "import xarray as xr\n", + "\n", + "da = xr.DataArray(\n", + " np.arange(12).reshape(3, 4),\n", + " dims=(\"x\", \"y\"),\n", + " coords={\"x\": [12, 13, 14]},\n", + " attrs={\"foo\": \"bar\"},\n", + ")\n", + "da" + ] + }, + { + "cell_type": "markdown", + "id": "b0bdb9b4-83dd-4391-a601-fbc4188ef15f", + "metadata": {}, + "source": [ + "## `vectorize`\n", + "\n", + "Our `squared_error` example from earlier works element-by-element, and is a great example for `vectorize`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49793ab4-8df7-4182-8151-9276fef789d8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from numba import vectorize, float64\n", + "\n", + "\n", + "@vectorize([float64(float64, float64)])\n", + "def squared_error(x, y):\n", + " return (x - y) ** 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1968525f-de80-441d-8c5a-e21395c832c0", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(squared_error, da, 1)" + ] + }, + { + "cell_type": "markdown", + "id": "edf740d7-ca56-42dd-86f2-d9d7c2730b15", + "metadata": {}, + "source": [ + "## `guvectorize`\n", + "\n", + "`guvectorize` is for functions that work on small subsets of the data. Quoting the Numba documentation\n", + "> While `vectorize()` allows you to write ufuncs that work on one element at a time, the `guvectorize()` decorator takes the concept one step further and allows you to write ufuncs that will work on an arbitrary number of elements of input arrays, and take and return arrays of differing dimensions. 
The typical example is a running median or a convolution filter.\n", + "\n", + "This description should remind you of `apply_ufunc`!\n", + "\n", + "We will use the example function `g` from the [numba docs](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator), which adds a scalar `y` to a 1D vector `x`. The `res` argument here will contain the output (this is a Numba detail).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3f8e96b3-d0f0-47d8-aeec-125c89642c2b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from numba import guvectorize, int64\n", + "\n", + "\n", + "@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')\n", + "def g(x, y, res):\n", + " for i in range(x.shape[0]):\n", + " res[i] = x[i] + y\n", + "\n", + "\n", + "a = np.arange(5)\n", + "g(a, 2)" + ] + }, + { + "cell_type": "markdown", + "id": "f365fb19-5fe8-4034-9c2f-6a4a8a5b1f12", + "metadata": {}, + "source": [ + "Unlike `squared_error` we cannot pass an Xarray object to `g` directly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "138e692e-36bf-45ca-8131-feeb1c1b41c4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "g(da, 1)" + ] + }, + { + "cell_type": "markdown", + "id": "e46c26c4-7377-4679-9b9e-923c154a4b88", + "metadata": {}, + "source": [ + "Now use `apply_ufunc`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "759308fa-14ef-407a-bd1d-732c7fc39c12", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " g,\n", + " da,\n", + " 1,\n", + " input_core_dims=[[\"x\"], []],\n", + " output_core_dims=[[\"x\"]],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "ba64c71d-9fe0-4479-8d3d-0b383c88e464", + "metadata": {}, + "source": [ + "Notice the following:\n", + "1. The `guvectorize` decorator includes the concept of \"core dimensions\": `'(n),()->(n)'`. 
This string means that `g` takes a 1D vector of size `n`, a scalar, and returns a 1D vector of size `n`.\n", + "2. That string translates to `input_core_dims=[[\"x\"], []], output_core_dims=[[\"x\"]]` in `apply_ufunc`.\n", + "3. We leave `vectorize=False` (the default) in `apply_ufunc` since `numba` will handle the vectorization in compiled code automatically." ] }, { "cell_type": "markdown", "id": "471853be-b718-47b8-a19f-6ada0edda7d5", "metadata": {}, "source": [ "## With dask\n" ] }, { "cell_type": "markdown", "id": "f123fc32-9a48-4910-8d85-da85a15f69f8", "metadata": {}, "source": [ "Use the chunked DataArray" ] }, { "cell_type": "code", "execution_count": null, "id": "5fe08c1c-7ef9-4d03-b57f-76047b638aa2", "metadata": { "tags": [] }, "outputs": [], "source": [ "da_dask = da.chunk({\"y\": 1})\n", "da_dask" ] }, { "cell_type": "markdown", "id": "ad018876-302f-41bf-82e4-2605b92f3d30", "metadata": {}, "source": [ "```{tip} Exercise\n", "\n", "Apply `g` to `da_dask`\n", "```\n", "````{tip} Solution\n", ":class: dropdown\n", "\n", "```python\n", "xr.apply_ufunc(\n", " g,\n", " da_dask, \n", " 1, \n", " input_core_dims=[[\"x\"], []], \n", " output_core_dims=[[\"x\"]],\n", " dask=\"parallelized\",\n", ")\n", "```\n", "````" ] }, { "cell_type": "markdown", "id": "48a3df42-23d1-4953-8b28-f0efaef75fcd", "metadata": {}, "source": [ "## Next\n", "\n", "For more, see the numpy.interp end-to-end example in the left sidebar."
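As an aside, the same signature syntax is understood by `np.vectorize` itself, which can be handy for checking the semantics of a gufunc signature in pure Python before reaching for numba. This is an illustrative sketch on toy data, not part of the original notebook.

```python
import numpy as np

def add_scalar(x, y):
    # x arrives as a full 1D core-dimension vector, y as a scalar
    return x + y

# np.vectorize accepts the same '(n),()->(n)' signature syntax as guvectorize;
# it loops in Python (slow) but reproduces the gufunc broadcasting semantics.
g_py = np.vectorize(add_scalar, signature="(n),()->(n)")

arr = np.arange(12).reshape(3, 4)  # loop dimension of size 3, core dimension of size 4
res = g_py(arr, 2)
print(res.shape)  # (3, 4)
```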
+ ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 2befb268c7b3bb1389364e2800db1a257a8846c3 Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 11:47:09 -0600 Subject: [PATCH 18/38] fix --- advanced/apply_ufunc/numba-vectorization.ipynb | 4 +++- advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb | 1 - 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/advanced/apply_ufunc/numba-vectorization.ipynb b/advanced/apply_ufunc/numba-vectorization.ipynb index 32150b9b..a77166c0 100644 --- a/advanced/apply_ufunc/numba-vectorization.ipynb +++ b/advanced/apply_ufunc/numba-vectorization.ipynb @@ -150,7 +150,9 @@ "execution_count": null, "id": "138e692e-36bf-45ca-8131-feeb1c1b41c4", "metadata": { - "tags": [] + "tags": [ + "raises-exception" + ] }, "outputs": [], "source": [ diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index 5691d729..c551ad50 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -601,7 +601,6 @@ }, "source": [ "```{tip} Exercise\n", - ":label: trapz\n", "\n", "Use `apply_ufunc` to apply `scipy.integrate.trapz` along the `time` axis.\n", "```\n", From 748eab0fcb998472a70bb52d60904ee71a7e74cf Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 11:59:06 -0600 Subject: [PATCH 19/38] Small edits --- advanced/apply_ufunc/example-interp.ipynb | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/advanced/apply_ufunc/example-interp.ipynb b/advanced/apply_ufunc/example-interp.ipynb index 410db120..dd8a5cd1 100644 --- a/advanced/apply_ufunc/example-interp.ipynb +++ b/advanced/apply_ufunc/example-interp.ipynb @@ -17,7 
+17,7 @@ "tags": [] }, "source": [ - "This example will illustrate how to conveniently apply an unvectorized function `func` to xarray objects using `apply_ufunc`. `func` expects 1D numpy arrays and returns a 1D numpy array. Our goal is to coveniently apply this function along a dimension of xarray objects that may or may not wrap dask arrays with a signature.\n", + "This example will illustrate how to conveniently apply an unvectorized function `func` to xarray objects using `apply_ufunc`. `func` expects 1D numpy arrays and returns a 1D numpy array. Our goal is to conveniently apply this function along a dimension of xarray objects that may or may not wrap dask arrays with a signature.\n", "\n", "We will illustrate this using [`np.interp`](https://numpy.org/doc/stable/reference/generated/numpy.interp.html): \n", "\n", @@ -43,6 +43,8 @@ "1. Automatically parallelization with dask arrays using `dask=\"parallelized\"`\n", "1. High-performance vectorization with numba and `vectorize=False`.\n", "\n", + "It puts together all the concepts covered earlier.\n", + "\n", "\n", "```{tip}\n", "We'll reduce the length of error messages using in this tutorial using `%xmode minimal` See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n", From 7540f2c704c2e93415df4b29884cc4c6935f50b1 Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 12:28:37 -0600 Subject: [PATCH 20/38] add numba logo --- advanced/apply_ufunc/numba-vectorization.ipynb | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/advanced/apply_ufunc/numba-vectorization.ipynb b/advanced/apply_ufunc/numba-vectorization.ipynb index a77166c0..5238ccf5 100644 --- a/advanced/apply_ufunc/numba-vectorization.ipynb +++ b/advanced/apply_ufunc/numba-vectorization.ipynb @@ -4,10 +4,11 @@ "cell_type": "markdown", "id": "93a37164-1923-48bd-9393-2acb7aec1a56", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, 
"source": [ + "\n", + "\n", "# Fast vectorization with Numba" ] }, From 87968ba6c1bfe176fc4e0ff4c872d5a7a8c0e96a Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 12:39:21 -0600 Subject: [PATCH 21/38] small text update --- advanced/apply_ufunc/numba-vectorization.ipynb | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/advanced/apply_ufunc/numba-vectorization.ipynb b/advanced/apply_ufunc/numba-vectorization.ipynb index 5238ccf5..278e530b 100644 --- a/advanced/apply_ufunc/numba-vectorization.ipynb +++ b/advanced/apply_ufunc/numba-vectorization.ipynb @@ -89,6 +89,16 @@ " return (x - y) ** 2" ] }, + { + "cell_type": "markdown", + "id": "3c82f57c-ffe9-40ae-b429-1a273037731e", + "metadata": {}, + "source": [ + "See the numba documentation to understand `@vectorize([float64(float64, float64)])`\n", + "\n", + "Now use `apply_ufunc` to apply it." + ] + }, { "cell_type": "code", "execution_count": null, From 1fe8d87c54ca302cc8e3175c904a1084b9de0b89 Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 13:05:31 -0600 Subject: [PATCH 22/38] fix spelling --- advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index c551ad50..a07f239a 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -310,7 +310,7 @@ "See the Python tutorial on [defining functions](https://docs.python.org/3/tutorial/controlflow.html#defining-functions) for more on positional and keyword arguments.\n", "```\n", "\n", - "`squared_error` takes two argumens named `x` and `y`.\n", + "`squared_error` takes two arguments named `x` and `y`.\n", "\n", "In `xr.apply_ufunc(squared_error, ds.air, 1)`, the value of `1` for `y` was passed positionally. 
\n", "\n", From 68b91031c53dd4664397f721c656c7e51f887524 Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 14:39:38 -0600 Subject: [PATCH 23/38] Tweaks to dask material --- .../automatic-vectorizing-numpy.ipynb | 1 + .../apply_ufunc/complex-output-numpy.ipynb | 1 + .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 186 ++++++++++++++---- 3 files changed, 148 insertions(+), 40 deletions(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index 8a3de279..edb6ede4 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -7,6 +7,7 @@ "tags": [] }, "source": [ + "(vectorize)=\n", "# Automatic Vectorization" ] }, diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index b259e74b..632912c7 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -8,6 +8,7 @@ "user_expressions": [] }, "source": [ + "(complex-output)=\n", "# Handling complex output\n", "\n", "We've seen how to use `apply_ufunc` to handle relatively simple functions that transform every element, or reduce along a single dimension.\n", diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 07abb04c..81612fa2 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -4,8 +4,7 @@ "cell_type": "markdown", "id": "7a8ff28a-4be3-469f-8cf4-9297e71cc4ca", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "# Handling dask arrays\n", @@ -16,6 +15,11 @@ "\n", "Learning goals:\n", "- Learn that `apply_ufunc` can automate aspects of applying computation functions on dask arrays\n", + "- It is possible to automatically parallelize certain operations by providing 
`dask=\"parallelized\"`\n", + "- In some cases, extra information needs to be provided such as sizes of any new dimensions added, or data types for output variables.\n", + "- Learn that all the concepts from the numpy lessons carry over: like [automatic vectorization](vectorize) and specifying input and\n", + " output core dimensions.\n", + "\n", "\n", "```{tip}\n", "We'll reduce the length of error messages using `%xmode minimal` See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n", @@ -43,9 +47,7 @@ { "cell_type": "markdown", "id": "e511efdf-1f39-4dcf-a111-660eeca2eb8c", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "First lets set up a `LocalCluster` using [dask.distributed](https://distributed.dask.org/).\n", "\n", @@ -70,9 +72,7 @@ { "cell_type": "markdown", "id": "9aa9663b-028a-4639-be90-5576f88d1bfa", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "

👆

Click the Dashboard link above. Or click the \"Search\" button in the dashboard.\n", "\n", @@ -94,9 +94,7 @@ { "cell_type": "markdown", "id": "e6abae80-004a-481f-9b1a-c476de951ef0", - "metadata": { - "user_expressions": [] - }, + "metadata": {}, "source": [ "Let's open a dataset. We specify `chunks` so that we create a dask arrays for the DataArrays" ] @@ -116,8 +114,7 @@ "cell_type": "markdown", "id": "a6b21b09-3ebb-4efb-9c57-44bf587ba92d", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## A simple example\n", @@ -175,8 +172,7 @@ "cell_type": "markdown", "id": "5e2c1b9f-f436-414c-944b-fa134837ee32", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ " \n", @@ -202,8 +198,7 @@ "cell_type": "markdown", "id": "a4c17b34-7fd9-45ef-9f71-041fc8b16fdf", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "Since `squared_error` can handle dask arrays without computing them, we specify\n", @@ -230,8 +225,7 @@ "cell_type": "markdown", "id": "c1418c06-e1f8-4db8-bdf5-a4e23f6524e1", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "### Understanding what's happening\n", @@ -261,8 +255,7 @@ "cell_type": "markdown", "id": "ce1b0d1c-8bf3-4ddb-aa5f-e2ca70415ad6", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Reductions and core dimensions\n", @@ -278,6 +271,11 @@ "that core dimensions are needed is the presence of an `axis` argument on the\n", "corresponding NumPy function.\n", "\n", + "```{important}\n", + "One way to think about core dimensions is to consider the smallest dimensionality of data that the function acts on.\n", + "```\n", + "\n", + "\n", "With `apply_ufunc`, core dimensions are recognized by name, and then moved to\n", "the last dimension of any input arguments before applying the given function.\n", "This means that for functions that accept an `axis` argument, you usually need\n", @@ -310,8 +308,7 @@ 
"cell_type": "markdown", "id": "7952ceb1-8402-41d1-8813-31228c3e9ae6", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "Again, this is identical to the built-in `mean`" @@ -341,12 +338,15 @@ "cell_type": "markdown", "id": "8abf7aa7-e1e6-4935-9d59-58d4f587135c", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Automatically parallelizing dask-unaware functions\n", "\n", + "### Basics\n", + "\n", + "Not all functions can handle dask arrays appropriately by default.\n", + "\n", "A very useful `apply_ufunc` feature is the ability to apply arbitrary functions\n", "in parallel to each block. This ability can be activated using\n", "`dask=\"parallelized\"`. \n", @@ -380,8 +380,7 @@ "cell_type": "markdown", "id": "f518099f-7778-4f0a-9d40-950968c651d5", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "Let's activate automatic parallelization with `dask=\"parallelized\"`" @@ -408,8 +407,7 @@ "cell_type": "markdown", "id": "3ec8e13e-77e0-4e46-a32d-dafa5da6fa6a", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "And make sure the returned data is a dask array" @@ -431,8 +429,7 @@ "cell_type": "markdown", "id": "09e87266-2ff7-452d-b96d-d551d55c5e4e", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "We'll define a function `integrate_lon` for later use:" @@ -461,8 +458,7 @@ "cell_type": "markdown", "id": "a348a4c4-b36f-4ad8-8549-fce35ef429dd", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "### Understanding what is happening\n", @@ -491,8 +487,7 @@ "cell_type": "markdown", "id": "82b66790-4c7c-4902-aa3b-89f92c5641b5", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "The core dimension for `trapz` is `lon`, and there is only one chunk along `lon`. 
This means that integrating along `lon` is a \"blockwise\" or \"embarrassingly parallel\" operation and `dask=\"parallelized\"` works quite well. \n", @@ -506,8 +501,7 @@ "cell_type": "markdown", "id": "4b4707d4-f596-47a8-8a3b-6a4a157f0759", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "```{tip} Exercise\n", @@ -527,12 +521,122 @@ "```" ] }, + { + "cell_type": "markdown", + "id": "dfed3978-288f-48b4-a978-edfce2f3ddb8", + "metadata": {}, + "source": [ + "### Handling more complex output\n", + "\n", + "Again we use the [1D interpolation example](complex-output) that changes the size of the input along a single dimension.\n", + "\n", + "Logically, we can think of this as removing the old dimension and adding a new dimension.\n", + "\n", + "For `interp` we expect one returned output with one new core dimension that we will call `\"lat_interp\"`.\n", + "Specify this using `output_core_dims=[[\"lat_interp\"]]`\n", + "\n", + "When automatically parallelizing with `dask`, we need to provide some more information about the outputs.\n", + "1. When adding new dimensions, we need to provide the size in `dask_gufunc_kwargs` using the key `output_sizes`\n", + "2. Usually we need to provide the datatype or `dtype` of the returned array; the dtype of the input is usually a good guess."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5d82f436-65fe-484e-9295-382f6c725b80", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " ds.air.lat,\n", + " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " input_core_dims=[[\"lat_interp\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat_interp\"]],\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs=dict(output_sizes={\"lat_interp\": len(newlat)}),\n", + " output_dtypes=[ds.air.dtype],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5f019966-c296-41c8-a0f8-91bcffa7ac7f", + "metadata": {}, + "source": [ + "```{tip} Exercise\n", + "\n", + "Repeat the above computation but provide `\"lat\"` as the output core dimension. Remember that you need to provide `exclude_dims` for this case. See the numpy notebook on [complex output](complex-output) for more.\n", + "```\n", + "\n", + "````{tip} Solution\n", + ":class: dropdown\n", + "\n", + "```python\n", + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " ds.air.lat,\n", + " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs=dict(output_sizes={\"lat\": len(newlat)}),\n", + " output_dtypes=[ds.air.dtype],\n", + ")\n", + "```\n", + "\n", + "````" + ] + }, + { + "cell_type": "markdown", + "id": "9646a631-2318-43c4-923d-a5384885134a", + "metadata": {}, + "source": [ + "### Automatic Vectorizing\n", + "\n", + "[Automatic vectorizing](vectorize) with `vectorize=True` also carries over!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8accd13-1afe-4085-97a6-86e4a5bd4405", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "interped = xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " ds.air.lat,\n", + " ds.air,\n", + " input_core_dims=[[\"lat_interp\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat_interp\"]],\n", + " exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n", + " vectorize=True,\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs=dict(output_sizes={\"lat_interp\": len(newlat)}),\n", + " output_dtypes=[ds.air.dtype],\n", + ")\n", + "interped" + ] + }, { "cell_type": "markdown", "id": "c3618181-7432-409e-95e7-d7bc4cc567ce", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Clean up the cluster" @@ -542,7 +646,9 @@ "cell_type": "code", "execution_count": null, "id": "6b375c8a-b7ef-47a6-b007-1fc2f34a2cf5", - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "client.close()" From 2401c14166aade5dc0ba427cfe5ab63184995f94 Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 14:43:42 -0600 Subject: [PATCH 24/38] more tweaks --- .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 81612fa2..659ee02b 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -526,7 +526,12 @@ "id": "dfed3978-288f-48b4-a978-edfce2f3ddb8", "metadata": {}, "source": [ - "### Handling more complex output\n", + "## More complex situations\n", + "\n", + "\n", + "Here we quickly demonstrate that all the concepts from the numpy material earlier carry over.\n", + "\n", + "### Adding new dimensions\n", "\n", "Again we use the [1D interpolation 
example](complex-output) that changes the size of the input along a single dimension.\n", "\n", @@ -564,6 +569,14 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "b3c08d6e-cc7b-4d6d-b074-ff90ba99e6c7", + "metadata": {}, + "source": [ + "### Dimensions that change size" + ] + }, { "cell_type": "markdown", "id": "5f019966-c296-41c8-a0f8-91bcffa7ac7f", From 67f973d37ed3bfbba21871b5b170677780fb8bff Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 20:12:58 -0600 Subject: [PATCH 25/38] Review comments --- .../apply_ufunc/complex-output-numpy.ipynb | 41 ++++---- advanced/apply_ufunc/example-interp.ipynb | 4 +- .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 4 +- .../simple_numpy_apply_ufunc.ipynb | 94 +++++++++++++++++-- 4 files changed, 119 insertions(+), 24 deletions(-) diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index 632912c7..92ab34ec 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -4,8 +4,7 @@ "cell_type": "markdown", "id": "0203c86c-2e42-4226-acd1-1e9bcffc6708", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "(complex-output)=\n", @@ -31,23 +30,18 @@ " with given discrete data points (`xp`, `fp`), evaluated at `x`.\n", "```\n", "\n", - "This function expects a 1D array as input, and returns a 1D array as output.\n", + "This function expects a 1D array as input, and returns a 1D array as output. 
That is, `numpy.interp` has one core dimension.\n", "\n", "\n", - "```{tip} Exercise\n", - "How many core dimensions does `numpy.interp` handle?\n", - "```\n", - "```{tip} Solution\n", - ":class: dropdown\n", - "\n", - "One.\n", + "```{tip}\n", + "We'll reduce the length of error messages using `%xmode minimal` See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "18135586-b2d9-4f1d-abe2-d7afa036e2bc", + "id": "f7842446-6b7d-494b-9625-5dd04dc6e9ad", "metadata": { "tags": [] }, @@ -58,6 +52,7 @@ "import xarray as xr\n", "import numpy as np\n", "\n", + "np.set_printoptions(threshold=10, edgeitems=2)\n", "xr.set_options(display_expand_data=False)\n", "\n", "air = (\n", @@ -68,6 +63,20 @@ "air" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "936f0cb2-491b-4560-9470-d98df595b3d1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Our goal is to densify from 25 to 100 coordinate values\n", + "newlat = np.linspace(15, 75, 100)\n", + "np.interp(newlat, air.lat.data, air.data)" + ] + }, { "cell_type": "markdown", "id": "db79d3e3-2a1a-4644-82a0-d58da74e0e36", @@ -113,11 +122,11 @@ "newlat = np.linspace(15, 75, 100)\n", "\n", "xr.apply_ufunc(\n", - " np.interp, # first the function\n", - " newlat,\n", - " air.lat,\n", - " air,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " np.interp, # function to apply\n", + " newlat, # 1st input to np.interp\n", + " air.lat, # 2nd input to np.interp\n", + " air, # 3rd input to np.interp\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]], # one entry per function input, 3 in total!\n", " output_core_dims=[[\"lat_interp\"]],\n", ")" ] diff --git a/advanced/apply_ufunc/example-interp.ipynb b/advanced/apply_ufunc/example-interp.ipynb index dd8a5cd1..1664b204 100644 --- a/advanced/apply_ufunc/example-interp.ipynb +++ 
b/advanced/apply_ufunc/example-interp.ipynb @@ -68,7 +68,9 @@ "import xarray as xr\n", "import numpy as np\n", "\n", - "xr.set_options(display_expand_data=False) # fancy HTML repr\n", + "# limit the amount of information printed to screen\n", + "xr.set_options(display_expand_data=False)\n", + "np.set_printoptions(threshold=10, edgeitems=2)\n", "\n", "air = (\n", " xr.tutorial.load_dataset(\"air_temperature\")\n", diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 659ee02b..5f99ba48 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -41,7 +41,9 @@ "import numpy as np\n", "import xarray as xr\n", "\n", - "xr.set_options(display_expand_data=False)" + "# limit the amount of information printed to screen\n", + "xr.set_options(display_expand_data=False)\n", + "np.set_printoptions(threshold=10, edgeitems=2)" ] }, { diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index a07f239a..e4b779c4 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -26,6 +26,12 @@ "- Learn how to identify \"core dimensions\" for the function you're applying.\n", "- Learn that \"core dimensions\" are automatically moved or transposed to the end of the array.\n", "\n", + "\n", + "```{tip}\n", + "We'll reduce the length of error messages using `%xmode minimal` See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n", + "```\n", + "\n", + "\n", "## Setup" ] }, @@ -36,6 +42,8 @@ "metadata": {}, "outputs": [], "source": [ + "%xmode minimal\n", + "\n", "import numpy as np\n", "import xarray as xr\n", "\n", @@ -144,7 +152,14 @@ "metadata": {}, "outputs": [], "source": [ - "xr.DataArray(data=numpy_result, dims=ds.air.dims, coords=ds.air.coords)" + 
"xr.DataArray(\n", + "    data=numpy_result,\n", + "    # propagate all the Xarray metadata manually\n", + "    dims=ds.air.dims,\n", + "    coords=ds.air.coords,\n", + "    attrs=ds.air.attrs,\n", + "    name=ds.air.name,\n", + ")" ] }, { @@ -261,6 +276,50 @@ "```" ] }, + { + "cell_type": "markdown", + "id": "2b370050-1f93-49a8-b8e1-671710967315", + "metadata": {}, + "source": [ + "## Handling attributes\n", + "\n", + "By default, attributes are omitted since they may now be inaccurate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "40a480bd-ff78-45b8-a818-869c2091c1f4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "result = xr.apply_ufunc(wrapper, ds.air, 1)\n", + "result.attrs" + ] + }, + { + "cell_type": "markdown", + "id": "9169294d-08c5-413a-a87d-98e59957d102", + "metadata": {}, + "source": [ + "To propagate attributes, pass `keep_attrs=True`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07f38f4f-cd62-4790-89b0-9de654ed4057", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "result = xr.apply_ufunc(wrapper, ds.air, 1, keep_attrs=True)\n", + "result.attrs" + ] + }, { "cell_type": "markdown", "id": "f31b53ae-b6bf-4259-8c2a-67a231c9de73", @@ -274,7 +333,30 @@ "\n", "When passed a Dataset, `apply_ufunc` will loop over the data variables and sequentially pass those to `squared_error`.\n", "\n", - "So `squared_error` always receives a _single_ numpy array." + "So `squared_error` always receives a _single_ numpy array.\n", + "\n", + "To illustrate that, let's create a new `Dataset` with two arrays. We'll create a new array `air2` that is 2D `time, lat`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d1fd411-3aaa-4e86-9f3b-739c58f35ee6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds2 = ds.copy()\n", + "ds2[\"air2\"] = ds2.air.isel(lon=0) ** 2" + ] + }, + { + "cell_type": "markdown", + "id": "84a32eb2-2e6f-4a7d-897c-b221f5c3c26e", + "metadata": {}, + "source": [ + "We see that `wrapper` is called twice" ] }, { @@ -284,7 +366,7 @@ "metadata": {}, "outputs": [], "source": [ - "xr.apply_ufunc(wrapper, ds, 1)" + "xr.apply_ufunc(wrapper, ds2, 1)" ] }, { @@ -294,7 +376,7 @@ "metadata": {}, "outputs": [], "source": [ - "xr.apply_ufunc(squared_error, ds, 1)" + "xr.apply_ufunc(squared_error, ds2, 1)" ] }, { @@ -339,9 +421,9 @@ "source": [ "## Reductions and core dimensions\n", "\n", - "`squared_error` operated on a per-element basis. How about a reduction like `np.mean`? \n", + "`squared_error` operated on a per-element basis. How about a reduction like `np.mean` applied along a single axis? \n", "\n", - "Such functions involve the concept of \"core dimensions\". \n", + "Such operations involve the concept of \"core dimensions\". \n", "\n", "```{important}\n", "One way to think about core dimensions is to consider the smallest dimensionality of data that the function acts on.\n", From 7f8170d628e15515e4cdabc13be166991a03d2a4 Mon Sep 17 00:00:00 2001 From: dcherian Date: Mon, 26 Jun 2023 20:18:09 -0600 Subject: [PATCH 26/38] Migrate to exercise syntax. 
--- .../automatic-vectorizing-numpy.ipynb | 5 +++-- .../apply_ufunc/complex-output-numpy.ipynb | 10 ++++++---- advanced/apply_ufunc/example-interp.ipynb | 2 ++ .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 18 +++++++++++------- .../apply_ufunc/simple_numpy_apply_ufunc.ipynb | 5 +++-- 5 files changed, 25 insertions(+), 15 deletions(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index edb6ede4..ee3f1b62 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -65,12 +65,13 @@ "```\n", "\n", "\n", - "```{tip} Exercise\n", + "```{exercise}\n", + ":label: coreloopdims\n", "\n", "Consider the example problem of interpolating a 2D array with dimensions `space` and `time` along the `time` dimension.\n", "Which dimension is the core dimension, and which is the \"loop dimension\"?\n", "```\n", - "```{tip} Solution\n", + "```{solution} coreloopdims\n", ":class: dropdown\n", "\n", "`time` is the core dimension, and `space` is the loop dimension.\n", diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index 92ab34ec..c84e70e2 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -138,7 +138,8 @@ "tags": [] }, "source": [ - "```{tip} Exercise\n", + "```{exercise}\n", + ":label: newdim\n", "\n", "Apply the following function using `apply_ufunc`. It adds a new dimension to the input array, let's call it `newdim`. 
Specify the new dimension using `output_core_dims`\n", "\n", @@ -146,7 +147,7 @@ "def add_new_dim(array):\n", " return np.expand_dims(array, axis=0)\n", "```\n", - "````{tip} Solution\n", + "````{solution} newdim\n", ":class: dropdown\n", "\n", "``` python\n", @@ -327,12 +328,13 @@ "tags": [] }, "source": [ - "```{tip} Exercise\n", + "```{exercise}\n", + ":label: generalize\n", "\n", "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the following use of `apply_ufunc` generalizes to an array with more than one dimension? Try it with `air3d = xr.tutorial.load_dataset(\"air_temperature\").air`\n", "```\n", "\n", - "````{tip} Solution\n", + "````{solution} generalize\n", ":class: dropdown\n", "\n", "We want to use `minmax` to compute the minimum and maximum along the \"lat\" dimension always, regardless of how many dimensions are on the input. So we specify `input_core_dims=[[\"lat\"]]`. The output does not contain the \"lat\" dimension, but we expect two returned variables. So we pass an empty list `[]` for each returned array, so `output_core_dims=[[], []]` just as before.\n", diff --git a/advanced/apply_ufunc/example-interp.ipynb b/advanced/apply_ufunc/example-interp.ipynb index 1664b204..40cbe9a1 100644 --- a/advanced/apply_ufunc/example-interp.ipynb +++ b/advanced/apply_ufunc/example-interp.ipynb @@ -690,6 +690,8 @@ "tags": [] }, "source": [ + "## Summary\n", + "\n", "This technique is generalizable to any 1D function that [can be compiled](https://numba.readthedocs.io/en/stable/reference/pysupported.html#pysupported) by Numba." 
] } diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index 5f99ba48..d3e0022f 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -283,13 +283,14 @@ "This means that for functions that accept an `axis` argument, you usually need\n", "to set `axis=-1`\n", "\n", - "```{tip} Exercise\n", + "```{exercise}\n", + ":label: daskmean\n", "\n", "Use `dask.array.mean` as an example of a function that can handle dask\n", "arrays and uses an `axis` kwarg. \n", "```\n", "\n", - "```{tip} Solution\n", + "````{solution} daskmean\n", ":class: dropdown\n", "```python\n", "def time_mean(da):\n", @@ -303,7 +304,7 @@ "\n", "\n", "time_mean(ds.air)\n", - "```\n" + "````\n" ] }, { @@ -506,10 +507,12 @@ "tags": [] }, "source": [ - "```{tip} Exercise\n", + "```{exercise} \n", + ":label: rechunk\n", "Apply the integrate function to `ds` after rechunking to have a different chunksize along `lon` using `ds.chunk(lon=4)` (for example). What happens?\n", "```\n", - "```{tip} Solution\n", + "\n", + "```{solution} rechunk\n", ":class: dropdown\n", "\n", "`apply_ufunc` complains that it cannot automatically parallelize because the dataset `ds` is now chunked along the core dimension `lon`. You should see the following error:\n", @@ -584,12 +587,13 @@ "id": "5f019966-c296-41c8-a0f8-91bcffa7ac7f", "metadata": {}, "source": [ - "```{tip} Exercise\n", + "```{exercise}\n", + ":label: lat\n", "\n", "Repeat the above computation but provide `\"lat\"` as the output core dimension. Remember that you need to provide `exclude_dims` for this case. 
See the numpy notebook on [complex output](complex-output) for more.\n", "```\n", "\n", - "````{tip} Solution\n", + "````{solution} lat\n", ":class: dropdown\n", "\n", "```python\n", diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index e4b779c4..12e6d527 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -682,12 +682,13 @@ "tags": [] }, "source": [ - "```{tip} Exercise\n", + "```{exercise}\n", + ":label: trapz\n", "\n", "Use `apply_ufunc` to apply `scipy.integrate.trapz` along the `time` axis.\n", "```\n", "\n", - "````{tip} Solution\n", + "````{solution} trapz\n", ":class: dropdown\n", "\n", "```python\n", From 4a55bcb930024bca5c7329030a902c46e0ad4fdd Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 09:14:57 -0600 Subject: [PATCH 27/38] Add sphinx_exercise + review feedback --- _config.yml | 1 + .../automatic-vectorizing-numpy.ipynb | 74 +++++++------------ .../apply_ufunc/complex-output-numpy.ipynb | 12 ++- .../simple_numpy_apply_ufunc.ipynb | 2 + 4 files changed, 40 insertions(+), 49 deletions(-) diff --git a/_config.yml b/_config.yml index 3e3e184c..9cecd16b 100644 --- a/_config.yml +++ b/_config.yml @@ -65,6 +65,7 @@ sphinx: # maintain old paths and redirect them (so google results dont go to 404) # https://github.com/wpilibsuite/sphinxext-rediraffe - sphinxext.rediraffe + - sphinx_exercise config: # application/vnd.holoviews_load.v0+json, application/vnd.holoviews_exec.v0+json suppress_warnings: ['mystnb.unknown_mime_type'] diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index ee3f1b62..0e18fe63 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -15,11 +15,10 @@ "cell_type": "markdown", "id": "afc56d28-6e55-4967-b27d-28e2cc539cc7", "metadata": { 
- "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ - "Previously we looked at applying functions on numpy arrays, and the concept of core dimensions.\n", + "Previously we looked at [applying functions](gentle-intro) on numpy arrays, and the concept of [core dimensions](core-dimensions).\n", "We learned that functions commonly support specifying \"core dimensions\" through the `axis` keyword\n", "argument. \n", "\n", @@ -39,7 +38,7 @@ " with given discrete data points (`xp`, `fp`), evaluated at `x`.\n", "```\n", "\n", - "This functions expects 1D arrays as input, so there is one core dimension and we cannot easily apply \n", + "This function expects 1D arrays as input, so there is one core dimension and we cannot easily apply \n", "it to a nD array since there is no `axis` keyword argument. \n", "\n", "\n", @@ -53,7 +52,7 @@ "Let's say we want to\n", "interpolate an array with two dimensions (`space`, `time`) over the `time` dimension, we might \n", "1. loop over the `space` dimension, \n", - "1. subset the array to a 1D array at that `space` loction, \n", + "1. subset the array to a 1D array at that `space` location, \n", "1. Interpolate the 1D arrays to the new `time` vector, and\n", "1. Assign that new interpolated 1D array to the appropriate location of a 2D output array\n", "\n", @@ -81,9 +80,9 @@ "\n", "The pattern of looping over any number of \"loop dimensions\" and applying a function along \"core dimensions\" \n", "is so common that numpy provides wrappers that automate these steps: \n", - "1. numpy.apply_along_axis\n", - "1. numpy.apply_along_axes\n", - "1. numpy.vectorize\n", + "1. [numpy.apply_along_axis](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html)\n", + "1. [numpy.apply_over_axes](https://numpy.org/doc/stable/reference/generated/numpy.apply_over_axes.html)\n", + "1. 
[numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html)\n", "\n", "\n", "`apply_ufunc` provides an easy interface to `numpy.vectorize` through the keyword argument `vectorize`. Here we see how to use\n", @@ -168,6 +167,8 @@ "source": [ "## Try nD input\n", "\n", + "Our goal is to interpolate latitude at every longitude and time, such that we go from a dataset with dimensions `(time: 4, lat: 25, lon: 3)` to `(time: 4, lat: 100, lon: 3)`. \n", + "\n", "If we blindly try passing `air` (a 3D DataArray), we get a hard-to-understand error" ] }, @@ -202,7 +203,11 @@ "tags": [] }, "source": [ - "We will use a \"wrapper\" function `interp1d_np` to examine what gets passed to `numpy.interp`" + "We will use a \"wrapper\" function `debug_interp` to examine what gets passed to `numpy.interp`.\n", + "\n", + "```{tip}\n", + "Such wrapper functions are a great way to understand and debug `apply_ufunc` use cases.\n", + "```" ] }, { @@ -216,13 +221,13 @@ }, "outputs": [], "source": [ - "def interp1d_np(xi, x, data):\n", + "def debug_interp(xi, x, data):\n", " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", " return np.interp(xi, x, data)\n", "\n", "\n", "interped = xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", + " debug_interp, # first the function\n", " newlat,\n", " air.lat,\n", " air,\n", @@ -236,43 +241,23 @@ "cell_type": "markdown", "id": "6f5c928b-f8cb-4016-9d6d-39743f9c2976", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ - "That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", + "That's a hard-to-interpret error from NumPy but our `print` call helpfully printed the shapes of the input data: \n", "\n", " data: (4, 3, 25) | x: (25,) | xi: (100,)\n", "\n", - "We see that `apply_ufunc` passes the full 3D array to `interp1d_np` which in turn passes that on to `numpy.interp`. This cannot work." 
- ] }, { "cell_type": "code", "execution_count": null, "id": "0ce03d38-db86-49c2-a60d-46d312ad60f8", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "xr.apply_ufunc(\n", " interp1d_np, # first the function\n", " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", " air.lat,\n", " newlat,\n", " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", ")" + "We see that `apply_ufunc` passes the full 3D array to `debug_interp` which in turn passes that on to `numpy.interp`. But `numpy.interp` requires a 1D input, and thus the error.\n", + "\n", + "Instead of passing the full 3D array we want to loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`." ] }, { "cell_type": "markdown", "id": "737cc6b4-522f-488c-9124-524cc42ebef3", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Vectorization with `np.vectorize`\n" @@ -282,12 +267,10 @@ "cell_type": "markdown", "id": "b6dac8da-8420-4fc4-9aeb-29b8999d4b37", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ - "Instead of passing the full 3D array we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`.\n", - "`apply_ufunc` makes this easy by specifying `vectorize=True`:\n", + "`apply_ufunc` makes it easy to loop over the loop dimensions by specifying `vectorize=True`:\n", "\n", " vectorize : bool, optional\n", " If True, then assume ``func`` only takes arrays defined over core\n", " dimensions as input and vectorize it automatically with\n", " :py:func:`numpy.vectorize`. This option exists for convenience, but is\n", " almost always slower than supplying a pre-vectorized function.\n", " \n", "\n", "```{warning}\n", - "Also see the numpy documentation for [numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html). 
Most importantly\n", "\n", " The vectorize function is provided primarily for convenience, not for performance. \n", " The implementation is essentially a for loop.\n", @@ -316,7 +299,7 @@ "outputs": [], "source": [ "interped = xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", + " debug_interp, # first the function\n", " newlat,\n", " air.lat,\n", " air,\n", @@ -332,15 +315,14 @@ "cell_type": "markdown", "id": "d81f399e-1649-4d4b-ad28-81cba8403210", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "Wow that worked!\n", "\n", "Notice that \n", "1. the printed input shapes are all 1D and correspond to one vector of size 25 along the `lat` dimension.\n", - "2. `interp1d_np` was called 4x3 = 12 times which is the total number `lat` vectors since the size along `time` is 4, and the size along `lon` is 3.\n", + "2. `debug_interp` was called 4x3 = 12 times which is the total number `lat` vectors since the size along `time` is 4, and the size along `lon` is 3.\n", "3. The result `interped` is now an xarray object with coordinate values copied over from `data`. \n", "\n", "\n", diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index c84e70e2..375224e4 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -328,11 +328,17 @@ "tags": [] }, "source": [ - "```{exercise}\n", + "````{exercise}\n", ":label: generalize\n", "\n", - "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the following use of `apply_ufunc` generalizes to an array with more than one dimension? 
Try it with `air3d = xr.tutorial.load_dataset(\"air_temperature\").air`\n", -    "```\n", +    "We presented the concept of \"core dimensions\" as the \"smallest unit of data the function could handle.\" Do you understand how the above use of `apply_ufunc` generalizes to an array with more than one dimension? \n", +    "\n", +    "Try applying the minmax function to a 3D air temperature dataset \n", +    "```python\n", +    "air3d = xr.tutorial.load_dataset(\"air_temperature\").air\n", +    "``` \n", +    "Your goal is to have a minimum and maximum value of temperature across all latitudes for a given time and longitude.\n", +    "````\n", "\n", "````{solution} generalize\n", ":class: dropdown\n", diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index 12e6d527..3c0afcd9 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -7,6 +7,7 @@ "tags": [] }, "source": [ +    "(gentle-intro)=\n", "# A gentle introduction\n", "\n", "Many, but not all, useful array methods are wrapped by Xarray and accessible\n", @@ -419,6 +420,7 @@ "tags": [] }, "source": [ +    "(core-dimensions)=\n", "## Reductions and core dimensions\n", "\n", "`squared_error` operated on a per-element basis. How about a reduction like `np.mean` applied along a single axis? 
\n", From 4b92ebf610d9b427face030e2ba832c822c5782f Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 09:23:26 -0600 Subject: [PATCH 28/38] Break out core dimensions material --- _toc.yml | 1 + advanced/apply_ufunc/core-dimensions.ipynb | 373 ++++++++++++++++++ .../simple_numpy_apply_ufunc.ipynb | 297 +------------- 3 files changed, 378 insertions(+), 293 deletions(-) create mode 100644 advanced/apply_ufunc/core-dimensions.ipynb diff --git a/_toc.yml b/_toc.yml index 12187904..14918bbe 100644 --- a/_toc.yml +++ b/_toc.yml @@ -47,6 +47,7 @@ parts: - file: advanced/apply_ufunc/apply_ufunc.md sections: - file: advanced/apply_ufunc/simple_numpy_apply_ufunc + - file: advanced/apply_ufunc/core-dimensions - file: advanced/apply_ufunc/complex-output-numpy - file: advanced/apply_ufunc/automatic-vectorizing-numpy - file: advanced/apply_ufunc/simple_dask_apply_ufunc diff --git a/advanced/apply_ufunc/core-dimensions.ipynb b/advanced/apply_ufunc/core-dimensions.ipynb new file mode 100644 index 00000000..f82a0bba --- /dev/null +++ b/advanced/apply_ufunc/core-dimensions.ipynb @@ -0,0 +1,373 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "8a5fd05a-5187-4fb6-8809-049e7158124b", + "metadata": { + "tags": [] + }, + "source": [ + "(core-dimensions)=\n", + "# Core dimensions\n", + "\n", + "[Previously](gentle-intro) we learned to use `apply_ufunc` on simple functions that acted element by element. \n", + "\n", + "Here we move on to slightly more complex functions like `np.mean` that can act along a subset of an input array's dimensions.\n", + "\n", + "Such operations involve the concept of \"core dimensions\". 
\n", + "\n", + "Our learning goals are:\n", + "- Learn how to identify \"core dimensions\" for the function you're applying.\n", + "- Learn that \"core dimensions\" are automatically moved or transposed to the end of the array.\n", + "\n", + "\n", + "## Introduction\n", + "\n", + "For using more complex operations that consider some array values collectively,\n", + "it’s important to understand the idea of **core dimensions**. \n", + "Usually, they correspond to the fundamental dimensions over\n", + "which an operation is defined, e.g., the summed axis in `np.sum`. One way to think about core dimensions \n", + "is to consider the smallest dimensionality of data that the function acts on.\n", + "\n", + "```{important}\n", + "\n", + "A good clue that core dimensions are needed is the presence of an `axis` argument on the\n", + "corresponding NumPy function.\n", + "\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "626c49ba-7d57-42e1-8e20-7d1b3ec7147d", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb4f4b80-d990-4bde-abf9-f77bac55c7ff", + "metadata": {}, + "outputs": [], + "source": [ + "%xmode minimal\n", + "\n", + "import numpy as np\n", + "import xarray as xr\n", + "\n", + "# limit the amount of information printed to screen\n", + "xr.set_options(display_expand_data=False)\n", + "np.set_printoptions(threshold=10, edgeitems=2)" + ] + }, + { + "cell_type": "markdown", + "id": "61124d8d-2992-46d1-bd49-6282fe9aba82", + "metadata": {}, + "source": [ + "Let's load a dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "920c4b39-6f55-43cf-99df-7eb6ea01ad82", + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.tutorial.load_dataset(\"air_temperature\")\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "cbaab26c-24ed-4e0f-8285-f0ffff92ec14", + "metadata": {}, + "source": [ + "## Reducing with `np.mean`\n", + "\n", + "Let's write a function that 
computes the mean along `time` for a provided xarray object. \n", + "\n", + "This function requires one core dimension `time`. For `ds.air` note that `time` is the 0th axis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6637aa49-4f2f-4526-929c-65086e0196a1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.air.dims" + ] + }, + { + "cell_type": "markdown", + "id": "7b66f1f2-369c-42b5-b082-735c71a2e1f9", + "metadata": { + "tags": [] + }, + "source": [ + "`get_axis_num` is a useful method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30e7e10b-5447-4384-bc25-0cdea72da25a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.air.get_axis_num(\"time\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f53ec56a-63bb-4103-87e9-4e7c14dc6c1a", + "metadata": {}, + "outputs": [], + "source": [ + "np.mean(ds.air, axis=ds.air.get_axis_num(\"time\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8214ae7-00fa-47fa-af79-79c3731b8cc1", + "metadata": {}, + "outputs": [], + "source": [ + "np.mean(ds.air.data, axis=0)" + ] + }, + { + "cell_type": "markdown", + "id": "938bd6a2-f0c0-412a-8f9b-4b2147750892", + "metadata": {}, + "source": [ + "Let's try to use `apply_ufunc` to replicate `np.mean(ds.air.data, axis=0)`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "424a54a2-b704-407b-80d8-27107949f184", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " # function to apply\n", + " np.mean,\n", + " # object with data to pass to function\n", + " ds,\n", + " # keyword arguments to pass to np.mean\n", + " kwargs={\"axis\": 0},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0ec1348a-cf6d-4d7e-b01b-5a9ea32f7565", + "metadata": { + "tags": [] + }, + "source": [ + "The error here\n", + "```\n", + "applied function returned data with unexpected number of 
dimensions. \n", +    "Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')\n", +    "```\n", +    "\n", +    "means that while `np.mean` did indeed reduce one dimension, we did not tell `apply_ufunc` that this would happen. That is, we need to specify the core dimensions on the input.\n", +    "\n", +    "Do that by passing a list of dimension names for each input object. For this function we have one input, `ds`, with a single core dimension `\"time\"`, so we have `input_core_dims=[[\"time\"]]`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e8d7580-2300-47f3-953b-403967eb7fcd", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + "    np.mean,\n", + "    ds,\n", + "    # specify core dimensions as a list of lists\n", + "    # here 'time' is the core dimension on `ds`\n", + "    input_core_dims=[\n", + "        [\"time\"],  # core dimension for ds\n", + "    ],\n", + "    kwargs={\"axis\": 0},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6e856818-80fe-4412-a71f-4f77bb6911c6", + "metadata": { + "tags": [] + }, + "source": [ + "This next error is a little confusing.\n", + "\n", + "```\n", + "size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. 
\n", + "Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.\n", + "```\n", + "\n", + "\n", + "A good trick here is to pass a little wrapper function to `apply_ufunc` instead and inspect the shapes of data received by the wrapper.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "562d5758-f492-4d9a-bf68-960da7580582", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "def wrapper(array, **kwargs):\n", + " print(f\"received {type(array)} shape: {array.shape}, kwargs: {kwargs}\")\n", + " result = np.mean(array, **kwargs)\n", + " print(f\"result.shape: {result.shape}\")\n", + " return result\n", + "\n", + "\n", + "xr.apply_ufunc(\n", + " wrapper,\n", + " ds,\n", + " # specify core dimensions as a list of lists\n", + " # here 'time' is the core dimension on `ds`\n", + " input_core_dims=[[\"time\"]],\n", + " kwargs={\"axis\": 0},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "27008aae-f1b7-45b2-88c7-2c440f23a958", + "metadata": {}, + "source": [ + "Now we see the issue:\n", + "\n", + " received shape: (25, 53, 2920), kwargs: {'axis': 0}\n", + " result.shape: (53, 2920)\n", + " \n", + "The `time` dimension is of size `2920` and is now the last axis of the array but was initially the first axis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45ee968e-bee6-43c3-b4a1-ffdfaa5bad3c", + "metadata": {}, + "outputs": [], + "source": [ + "ds.air.get_axis_num(\"time\")" + ] + }, + { + "cell_type": "markdown", + "id": "415b2169-e855-4013-bf88-10ee2e87604e", + "metadata": { + "tags": [] + }, + "source": [ + "```{important}\n", + "This illustrates an important concept. 
Arrays are transposed so that core dimensions are at the end.\n", + "```\n", + "\n", + "With `apply_ufunc`, core dimensions are recognized by name, and then moved to\n", + "the last dimension of any input arguments before applying the given function.\n", + "This means that for functions that accept an `axis` argument, you usually need\n", + "to set `axis=-1`\n", + "\n", + "Such behaviour means that our functions (like `wrapper` or `np.mean`) do not need to know the exact order of dimensions. They can rely on the core dimensions being at the end allowing us to write very general code! \n", + "\n", + "We can fix our `apply_ufunc` call by specifying `axis=-1` instead." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0dafff0a-3a6d-476b-bc96-409a1872d136", + "metadata": {}, + "outputs": [], + "source": [ + "def wrapper(array, **kwargs):\n", + " print(f\"received {type(array)} shape: {array.shape}, kwargs: {kwargs}\")\n", + " result = np.mean(array, **kwargs)\n", + " print(f\"result.shape: {result.shape}\")\n", + " return result\n", + "\n", + "\n", + "xr.apply_ufunc(\n", + " wrapper,\n", + " ds,\n", + " input_core_dims=[[\"time\"]],\n", + " kwargs={\"axis\": -1},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "156701e9-9104-4db4-85da-24ead2c0836d", + "metadata": { + "tags": [] + }, + "source": [ + "```{exercise}\n", + ":label: trapz\n", + "\n", + "Use `apply_ufunc` to apply `scipy.integrate.trapz` along the `time` axis.\n", + "```\n", + "\n", + "````{solution} trapz\n", + ":class: dropdown\n", + "\n", + "```python\n", + "import scipy as sp\n", + "import scipy.integrate\n", + "\n", + "xr.apply_ufunc(scipy.integrate.trapz, ds, input_core_dims=[[\"time\"]], kwargs={\"axis\": -1})\n", + "```\n", + "````" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + 
"pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb index 3c0afcd9..68fd8eb8 100644 --- a/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_numpy_apply_ufunc.ipynb @@ -22,10 +22,10 @@ "Xarray uses `apply_ufunc` internally to implement much of its API, meaning that it is quite powerful!\n", "```\n", "\n", - "Learning goals:\n", - "- Learn that `apply_ufunc` automates aspects of applying computation functions that are designed for pure arrays (like numpy arrays) on xarray objects\n", - "- Learn how to identify \"core dimensions\" for the function you're applying.\n", - "- Learn that \"core dimensions\" are automatically moved or transposed to the end of the array.\n", + "Our goals are to learn that `apply_ufunc` automates aspects of applying computation functions that are designed for pure arrays (like numpy arrays) on xarray objects including\n", + "- Propagating dimension names, coordinate variables, and (optionally) attributes.\n", + "- Handle Dataset input by looping over data variables.\n", + "- Allow passing arbitrary positional and keyword arguments\n", "\n", "\n", "```{tip}\n", @@ -412,295 +412,6 @@ "source": [ "xr.apply_ufunc(squared_error, ds.air, kwargs={\"y\": 1})" ] - }, - { - "cell_type": "markdown", - "id": "ac4f4511-2535-4dd8-af2e-085a9f453e9f", - "metadata": { - "tags": [] - }, - "source": [ - "(core-dimensions)=\n", - "## Reductions and core dimensions\n", - "\n", - "`squared_error` operated on a per-element basis. How about a reduction like `np.mean` applied along a single axis? \n", - "\n", - "Such operations involve the concept of \"core dimensions\". 
\n", - "\n", - "```{important}\n", - "One way to think about core dimensions is to consider the smallest dimensionality of data that the function acts on.\n", - "```\n", - "\n", - "For using more complex operations that consider some array values collectively,\n", - "it’s important to understand the idea of **core dimensions**. \n", - "Usually, they correspond to the fundamental dimensions over\n", - "which an operation is defined, e.g., the summed axis in `np.sum`. A good clue\n", - "that core dimensions are needed is the presence of an `axis` argument on the\n", - "corresponding NumPy function.\n", - "\n", - "Let's write a function that computes the mean along `time` for a provided xarray object. \n", - "\n", - "This function requires one core dimension `time`. For `ds.air` note that `time` is the 0th axis." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "84856bf9-e926-4b6f-9c38-ef61b2953610", - "metadata": {}, - "outputs": [], - "source": [ - "ds.air.dims" - ] - }, - { - "cell_type": "markdown", - "id": "f8edcfd8-3307-4417-bc38-19bda89df8e1", - "metadata": { - "tags": [] - }, - "source": [ - "`get_axis_num` is a useful method." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8abba2b4-a011-4dd6-846a-c174fb674233", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "ds.air.get_axis_num(\"time\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7b075981-c16c-4f0b-a37d-c2dede010ebc", - "metadata": {}, - "outputs": [], - "source": [ - "np.mean(ds.air, axis=ds.air.get_axis_num(\"time\"))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f2ab71ad-b2ff-4203-a484-0c2d6b2c4be4", - "metadata": {}, - "outputs": [], - "source": [ - "np.mean(ds.air.data, axis=0)" - ] - }, - { - "cell_type": "markdown", - "id": "484bc283-a0a8-40f1-906b-81aaf06a14ec", - "metadata": {}, - "source": [ - "Let's try to use `apply_ufunc` to replicate `np.mean(ds.air.data, axis=0)`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f56f265-a723-4165-a248-12f5ba8c34d5", - "metadata": { - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "xr.apply_ufunc(\n", - " # function to apply\n", - " np.mean,\n", - " # object with data to pass to function\n", - " ds,\n", - " # keyword arguments to pass to np.mean\n", - " kwargs={\"axis\": 0},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "f7be1720-b529-444d-852e-a6f63563ac3f", - "metadata": { - "tags": [] - }, - "source": [ - "The error here\n", - "```\n", - "applied function returned data with unexpected number of dimensions. \n", - "Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')\n", - "```\n", - "\n", - "means that while `np.mean` did indeed reduce one dimension, we did not tell `apply_ufunc` that this would happen. That is, we need to specify the core dimensions on the input.\n", - "\n", - "Do that by passing a list of dimension names for each input object. 
For this function we have one input : `ds` and with a single core dimension `\"time\"` so we have `input_core_dims=[[\"time\"]]`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "447f79cd-8ee2-416f-9ebe-b1b4317583d0", - "metadata": { - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "xr.apply_ufunc(\n", - " np.mean,\n", - " ds,\n", - " # specify core dimensions as a list of lists\n", - " # here 'time' is the core dimension on `ds`\n", - " input_core_dims=[\n", - " [\"time\"], # core dimension for ds\n", - " ],\n", - " kwargs={\"axis\": 0},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "534b12aa-8a69-457b-a901-70a2358b0b71", - "metadata": { - "tags": [] - }, - "source": [ - "This next error is a little confusing.\n", - "\n", - "```\n", - "size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. \n", - "Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.\n", - "```\n", - "\n", - "\n", - "A good trick here is to pass a little wrapper function to `apply_ufunc` instead and inspect the shapes of data received by the wrapper.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "560f3d23-7760-4573-a091-722b8a440fb8", - "metadata": { - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "def wrapper(array, **kwargs):\n", - " print(f\"received {type(array)} shape: {array.shape}, kwargs: {kwargs}\")\n", - " result = np.mean(array, **kwargs)\n", - " print(f\"result.shape: {result.shape}\")\n", - " return result\n", - "\n", - "\n", - "xr.apply_ufunc(\n", - " wrapper,\n", - " ds,\n", - " # specify core dimensions as a list of lists\n", - " # here 'time' is the core dimension on `ds`\n", - " input_core_dims=[[\"time\"]],\n", - " kwargs={\"axis\": 0},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "e3ba7a3d-5ab4-4b3b-b0d5-1b6ae98e8d7c", - "metadata": {}, - "source": [ - "Now we see the 
issue:\n", - "\n", - " received shape: (25, 53, 2920), kwargs: {'axis': 0}\n", - " result.shape: (53, 2920)\n", - " \n", - "The `time` dimension is of size `2920` and is now the last axis of the array but was initially the first axis" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6f2dd2ec-e036-4fb2-888b-8411e88091dc", - "metadata": {}, - "outputs": [], - "source": [ - "ds.air.get_axis_num(\"time\")" - ] - }, - { - "cell_type": "markdown", - "id": "81fe9fe1-83ff-4d60-bbee-fc11ecd4f73a", - "metadata": { - "tags": [] - }, - "source": [ - "```{important}\n", - "This illustrates an important concept. Arrays are transposed so that core dimensions are at the end.\n", - "```\n", - "\n", - "With `apply_ufunc`, core dimensions are recognized by name, and then moved to\n", - "the last dimension of any input arguments before applying the given function.\n", - "This means that for functions that accept an `axis` argument, you usually need\n", - "to set `axis=-1`\n", - "\n", - "Such behaviour means that our functions (like `wrapper` or `np.mean`) do not need to know the exact order of dimensions. They can rely on the core dimensions being at the end allowing us to write very general code! \n", - "\n", - "We can fix our `apply_ufunc` call by specifying `axis=-1` instead." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6c343d01-cdf6-4e48-a349-606f06c4c251", - "metadata": {}, - "outputs": [], - "source": [ - "def wrapper(array, **kwargs):\n", - " print(f\"received {type(array)} shape: {array.shape}, kwargs: {kwargs}\")\n", - " result = np.mean(array, **kwargs)\n", - " print(f\"result.shape: {result.shape}\")\n", - " return result\n", - "\n", - "\n", - "xr.apply_ufunc(\n", - " wrapper,\n", - " ds,\n", - " input_core_dims=[[\"time\"]],\n", - " kwargs={\"axis\": -1},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "600b8380-dab6-4caf-b2b9-91aee8447847", - "metadata": { - "tags": [] - }, - "source": [ - "```{exercise}\n", - ":label: trapz\n", - "\n", - "Use `apply_ufunc` to apply `scipy.integrate.trapz` along the `time` axis.\n", - "```\n", - "\n", - "````{solution} trapz\n", - ":class: dropdown\n", - "\n", - "```python\n", - "import scipy as sp\n", - "import scipy.integrate\n", - "\n", - "xr.apply_ufunc(scipy.integrate.trapz, ds, input_core_dims=[[\"time\"]], kwargs={\"axis\": -1})\n", - "```\n", - "````" - ] } ], "metadata": { From 0756860336234e5143f565f6bd3a6bde17bf7631 Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 09:26:44 -0600 Subject: [PATCH 29/38] fix --- advanced/apply_ufunc/core-dimensions.ipynb | 1 - 1 file changed, 1 deletion(-) diff --git a/advanced/apply_ufunc/core-dimensions.ipynb b/advanced/apply_ufunc/core-dimensions.ipynb index f82a0bba..2c4d39d9 100644 --- a/advanced/apply_ufunc/core-dimensions.ipynb +++ b/advanced/apply_ufunc/core-dimensions.ipynb @@ -7,7 +7,6 @@ "tags": [] }, "source": [ - "(core-dimensions)=\n", "# Core dimensions\n", "\n", "[Previously](gentle-intro) we learned to use `apply_ufunc` on simple functions that acted element by element. 
\n", From 742231bfdb6e2472478a8cae97041e02360961f0 Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 09:42:27 -0600 Subject: [PATCH 30/38] fix link --- advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index 0e18fe63..d88eb7bb 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -81,7 +81,7 @@ "The pattern of looping over any number of \"loop dimensions\" and applying a function along \"core dimensions\" \n", "is so common that numpy provides wrappers that automate these steps: \n", "1. [numpy.apply_along_axis](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html)\n", - "1. [numpy.apply_over_axes](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axes.html)\n", + "1. [numpy.apply_over_axes](https://numpy.org/doc/stable/reference/generated/numpy.apply_over_axes.html)\n", "1. 
[numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html)\n", "\n", "\n", From 89a37c94d8fb3dda00f9662980f391845d32afb8 Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 09:44:43 -0600 Subject: [PATCH 31/38] one more use of exercise syntax --- advanced/apply_ufunc/numba-vectorization.ipynb | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/advanced/apply_ufunc/numba-vectorization.ipynb b/advanced/apply_ufunc/numba-vectorization.ipynb index 278e530b..dd510a45 100644 --- a/advanced/apply_ufunc/numba-vectorization.ipynb +++ b/advanced/apply_ufunc/numba-vectorization.ipynb @@ -241,11 +241,12 @@ "id": "ad018876-302f-41bf-82e4-2605b92f3d30", "metadata": {}, "source": [ - "```{tip} Exercise\n", + "```{exercise}\n", + ":label: g\n", "\n", "Apply `g` to `da_dask`\n", "```\n", - "````{tip} Solution\n", + "````{solution} g\n", ":class: dropdown\n", "\n", "```python\n", From 95e0717fd74ea56fdde314243bfd8ee101e10683 Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 12:57:58 -0600 Subject: [PATCH 32/38] More review comments. 
--- .../apply_ufunc/complex-output-numpy.ipynb | 17 +- .../apply_ufunc/numba-vectorization.ipynb | 6 +- .../apply_ufunc/simple_dask_apply_ufunc.ipynb | 211 ++++++++++++------ 3 files changed, 153 insertions(+), 81 deletions(-) diff --git a/advanced/apply_ufunc/complex-output-numpy.ipynb b/advanced/apply_ufunc/complex-output-numpy.ipynb index 375224e4..74387ace 100644 --- a/advanced/apply_ufunc/complex-output-numpy.ipynb +++ b/advanced/apply_ufunc/complex-output-numpy.ipynb @@ -81,10 +81,10 @@ "cell_type": "markdown", "id": "db79d3e3-2a1a-4644-82a0-d58da74e0e36", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ + "(interp-add-new-dim)=\n", "## Adding a new dimension\n", "\n", "1D interpolation transforms the size of the input along a single dimension.\n", @@ -141,26 +141,26 @@ "```{exercise}\n", ":label: newdim\n", "\n", - "Apply the following function using `apply_ufunc`. It adds a new dimension to the input array, let's call it `newdim`. Specify the new dimension using `output_core_dims`\n", + "Apply the following function using `apply_ufunc`. It adds a new dimension to the input array, let's call it `newdim`. Specify the new dimension using `output_core_dims`. 
Do you need any `input_core_dims`?\n", "\n", "```python\n", "def add_new_dim(array):\n", - " return np.expand_dims(array, axis=0)\n", + " return np.expand_dims(array, axis=-1)\n", "```\n", "````{solution} newdim\n", ":class: dropdown\n", "\n", "``` python\n", "def add_new_dim(array):\n", - " return np.expand_dims(array, axis=0)\n", + " return np.expand_dims(array, axis=-1)\n", "\n", "\n", "xr.apply_ufunc(\n", " add_new_dim,\n", " air,\n", - " input_core_dims=[[\"lat\"]],\n", - " output_core_dims=[[\"newdim\", \"lat\"]],\n", + " output_core_dims=[[\"newdim\"]],\n", ")\n", + "```\n", "````" ] }, @@ -254,8 +254,7 @@ "cell_type": "markdown", "id": "a3c45b16-795c-4549-95ee-8e9c0d7dd517", "metadata": { - "tags": [], - "user_expressions": [] + "tags": [] }, "source": [ "## Returning multiple variables\n", diff --git a/advanced/apply_ufunc/numba-vectorization.ipynb b/advanced/apply_ufunc/numba-vectorization.ipynb index dd510a45..c1df10d1 100644 --- a/advanced/apply_ufunc/numba-vectorization.ipynb +++ b/advanced/apply_ufunc/numba-vectorization.ipynb @@ -27,7 +27,7 @@ "1. [`vectorize`](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-vectorize-decorator) for functions that act on scalars, and \n", "2. [`guvectorize`](https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator) for functions that operates on subsets of the array along core-dimensions. Any decorated function gets compiled and will loop over the loop dimensions in parallel when necessary. \n", "\n", - "For `apply_ufunc` the key concept is that `vectorize=False` (the default) when using Numba vectorized functions. \n", + "For `apply_ufunc` the key concept is that we must provide `vectorize=False` (the default) when using Numba vectorized functions. \n", "Numba handles the vectorization (or looping) and `apply_ufunc` handles converting Xarray objects to bare arrays and handling metadata." 
] }, @@ -202,9 +202,9 @@ "metadata": {}, "source": [ "Notice the following:\n", - "1. The `guvectorize` decorator includes the concept of \"core dimensions\": `'(n),()->(n)'`. This string means that the `g` takes a 1D vector of size `n`, a scalar, and returns a 1D vector of size `n`.\n", + "1. The `guvectorize` decorator includes the concept of \"core dimensions\": `'(n),()->(n)'`. This string means that the `g` takes a 1D vector of size `n`, a scalar, and returns a 1D vector of size `n`. There is one core dimension for the input, and one core dimension for the output. Both core dimensions have the same size.\n", "2. That string translates to `input_core_dims=[[\"x\"], []], output_core_dims=[[\"x\"]]` in `apply_ufunc`.\n", - "3. We don't provide `vectorize=False` to `apply_ufunc` since `numba` will handle the vectorization in compiled code automatically." + "3. We don't provide `vectorize=True` to `apply_ufunc` since `numba` will handle the vectorization in compiled code automatically." ] }, { diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb index d3e0022f..8d589d2e 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb @@ -260,28 +260,12 @@ "tags": [] }, "source": [ - "## Reductions and core dimensions\n", + "## Core dimensions\n", "\n", "`squared_error` operated on a per-element basis. How about a reduction like `np.mean`?\n", "\n", - "Such functions involve the concept of \"core dimensions\". One way to think about core dimensions is to consider the smallest dimensionality of data necessary to apply the function.\n", + "Such functions involve the concept of \"core dimensions\". This concept is independent of the underlying array type, and is a property of the applied function. 
See the [core dimensions with NumPy](core-dimensions) tutorial for more.\n", "\n", - "For using more complex operations that consider some array values collectively,\n", - "it’s important to understand the idea of **core dimensions**. \n", - "Usually, they correspond to the fundamental dimensions over\n", - "which an operation is defined, e.g., the summed axis in `np.sum`. A good clue\n", - "that core dimensions are needed is the presence of an `axis` argument on the\n", - "corresponding NumPy function.\n", - "\n", - "```{important}\n", - "One way to think about core dimensions is to consider the smallest dimensionality of data that the function acts on.\n", - "```\n", - "\n", - "\n", - "With `apply_ufunc`, core dimensions are recognized by name, and then moved to\n", - "the last dimension of any input arguments before applying the given function.\n", - "This means that for functions that accept an `axis` argument, you usually need\n", - "to set `axis=-1`\n", "\n", "```{exercise}\n", ":label: daskmean\n", @@ -354,10 +338,6 @@ "in parallel to each block. This ability can be activated using\n", "`dask=\"parallelized\"`. \n", "\n", - "Again xarray needs a lot of extra metadata, so depending\n", - "on the function, extra arguments such as `output_dtypes` and `output_sizes` may\n", - "be necessary.\n", - "\n", "We will use `scipy.integrate.trapz` as an example of a function that cannot\n", "handle dask arrays and requires a core dimension. If we call `trapz` with a dask\n", "array, we get a numpy array back that is, the values have been eagerly computed.\n", @@ -430,31 +410,22 @@ }, { "cell_type": "markdown", - "id": "09e87266-2ff7-452d-b96d-d551d55c5e4e", - "metadata": { - "tags": [] - }, + "id": "59e97071-493c-4df3-8214-d9d5ada940d2", + "metadata": {}, "source": [ - "We'll define a function `integrate_lon` for later use:" + "Now you have control over executing this parallel computation." 
] }, { "cell_type": "code", "execution_count": null, - "id": "689909bf-884e-4dbd-9e28-ed2cb68f80b8", - "metadata": { - "tags": [] - }, + "id": "901f0d1d-a0c5-49fe-a27e-4b831ae0f9f4", + "metadata": {}, "outputs": [], "source": [ - "def integrate_lon(ds):\n", - " return xr.apply_ufunc(\n", - " sp.integrate.trapz,\n", - " ds,\n", - " input_core_dims=[[\"lon\"]],\n", - " kwargs={\"axis\": -1},\n", - " dask=\"parallelized\",\n", - " )" + "# Dask -> Numpy array of integrated values\n", + "parallelized_results = integrated.compute()\n", + "parallelized_results" ] }, { @@ -533,17 +504,20 @@ "source": [ "## More complex situations\n", "\n", - "\n", "Here we quickly demonstrate that all the concepts from the numpy material earlier carry over.\n", "\n", - "### Adding new dimensions\n", + "Xarray needs a lot of extra metadata, so depending\n", + "on the function, extra arguments such as `output_dtypes` and `output_sizes` may\n", + "be necessary for supporting dask arrays. We demonstrate this below.\n", "\n", - "Again we use the [1D interpolation example](complex-output) that changes the size of the input along a single dimension.\n", + "### Adding new dimensions\n", "\n", - "Logically, we can think of this as removing the old dimension and adding a new dimension.\n", + "We use the [expand_dims example](newdim) that changes the size of the input along a single dimension.\n", "\n", - "For `interp` we expect one returned output with one new core dimension that we will call `\"lat_interp\"`.\n", - "Specify this using `output_core_dims=[[\"lat_interp\"]]`\n", + "```python\n", + "def add_new_dim(array):\n", + " return np.expand_dims(array, axis=0)\n", + "```\n", "\n", "When automatically parallelizing with `dask`, we need to provide some more information about the outputs.\n", "1. 
When adding a new dimension, we need to provide the size in `dask_gufunc_kwargs` using the key `output_sizes`\n", @@ -555,22 +529,57 @@ "execution_count": null, "id": "5d82f436-65fe-484e-9295-382f6c725b80", "metadata": { - "tags": [] + "tags": [ + "raises-exception" + ] }, "outputs": [], "source": [ - "newlat = np.linspace(15, 75, 100)\n", + "def add_new_dim(array):\n", + " return np.expand_dims(array, axis=-1)\n", + "\n", + "\n", "xr.apply_ufunc(\n", - " np.interp, # first the function\n", - " newlat,\n", - " ds.air.lat,\n", + " add_new_dim, # first the function\n", " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", - " input_core_dims=[[\"lat_interp\"], [\"lat\"], [\"lat\"]],\n", - " output_core_dims=[[\"lat_interp\"]],\n", + " output_core_dims=[[\"newdim\"]],\n", " dask=\"parallelized\",\n", - " dask_gufunc_kwargs=dict(output_sizes={\"lat_interp\": len(newlat)}),\n", - " output_dtypes=[ds.air.dtype],\n", + ")" ] }, + { + "cell_type": "markdown", + "id": "d92e60e2-2652-4023-a6ce-7f83046e44ad", + "metadata": {}, + "source": [ + "Provide the size of the newly added dimension `newdim` in `output_sizes` as part of the `dask_gufunc_kwargs` keyword argument:\n", + "\n", + " dask_gufunc_kwargs (dict, optional) – Optional keyword arguments passed to dask.array.apply_gufunc() \n", + " if dask=’parallelized’. 
Possible keywords are output_sizes, allow_rechunk and meta.\n", + " \n", + "The syntax is \n", + "```python\n", + "dask_gufunc_kwargs={\n", + " \"output_sizes\": {\"newdim\": 1}\n", + "}\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32bc121c-cb91-430b-96ab-a39079b5e80c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "xr.apply_ufunc(\n", + " add_new_dim, # first the function\n", + " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " output_core_dims=[[\"newdim\"]],\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs={\"output_sizes\": {\"newdim\": 1}},\n", ")" ] }, @@ -584,19 +593,75 @@ }, { "cell_type": "markdown", - "id": "5f019966-c296-41c8-a0f8-91bcffa7ac7f", + "id": "72177e24-0c3b-4e73-9a2b-9929d19c0490", "metadata": {}, "source": [ - "```{exercise}\n", - ":label: lat\n", + "We will now repeat the [interpolation example from earlier](interp-add-new-dims) with `\"lat\"` as the output core dimension. See the numpy notebook on [complex output](complex-output) for more.\n", "\n", - "Repeat the above computation but provide `\"lat\"` as the output core dimension. Remember that you need to provide `exclude_dims` for this case. 
See the numpy notebook on [complex output](complex-output) for more.\n", + "```python\n", + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp,\n", + " newlat,\n", + " ds.air.lat,\n", + " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")\n", "```\n", "\n", - "````{solution} lat\n", - ":class: dropdown\n", + "We will first add `dask=\"parallelized\"` and provide `output_sizes` in `dask_gufunc_kwargs`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "180ae3e2-82a2-4d2c-8217-88c97533eb1e", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "newlat = np.linspace(15, 75, 100)\n", "\n", - "```python\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " ds.air.lat,\n", + " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + " # The following are dask-specific\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs=dict(output_sizes={\"lat\": len(newlat)}),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "62424e94-baed-425c-8169-da769da480b8", + "metadata": {}, + "source": [ + "This error means that we need to provide `output_dtypes`\n", + "\n", + " output_dtypes (list of dtype, optional) – Optional list of output dtypes. \n", + " Only used if dask='parallelized' or vectorize=True." 
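Why a dtype must sometimes be declared can be sketched with the same dummy-probe trick dask itself relies on: call the function on a tiny input and read the dtype off the result. The `probe_dtype` helper below is purely illustrative (it is not a dask or xarray API):

```python
import numpy as np


# Illustrative only: infer an output dtype by probing the function with
# a tiny dummy input, roughly the way dask does when no dtype is given.
def probe_dtype(func, dummy):
    return func(dummy).dtype


dummy = np.ones((1, 1), dtype="float32")
# Summing float32 values stays float32, so probing works here...
assert probe_dtype(lambda x: x.sum(axis=-1), dummy) == np.dtype("float32")
# ...but a function whose output dtype depends on the *values* of the real
# data would defeat the probe -- then `output_dtypes` must be supplied.
```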
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "797a2875-e727-4866-af43-5ed0e8eaed5f", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ "newlat = np.linspace(15, 75, 100)\n", "\n", "xr.apply_ufunc(\n", @@ -607,13 +672,21 @@ " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", " output_core_dims=[[\"lat\"]],\n", " exclude_dims={\"lat\"},\n", + " # The following are dask-specific\n", " dask=\"parallelized\",\n", " dask_gufunc_kwargs=dict(output_sizes={\"lat\": len(newlat)}),\n", " output_dtypes=[ds.air.dtype],\n", - ")\n", - "```\n", - "\n", - "````" + ")" + ] + }, + { + "cell_type": "markdown", + "id": "344b3051-b469-4859-8412-dc60552b1d14", + "metadata": {}, + "source": [ + "```{tip}\n", + "Dask can sometimes figure out the output sizes and dtypes. The usual workflow is to read the error messages and iteratively pass more information to `apply_ufunc`.\n", + "```" ] }, { @@ -640,13 +713,13 @@ " newlat,\n", " ds.air.lat,\n", " ds.air,\n", - " input_core_dims=[[\"lat_interp\"], [\"lat\"], [\"lat\"]],\n", - " output_core_dims=[[\"lat_interp\"]],\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", " exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n", - " vectorize=True,\n", " dask=\"parallelized\",\n", - " dask_gufunc_kwargs=dict(output_sizes={\"lat_interp\": len(newlat)}),\n", + " dask_gufunc_kwargs=dict(output_sizes={\"lat\": len(newlat)}),\n", " output_dtypes=[ds.air.dtype],\n", + " vectorize=True,\n", ")\n", "interped" ] From 4cf9e9d5c0758334fdd1a435990b3fdcb86a41fc Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 14:23:33 -0600 Subject: [PATCH 33/38] Significant dask updates. 
--- _toc.yml | 2 +- ...ply_ufunc.ipynb => dask_apply_ufunc.ipynb} | 183 ++++++++++++++++-- 2 files changed, 168 insertions(+), 17 deletions(-) rename advanced/apply_ufunc/{simple_dask_apply_ufunc.ipynb => dask_apply_ufunc.ipynb} (77%) diff --git a/_toc.yml b/_toc.yml index 14918bbe..47e8d35c 100644 --- a/_toc.yml +++ b/_toc.yml @@ -50,7 +50,7 @@ parts: - file: advanced/apply_ufunc/core-dimensions - file: advanced/apply_ufunc/complex-output-numpy - file: advanced/apply_ufunc/automatic-vectorizing-numpy - - file: advanced/apply_ufunc/simple_dask_apply_ufunc + - file: advanced/apply_ufunc/dask_apply_ufunc - file: advanced/apply_ufunc/numba-vectorization - file: advanced/apply_ufunc/example-interp - file: advanced/map_blocks/map_blocks.md diff --git a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb b/advanced/apply_ufunc/dask_apply_ufunc.ipynb similarity index 77% rename from advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb rename to advanced/apply_ufunc/dask_apply_ufunc.ipynb index 8d589d2e..34a403c0 100644 --- a/advanced/apply_ufunc/simple_dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/dask_apply_ufunc.ipynb @@ -366,7 +366,7 @@ "tags": [] }, "source": [ - "Let's activate automatic parallelization with `dask=\"parallelized\"`" + "Let's activate automatic parallelization by using `apply_ufunc` with `dask=\"parallelized\"`" ] }, { @@ -430,17 +430,32 @@ }, { "cell_type": "markdown", - "id": "a348a4c4-b36f-4ad8-8549-fce35ef429dd", + "id": "6b91b8ca-cec9-49ca-b136-bf3b5f612d91", "metadata": { "tags": [] }, "source": [ - "### Understanding what is happening\n", + "### Understanding `dask=\"parallelized\"`\n", "\n", + "It is very important to understand what `dask=\"parallelized\"` does. To fully understand it requires understanding some core concepts.\n", "\n", - "It is very important to understand what `dask=\"parallelized\"` does. 
It is quite convenient and works well for \"blockwise\" or \"embarrassingly parallel\" operations.\n", + "```{seealso}\n", + "For `dask=\"parallelized\"` `apply_ufunc` will call `dask.array.apply_gufunc`. See the dask documentation on [generalized ufuncs](https://docs.dask.org/en/stable/array-gufunc.html) and [`apply_gufunc`](https://docs.dask.org/en/stable/generated/dask.array.gufunc.apply_gufunc.html) for more.\n", + "```" ] }, { "cell_type": "markdown", "id": "116fd5c7-5d15-4c05-96d0-73acb28e5403", "metadata": { "tags": [] }, "source": [ "#### Embarassingly parallel or blockwise operations\n", "\n", "`dask=\"parallelized\"` works well for \"blockwise\" or \"embarrassingly parallel\" operations ([Wikipedia](https://en.wikipedia.org/wiki/Embarrassingly_parallel)).\n", "\n", - "These are operations where one block or chunk of the output array corresponds to one block or chunk of the input array. Specifically, the blocks or chunks of the _core dimension_ is what matters.\n", + "These are operations where one block or chunk of the output array corresponds to one block or chunk of the input array. Specifically, the blocks or chunks of the _core dimension_ are what matter. Importantly, no communication between blocks is necessary to create the output, which makes parallelization quite simple or \"embarrassing\".\n", "\n", "Let's look at the dask repr for `ds` and note chunksizes are (100,25,53) for an array with shape (2920, 25, 53). This means that each block or chunk of the array contains all `lat`, `lon` points and a subset of `time` points."
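Stripped of dask and xarray, the blockwise idea looks like this. `blockwise_apply` is an illustrative helper (not dask's API): each block is processed independently, with no communication between blocks, and the per-block results are stitched back together.

```python
import numpy as np


def blockwise_apply(func, array, blocksize):
    # Split along the first axis into independent blocks...
    blocks = [array[i : i + blocksize] for i in range(0, len(array), blocksize)]
    # ...apply the function to each block separately (this is the part a
    # scheduler could run in parallel), then stitch the results together.
    return np.concatenate([func(block) for block in blocks])


data = np.arange(12.0).reshape(6, 2)
whole = data * 10  # operating on the full array at once
blocked = blockwise_apply(lambda b: b * 10, data, blocksize=2)
assert (whole == blocked).all()  # identical results, block by block
```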
] @@ -473,7 +488,7 @@ }, { "cell_type": "markdown", - "id": "4b4707d4-f596-47a8-8a3b-6a4a157f0759", + "id": "141c4c5d-d3dc-433c-a89f-8c8e7084f333", "metadata": { "tags": [] }, @@ -499,8 +514,113 @@ }, { "cell_type": "markdown", - "id": "dfed3978-288f-48b4-a978-edfce2f3ddb8", + "id": "f411e1dc-5413-4400-8127-757f4bb7d640", + "metadata": { + "tags": [] + }, + "source": [ + "#### Understanding execution\n", + "\n", + "We are layering many concepts together here so it is important to understand how the function is executed, and what input it will receive. Again we will use our wrapper trick." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c454151-b331-43d1-8eb1-c0103e821712", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def integrate_wrapper(array, **kwargs):\n", + " print(f\"received array of type {type(array)}, shape {array.shape}\")\n", + " result = sp.integrate.trapz(array, **kwargs)\n", + " print(f\"returning array of type {type(result)}, shape {result.shape}\")\n", + " return result\n", + "\n", + "\n", + "integrated = xr.apply_ufunc(\n", + " integrate_wrapper,\n", + " ds,\n", + " input_core_dims=[[\"lon\"]],\n", + " kwargs={\"axis\": -1},\n", + " dask=\"parallelized\",\n", + ")\n", + "integrated" + ] + }, + { + "cell_type": "markdown", + "id": "33c4e90c-f153-407f-ba1b-e100e827e2c2", + "metadata": {}, + "source": [ + "Note that we received an Xarray object back (`integrated`) but our wrapper function was called with a numpy array of shape `(1,1,1)`.\n", + "\n", + "Importantly the full 3D array has **not yet been** passed to `integrate_wrapper`. Yet dask needs to know the shape and dtype of the result. This is key. 
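This shape-discovery step can be sketched in a few lines. `infer_removed_dims` is a hypothetical helper for illustration only: probe the black-box function with a `(1, 1, 1)` dummy array and compare the dimensionality of input and output.

```python
import numpy as np


def infer_removed_dims(func, ndim_in=3):
    # Probe the black-box function with a tiny dummy input, roughly the
    # way dask.array.apply_gufunc does, and count how many dimensions
    # disappeared (a negative result means dimensions were added).
    dummy = np.ones((1,) * ndim_in)
    result = np.asarray(func(dummy))
    return ndim_in - result.ndim


# Reducing along the last axis consumes one core dimension...
assert infer_removed_dims(lambda a: a.sum(axis=-1)) == 1
# ...while an elementwise function preserves dimensionality.
assert infer_removed_dims(lambda a: a * 2) == 0
```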
\n", + "\n", + "The `integrate_wrapper` function is treated like a black box, and its effect on the inputs has to either be described through additional keyword arguments, or inferred by passing dummy inputs.\n", + "\n", + "To do so, `dask.array.apply_gufunc` calls the user function with dummy inputs (here a numpy array of shape `(1,1,1)`), and inspects the returned value to understand that one dimension was removed (it returned a numpy array of shape `(1,1)`).\n", + "\n", + "```{caution}\n", + "Some functions can have trouble handling such dummy inputs. Alternatively you can pass `meta = np.ones((1,1))` in `dask_gufunc_kwargs` to prevent dask from providing dummy inputs to the array.\n", + "```\n", + "\n", + "Since no errors were raised we proceed as-is.\n", + "\n", + "Let's compute the array to get real values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b53fed48-5dc2-4a0b-985c-71bd436ca058", + "metadata": { + "tags": [ + "output-scroll" + ] + }, + "outputs": [], + "source": [ + "integrated.compute()" + ] + }, + { + "cell_type": "markdown", + "id": "c265429d-6abb-4475-a9b0-2dcb0902afb1", "metadata": {}, + "source": [ + "We see that `integrate_wrapper` is called many times!\n", + "\n", + "As many times as there are blocks in the array in fact, which is 30 here (`ds.air.data.numblocks`).\n", + "\n", + "Our function is independently executed on each block of the array, and then the results.\n", + "\n", + "The order of execution when executing `integrated.compute()` which looks ilke\n", + "\n", + "`xarray.apply_ufunc` → `dask.array.apply_gufunc` → `integrate_wrapper` → `scipy.integrate.trapz`\n", + "\n", + "\n", + "When executed\n", + "\n", + "1. Xarray loops over all data variables.\n", + "1. Xarray unwraps the underlying dask array (e.g. `ds.air`) and passes that to dask's `apply_gufunc`.\n", + "1. `apply_gufunc` calls `integrate_wrapper` on each block of the array.\n", + "1. 
For each block, `integrate_wrapper` calls `scipy.integrate.trapz` and returns one block of the output array.\n", + "1. dask stitches all the output blocks to form the output array.\n", + "1. `xarray.apply_ufunc` wraps the output array with Xarray metadata to give the final result.\n", + "\n", + "Phew!\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "4d971644-7cd2-4cda-8004-c0bb588e3a8b", + "metadata": { + "tags": [] + }, "source": [ "## More complex situations\n", "\n", @@ -508,8 +628,16 @@ "\n", "Xarray needs a lot of extra metadata, so depending\n", "on the function, extra arguments such as `output_dtypes` and `output_sizes` may\n", - "be necessary for supporting dask arrays. We demonstrate this below.\n", - "\n", + "be necessary for supporting dask arrays. We demonstrate this below." + ] + }, + { + "cell_type": "markdown", + "id": "0c984ddb-3798-435b-b33b-0f9d8a645b83", + "metadata": { + "tags": [] + }, + "source": [ "### Adding new dimensions\n", "\n", "We use the [expand_dims example](newdim) that changes the size of the input along a single dimension.\n", @@ -668,7 +796,7 @@ " np.interp, # first the function\n", " newlat,\n", " ds.air.lat,\n", - " ds.air.chunk({\"time\": 2, \"lon\": 2}),\n", + " ds.air.chunk({\"time\": 100, \"lon\": -1}),\n", " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", " output_core_dims=[[\"lat\"]],\n", " exclude_dims={\"lat\"},\n", @@ -712,7 +840,7 @@ " np.interp, # first the function\n", " newlat,\n", " ds.air.lat,\n", - " ds.air,\n", + " ds.chunk({\"time\": 100, \"lon\": -1}),\n", " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", " output_core_dims=[[\"lat\"]],\n", " exclude_dims={\"lat\"}, # dimensions allowed to change size. 
Must be set!\n", @@ -724,6 +852,31 @@ "interped" ] }, + { + "cell_type": "markdown", + "id": "e484de1b-9d2f-4ad0-b978-550df593ee2f", + "metadata": {}, + "source": [ + "Again, it is important to understand the order of execution when executing `interped.compute()` which looks ilke\n", + "\n", + "`xarray.apply_ufunc` → `dask.array.apply_gufunc` → `numpy.vectorize` → `numpy.interp`\n", + "\n", + "\n", + "When executed\n", + "\n", + "1. Xarray loops over all data variables.\n", + "1. Xarray unwraps the underlying dask array (e.g. `ds.air`) and passes that to dask's `apply_gufunc`.\n", + "1. `apply_gufunc` calls the vectorized function on each block of the array.\n", + "1. For each block, `numpy.vectorize` handles looping over the loop dimensions \n", + " and passes 1D vectors along the core dimension to `numpy.interp`\n", + "1. The 1D results for each block are concatenated by `numpy.vectorize` to create one output block.\n", + "1. dask stitches all the output blocks to form the output array.\n", + "1. `xarray.apply_ufunc` wraps the output array with Xarray metadata to give the final result.\n", + "\n", + "Phew!\n", + "\n" + ] + }, { "cell_type": "markdown", "id": "c3618181-7432-409e-95e7-d7bc4cc567ce", @@ -737,13 +890,11 @@ { "cell_type": "code", "execution_count": null, - "id": "6b375c8a-b7ef-47a6-b007-1fc2f34a2cf5", - "metadata": { - "tags": [] - }, + "id": "b5d9ab78-1b89-4530-a904-1d1e8475e949", + "metadata": {}, "outputs": [], "source": [ - "client.close()" + "client.close();" ] } ], From e36fde57f5de7252da16896967c68488150be255 Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 14:25:50 -0600 Subject: [PATCH 34/38] last fix? 
--- advanced/apply_ufunc/example-interp.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advanced/apply_ufunc/example-interp.ipynb b/advanced/apply_ufunc/example-interp.ipynb index 40cbe9a1..2cfe45d1 100644 --- a/advanced/apply_ufunc/example-interp.ipynb +++ b/advanced/apply_ufunc/example-interp.ipynb @@ -35,7 +35,7 @@ " \n", "## Learning goals \n", "\n", - "This example will illustrate \n", + "Our goal is to use `apply_ufunc` with a general function so that we can reuse our code to apply to different xarray datasets or along different dimensions. Specifically, this example will illustrate \n", "1. Specifying core dimensions with `input_core_dims`\n", "1. Handling core dimensions of the output with `output_core_dims`\n", "1. Handling core dimensions that change size using `exclude_dims`\n", From 6fefa599ff4cf470a4885f232f608de26c0f825e Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 14:32:01 -0600 Subject: [PATCH 35/38] more improvements. --- advanced/apply_ufunc/dask_apply_ufunc.ipynb | 32 ++++++++++++++------- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/advanced/apply_ufunc/dask_apply_ufunc.ipynb b/advanced/apply_ufunc/dask_apply_ufunc.ipynb index 34a403c0..93520906 100644 --- a/advanced/apply_ufunc/dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/dask_apply_ufunc.ipynb @@ -557,15 +557,29 @@ "source": [ "Note that we received an Xarray object back (`integrated`) but our wrapper function was called with a numpy array of shape `(1,1,1)`.\n", "\n", - "Importantly the full 3D array has **not yet been** passed to `integrate_wrapper`. Yet dask needs to know the shape and dtype of the result. This is key. \n", + "```{important}\n", + "the full 3D array has **not yet been** passed to `integrate_wrapper`. Yet dask needs to know the shape and dtype of the result. This is key. 
\n", + "```\n", "\n", "The `integrate_wrapper` function is treated like a black box, and its effect on the inputs has to either be described through additional keyword arguments, or inferred by passing dummy inputs.\n", "\n", "To do so, `dask.array.apply_gufunc` calls the user function with dummy inputs (here a numpy array of shape `(1,1,1)`), and inspects the returned value to understand that one dimension was removed (it returned a numpy array of shape `(1,1)`).\n", "\n", - "```{caution}\n", + "````{caution}\n", + ":class: dropdown\n", + "\n", "Some functions can have trouble handling such dummy inputs. Alternatively you can pass `meta = np.ones((1,1))` in `dask_gufunc_kwargs` to prevent dask from providing dummy inputs to the array.\n", + "```python\n", + "xr.apply_ufunc(\n", + " integrate_wrapper,\n", + " ds,\n", + " input_core_dims=[[\"lon\"]],\n", + " kwargs={\"axis\": -1},\n", + " dask=\"parallelized\",\n", + " dask_gufunc_kwargs={\"meta\": np.ones((1,1))},\n", + ")\n", "```\n", + "````\n", "\n", "Since no errors were raised we proceed as-is.\n", @@ -591,15 +605,13 @@ "id": "c265429d-6abb-4475-a9b0-2dcb0902afb1", "metadata": {}, "source": [ - "We see that `integrate_wrapper` is called many times!\n", - "\n", - "As many times as there are blocks in the array in fact, which is 30 here (`ds.air.data.numblocks`).\n", + "We see that `integrate_wrapper` is called many times! 
As many times as there are blocks in the array in fact, which is 30 here (`ds.air.data.numblocks`).\n", "\n", - "Our function is independently executed on each block of the array, and then the results.\n", + "Our function is independently executed on each block of the array, and then the results are concatenated to form the final result.\n", "\n", - "The order of execution when executing `integrated.compute()` which looks ilke\n", + "Conceptually, there is a two-way flow of information between various packages when executing `integrated.compute()`:\n", "\n", - "`xarray.apply_ufunc` → `dask.array.apply_gufunc` → `integrate_wrapper` → `scipy.integrate.trapz`\n", + "`xarray.apply_ufunc` ↔ `dask.array.apply_gufunc` ↔ `integrate_wrapper` ↔ `scipy.integrate.trapz` ↔ `ds.air.data`\n", "\n", "\n", "When executed\n", @@ -857,9 +869,9 @@ "id": "e484de1b-9d2f-4ad0-b978-550df593ee2f", "metadata": {}, "source": [ - "Again, it is important to understand the order of execution when executing `interped.compute()` which looks ilke\n", + "Again, it is important to understand the conceptual flow of information between the various packages when executing `interped.compute()` which looks like\n", "\n", - "`xarray.apply_ufunc` → `dask.array.apply_gufunc` → `numpy.vectorize` → `numpy.interp`\n", + "`xarray.apply_ufunc` ↔ `dask.array.apply_gufunc` ↔ `numpy.vectorize` ↔ `numpy.interp`\n", From 0c5fb88ab5253eb9f56a27b0220873168c0174ea Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 14:35:33 -0600 Subject: [PATCH 36/38] lint --- advanced/apply_ufunc/dask_apply_ufunc.ipynb | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/advanced/apply_ufunc/dask_apply_ufunc.ipynb b/advanced/apply_ufunc/dask_apply_ufunc.ipynb index 93520906..0538c290 100644 --- a/advanced/apply_ufunc/dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/dask_apply_ufunc.ipynb @@ -451,7 +451,7 @@ "tags": [] }, "source": [ - "#### Embarassingly parallel or 
blockwise operations\n", + "#### Embarrassingly parallel or blockwise operations\n", "\n", "`dask=\"parallelized\"` works well for \"blockwise\" or \"embarrassingly parallel\" operations ([Wikipedia](https://en.wikipedia.org/wiki/Embarrassingly_parallel)).\n", "\n", @@ -736,7 +736,7 @@ "id": "72177e24-0c3b-4e73-9a2b-9929d19c0490", "metadata": {}, "source": [ - "We will now repeat the [interpolation example from earlier](interp-add-new-dims) with `\"lat\"` as the output core dimension. See the numpy notebook on [complex output](complex-output) for more.\n", + "We will now repeat the [interpolation example from earlier](interp-add-new-dim) with `\"lat\"` as the output core dimension. See the numpy notebook on [complex output](complex-output) for more.\n", "\n", "```python\n", "newlat = np.linspace(15, 75, 100)\n", @@ -903,7 +903,9 @@ "cell_type": "code", "execution_count": null, "id": "b5d9ab78-1b89-4530-a904-1d1e8475e949", - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "client.close();" From e2c77b0dfafaca336e7307262f64c9599d680f3f Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 14:46:39 -0600 Subject: [PATCH 37/38] suppress gufunc error formatting warning --- _config.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_config.yml b/_config.yml index 9cecd16b..d5d27504 100644 --- a/_config.yml +++ b/_config.yml @@ -68,7 +68,7 @@ sphinx: - sphinx_exercise config: # application/vnd.holoviews_load.v0+json, application/vnd.holoviews_exec.v0+json - suppress_warnings: ['mystnb.unknown_mime_type'] + suppress_warnings: ['mystnb.unknown_mime_type', 'misc.highlighting_failure'] notfound_context: body: "

Whoops! 404 Page Not Found

Sorry, this page doesn't exist. Many sections of this book have been updated recently.
Try the search box 🔎 to find what you're looking for!

" notfound_urls_prefix: / From 34e34c00c964f9df1d9e265b7e6127710ad881df Mon Sep 17 00:00:00 2001 From: dcherian Date: Tue, 27 Jun 2023 14:51:50 -0600 Subject: [PATCH 38/38] remove dask output spam --- advanced/apply_ufunc/dask_apply_ufunc.ipynb | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/advanced/apply_ufunc/dask_apply_ufunc.ipynb b/advanced/apply_ufunc/dask_apply_ufunc.ipynb index 0538c290..aacaab62 100644 --- a/advanced/apply_ufunc/dask_apply_ufunc.ipynb +++ b/advanced/apply_ufunc/dask_apply_ufunc.ipynb @@ -904,7 +904,9 @@ "execution_count": null, "id": "b5d9ab78-1b89-4530-a904-1d1e8475e949", "metadata": { - "tags": [] + "tags": [ + "remove-output" + ] }, "outputs": [], "source": [