Commit 8b46581

scottyhq and JessicaS11 authored
Add 2023 video link, update 2024 schedule (#284)
* add 2023 video link, update 2024 schedule
* tidy outline
* reorg data structure intro, add malaria example
* Apply suggestions from code review

Co-authored-by: Jessica Scheick <JessicaS11@users.noreply.github.com>

* local malaria figure, update default links

---------

Co-authored-by: Jessica Scheick <JessicaS11@users.noreply.github.com>
1 parent 8c6f71b commit 8b46581

7 files changed: +113 -115 lines changed

.devcontainer/scipy2024/devcontainer.json

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -12,7 +12,7 @@
1212
},
1313
"customizations": {
1414
"codespaces": {
15-
"openFiles": ["workshops/scipy2024/README.md"]
15+
"openFiles": ["workshops/scipy2024/index.ipynb"]
1616
},
1717
"vscode": {
1818
"extensions": ["ms-toolsai.jupyter", "ms-python.python"]

README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
22

33
[![CI](https://github.com/xarray-contrib/xarray-tutorial/workflows/CI/badge.svg?branch=main)](https://github.com/xarray-contrib/xarray-tutorial/actions?query=branch%3Amain)
44
[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://tutorial.xarray.dev)
5-
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/xarray-contrib/xarray-tutorial/HEAD?labpath=overview/fundamental-path/index.ipynb)
5+
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/xarray-contrib/xarray-tutorial/HEAD?labpath=workshops/scipy2024/index.ipynb)
66

77
This is the repository for a Jupyter Book website with tutorial material for [Xarray](https://github.com/pydata/xarray), _an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!_
88

fundamentals/01_data_structures.md

Lines changed: 64 additions & 0 deletions
@@ -1,5 +1,69 @@
11
# Data Structures
22

3+
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)
4+
are an essential part of computational science. They are encountered in a wide
5+
range of fields, including physics, astronomy, geoscience, bioinformatics,
6+
engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)
7+
provides the fundamental data structure and API for working with raw ND arrays.
8+
However, real-world datasets are usually more than just raw numbers; they have
9+
labels which encode information about how the array values map to locations in
10+
space, time, etc.
11+
12+
The N-dimensional nature of Xarray’s data structures makes it suitable for
13+
dealing with multi-dimensional scientific data, and its use of dimension names
14+
instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much
15+
more manageable than the raw NumPy ndarray: with Xarray, you don’t need to keep
16+
track of the order of an array’s dimensions or insert dummy dimensions of size 1
17+
to align arrays (e.g., using np.newaxis).
18+
19+
The immediate payoff of using Xarray is that you’ll write less code. The
20+
long-term payoff is that you’ll understand what you were thinking when you come
21+
back to look at it weeks or months later.
22+
23+
## Example: Weather forecast
24+
25+
Here is an example of how we might structure a dataset for a weather forecast:
26+
27+
<img src="https://docs.xarray.dev/en/stable/_images/dataset-diagram.png" align="center" width="80%">
28+
29+
You'll notice multiple data variables (temperature, precipitation), coordinate
30+
variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these
31+
fit into Xarray's data structures below.
32+
33+
Xarray doesn’t just keep track of labels on arrays – it uses them to provide a
34+
powerful and concise interface. For example:
35+
36+
- Apply operations over dimensions by name: `x.sum('time')`.
37+
38+
- Select values by label (or logical location) instead of integer location:
39+
`x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.
40+
41+
- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions
42+
(array broadcasting) based on dimension names, not shape.
43+
44+
- Easily use the split-apply-combine paradigm with groupby:
45+
`x.groupby('time.dayofyear').mean()`.
46+
47+
- Database-like alignment based on coordinate labels that smoothly handles
48+
missing values: `x, y = xr.align(x, y, join='outer')`.
49+
50+
- Keep track of arbitrary metadata in the form of a Python dictionary:
51+
`x.attrs`.
52+
53+
## Example: Mosquito genetics
54+
55+
Although the Xarray library was originally developed with Earth Science datasets in mind, the data structures work well across many other domains! For example, below is a side-by-side view of a data schematic on the left and the Xarray Dataset representation on the right, taken from a mosquito genetics analysis:
56+
57+
![malaria_dataset](../images/malaria_dataset.png)
58+
59+
The data can be stored as a 3-dimensional array, where one dimension of the array corresponds to positions (**variants**) within a reference genome, another dimension corresponds to the individual mosquitoes that were sequenced (**samples**), and a third dimension corresponds to the number of genomes within each individual (**ploidy**).
60+
61+
You can explore this dataset in detail via the [training course in data analysis for genomic surveillance of African malaria vectors](https://anopheles-genomic-surveillance.github.io/workshop-5/module-1-xarray.html)!
62+
63+
## Explore on your own
64+
65+
The following collection of notebooks provides interactive code examples for working with example datasets and constructing Xarray data structures manually.
66+
367
```{tableofcontents}
468
569
```
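
The bullet list added to `fundamentals/01_data_structures.md` names several label-based operations. A minimal sketch of them on toy data might look like the following; the arrays `x` and `y`, their coordinate values, and the `units` attribute are made up for illustration and are not part of the commit.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 daily time steps x 3 latitudes
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "lat"),
    coords={
        "time": pd.date_range("2014-01-01", periods=4, freq="D"),
        "lat": [10.0, 20.0, 30.0],
    },
    name="x",
    attrs={"units": "K"},
)

# Reduce over a dimension by name instead of by axis number
total = x.sum("time")

# Select by coordinate label instead of integer position
first_day = x.sel(time="2014-01-01")

# Arithmetic broadcasts by dimension name: y only has a "lon" dimension
y = xr.DataArray([100.0, 200.0], dims="lon", coords={"lon": [0.0, 5.0]})
diff = x - y

# Split-apply-combine with groupby on a datetime component
daily_mean = x.groupby("time.dayofyear").mean()

# Outer alignment pads labels missing from either input with NaN
a = x.isel(time=slice(0, 3))
b = x.isel(time=slice(1, 4))
a2, b2 = xr.align(a, b, join="outer")

# Arbitrary metadata lives in a plain dictionary
print(total.dims, diff.dims, a2.sizes["time"], x.attrs)
```

Because every operation refers to `time`, `lat`, or `lon` by name, none of the calls depend on the order in which the dimensions were declared.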

fundamentals/01_datastructures.ipynb

Lines changed: 34 additions & 107 deletions
@@ -6,62 +6,13 @@
66
"source": [
77
"# Xarray's Data structures\n",
88
"\n",
9-
"In this lesson, we cover the basics of Xarray data structures. Our\n",
10-
"learning goals are as follows. By the end of the lesson, we will be able to:\n",
9+
"In this lesson, we cover the basics of Xarray data structures. By the end of the lesson, we will be able to:\n",
1110
"\n",
12-
"- Understand the basic data structures (`DataArray` and `Dataset` objects) in Xarray\n",
13-
"\n",
14-
"---\n",
15-
"\n",
16-
"## Introduction\n",
17-
"\n",
18-
"Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)\n",
19-
"are an essential part of computational science. They are encountered in a wide\n",
20-
"range of fields, including physics, astronomy, geoscience, bioinformatics,\n",
21-
"engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)\n",
22-
"provides the fundamental data structure and API for working with raw ND arrays.\n",
23-
"However, real-world datasets are usually more than just raw numbers; they have\n",
24-
"labels which encode information about how the array values map to locations in\n",
25-
"space, time, etc.\n",
26-
"\n",
27-
"Here is an example of how we might structure a dataset for a weather forecast:\n",
28-
"\n",
29-
"<img src=\"https://docs.xarray.dev/en/stable/_images/dataset-diagram.png\" align=\"center\" width=\"80%\">\n",
30-
"\n",
31-
"You'll notice multiple data variables (temperature, precipitation), coordinate\n",
32-
"variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these\n",
33-
"fit into Xarray's data structures below.\n",
34-
"\n",
35-
"Xarray doesn’t just keep track of labels on arrays – it uses them to provide a\n",
36-
"powerful and concise interface. For example:\n",
37-
"\n",
38-
"- Apply operations over dimensions by name: `x.sum('time')`.\n",
39-
"\n",
40-
"- Select values by label (or logical location) instead of integer location:\n",
41-
" `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.\n",
42-
"\n",
43-
"- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions\n",
44-
" (array broadcasting) based on dimension names, not shape.\n",
45-
"\n",
46-
"- Easily use the split-apply-combine paradigm with groupby:\n",
47-
" `x.groupby('time.dayofyear').mean()`.\n",
48-
"\n",
49-
"- Database-like alignment based on coordinate labels that smoothly handles\n",
50-
" missing values: `x, y = xr.align(x, y, join='outer')`.\n",
51-
"\n",
52-
"- Keep track of arbitrary metadata in the form of a Python dictionary:\n",
53-
" `x.attrs`.\n",
54-
"\n",
55-
"The N-dimensional nature of xarray’s data structures makes it suitable for\n",
56-
"dealing with multi-dimensional scientific data, and its use of dimension names\n",
57-
"instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much\n",
58-
"more manageable than the raw numpy ndarray: with xarray, you don’t need to keep\n",
59-
"track of the order of an array’s dimensions or insert dummy dimensions of size 1\n",
60-
"to align arrays (e.g., using np.newaxis).\n",
61-
"\n",
62-
"The immediate payoff of using xarray is that you’ll write less code. The\n",
63-
"long-term payoff is that you’ll understand what you were thinking when you come\n",
64-
"back to look at it weeks or months later.\n"
11+
":::{admonition} Learning Goals\n",
12+
"- Understand the basic Xarray data structures `DataArray` and `Dataset` \n",
13+
"- Customize the display of Xarray data structures\n",
14+
"- The connection between Pandas and Xarray data structures\n",
15+
":::"
6516
]
6617
},
6718
{
@@ -72,13 +23,10 @@
7223
"\n",
7324
"Xarray provides two data structures: the `DataArray` and `Dataset`. The\n",
7425
"`DataArray` class attaches dimension names, coordinates and attributes to\n",
75-
"multi-dimensional arrays while `Dataset` combines multiple arrays.\n",
26+
"multi-dimensional arrays while `Dataset` combines multiple DataArrays.\n",
7627
"\n",
7728
"Both classes are most commonly created by reading data.\n",
78-
"To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial.\n",
79-
"\n",
80-
"Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n",
81-
"We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name."
29+
"To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial."
8230
]
8331
},
8432
{
@@ -88,7 +36,13 @@
8836
"outputs": [],
8937
"source": [
9038
"import numpy as np\n",
91-
"import xarray as xr"
39+
"import xarray as xr\n",
40+
"import pandas as pd\n",
41+
"\n",
42+
"# When working in a Jupyter Notebook you might want to customize Xarray display settings to your liking\n",
43+
"# The following settings reduce the amount of data displayed by default\n",
44+
"xr.set_options(display_expand_attrs=False, display_expand_data=False)\n",
45+
"np.set_printoptions(threshold=10, edgeitems=2)"
9246
]
9347
},
9448
{
@@ -97,7 +51,10 @@
9751
"source": [
9852
"### Dataset\n",
9953
"\n",
100-
"`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n"
54+
"`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n",
55+
"\n",
56+
"Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n",
57+
"We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name."
10158
]
10259
},
10360
{
@@ -147,14 +104,14 @@
147104
"cell_type": "markdown",
148105
"metadata": {},
149106
"source": [
150-
"#### What is all this anyway? (String representations)\n",
107+
"#### HTML vs text representations\n",
151108
"\n",
152109
"Xarray has two representation types: `\"html\"` (which is only available in\n",
153110
"notebooks) and `\"text\"`. To choose between them, use the `display_style` option.\n",
154111
"\n",
155112
"So far, our notebook has automatically displayed the `\"html\"` representation (which we will continue using).\n",
156-
"The `\"html\"` representation is interactive, allowing you to collapse sections (left arrows) and\n",
157-
"view attributes and values for each value (right hand sheet icon and data symbol)."
113+
"The `\"html\"` representation is interactive, allowing you to collapse sections (arrow icons) and\n",
114+
"view attributes and values for each value (📄 and the data-symbol icon)."
158115
]
159116
},
160117
{
@@ -171,18 +128,13 @@
171128
"cell_type": "markdown",
172129
"metadata": {},
173130
"source": [
174-
"The output consists of:\n",
131+
"☝️ From top to bottom the output consists of:\n",
175132
"\n",
176-
"- a summary of all *dimensions* of the `Dataset` `(lat: 25, time: 2920, lon: 53)`: this tells us that the first\n",
177-
" dimension is named `lat` and has a size of `25`, the second dimension is named\n",
178-
" `time` and has a size of `2920`, and the third dimension is named `lon` and has a size\n",
179-
" of `53`. Because we will access the dimensions by name, the order doesn't matter.\n",
180-
"- an unordered list of *coordinates* or dimensions with coordinates with one item\n",
181-
" per line. Each item has a name, one or more dimensions in parentheses, a dtype\n",
182-
" and a preview of the values. Also, if it is a dimension coordinate, it will be\n",
183-
" marked with a `*`.\n",
184-
"- an alphabetically sorted list of *dimensions without coordinates* (if there are any)\n",
185-
"- an unordered list of *attributes*, or metadata"
133+
"- **Dimensions**: summary of all *dimensions* of the `Dataset` `(lat: 25, time: 2920, lon: 53)`: this tells us that the first dimension is named `lat` and has a size of `25`, the second dimension is named `time` and has a size of `2920`, and the third dimension is named `lon` and has a size of `53`. Because we will access the dimensions by name, the order doesn't matter.\n",
134+
"- **Coordinates**: an unordered list of *coordinates* or dimensions with coordinates with one item per line. Each item has a name, one or more dimensions in parentheses, a dtype and a preview of the values. Also, if it is a dimension coordinate, it will be printed in **bold** font. *dimensions without coordinates* appear in plain font (there are none in this example, but you might imagine a 'mask' coordinate that has a value assigned at every point).\n",
135+
"- **Data variables**: names of each nD *measurement* in the dataset, followed by its dimensions `(time, lat, lon)`, dtype, and a preview of values.\n",
136+
"- **Indexes**: Each dimension with coordinates is backed by an \"Index\". In this example, each dimension is backed by a `PandasIndex`\n",
137+
"- **Attributes**: an unordered list of metadata (for example, a paragraph describing the dataset)"
186138
]
187139
},
188140
{
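
The sections of the `Dataset` repr described in the hunk above each correspond to a property on the object. A small sketch, using a made-up stand-in for the `air_temperature` dataset (the sizes, coordinate values, and `title` attribute are assumptions, chosen so that no download is needed):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical miniature dataset with the same dimension names as the tutorial data
ds = xr.Dataset(
    data_vars={"air": (("time", "lat", "lon"), np.zeros((4, 2, 3)))},
    coords={
        "time": pd.date_range("2013-01-01", periods=4, freq="D"),
        "lat": [75.0, 72.5],
        "lon": [200.0, 202.5, 205.0],
    },
    attrs={"title": "toy stand-in for air_temperature"},
)

print(dict(ds.sizes))      # Dimensions: name -> length
print(list(ds.coords))     # Coordinates
print(list(ds.data_vars))  # Data variables
print(list(ds.indexes))    # Indexes: one pandas index per dimension coordinate
print(ds.attrs)            # Attributes: free-form metadata dictionary
```

Inspecting these properties directly is often easier than reading the repr when a dataset has many variables.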
@@ -379,15 +331,6 @@
379331
"methods on `xarray` objects:\n"
380332
]
381333
},
382-
{
383-
"cell_type": "code",
384-
"execution_count": null,
385-
"metadata": {},
386-
"outputs": [],
387-
"source": [
388-
"import pandas as pd"
389-
]
390-
},
391334
{
392335
"cell_type": "code",
393336
"execution_count": null,
@@ -429,8 +372,8 @@
429372
"cell_type": "markdown",
430373
"metadata": {},
431374
"source": [
432-
"**<code>to_series</code>**: This will always convert `DataArray` objects to\n",
433-
"`pandas.Series`, using a `MultiIndex` for higher dimensions\n"
375+
"### to_series\n",
376+
"This will always convert `DataArray` objects to `pandas.Series`, using a `MultiIndex` for higher dimensions\n"
434377
]
435378
},
436379
{
@@ -446,9 +389,10 @@
446389
"cell_type": "markdown",
447390
"metadata": {},
448391
"source": [
449-
"**<code>to_dataframe</code>**: This will always convert `DataArray` or `Dataset`\n",
450-
"objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named\n",
451-
"for this.\n"
392+
"### to_dataframe\n",
393+
"\n",
394+
"This will always convert `DataArray` or `Dataset` objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named for this. Since columns in a `DataFrame` need to have the same index, they are\n",
395+
"broadcasted."
452396
]
453397
},
454398
{
@@ -459,23 +403,6 @@
459403
"source": [
460404
"ds.air.to_dataframe()"
461405
]
462-
},
463-
{
464-
"cell_type": "markdown",
465-
"metadata": {},
466-
"source": [
467-
"Since columns in a `DataFrame` need to have the same index, they are\n",
468-
"broadcasted.\n"
469-
]
470-
},
471-
{
472-
"cell_type": "code",
473-
"execution_count": null,
474-
"metadata": {},
475-
"outputs": [],
476-
"source": [
477-
"ds.to_dataframe()"
478-
]
479406
}
480407
],
481408
"metadata": {

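The `to_series` and `to_dataframe` text in the notebook diff above can be illustrated on a small array. This is a sketch under assumptions: the array `da`, the extra `lat_weight` variable, and all values are hypothetical and chosen only to make the broadcasting visible.

```python
import numpy as np
import xarray as xr

# Hypothetical named 2-D array; to_dataframe requires a name
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [10, 20], "lon": [0, 1, 2]},
    name="air",
)

# A 2-D DataArray becomes a Series with a (lat, lon) MultiIndex
series = da.to_series()

# The same data as a single-column DataFrame named after the array
df = da.to_dataframe()

# In a Dataset, variables with fewer dimensions are broadcast so that
# every column shares the same (lat, lon) index
ds = xr.Dataset({"air": da, "lat_weight": ("lat", [0.5, 1.0])})
wide = ds.to_dataframe()

print(series.index.names, list(df.columns), wide.shape)
```

In `wide`, each `lat_weight` value is repeated once per `lon` label, which is the broadcasting the notebook text refers to.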
images/malaria_dataset.png

429 KB

workshops/scipy2023/README.md

Lines changed: 4 additions & 0 deletions
@@ -14,6 +14,10 @@ Organized by:
1414

1515
## Instructions
1616

17+
:::{note}
18+
You can access a recording of this tutorial [here](https://www.youtube.com/watch?v=L4FXcIOMlsY)
19+
:::
20+
1721
### Running Locally
1822

1923
See instructions to set up the environment for running the tutorial material [here](get-started).

workshops/scipy2024/index.ipynb

Lines changed: 9 additions & 6 deletions
@@ -13,14 +13,14 @@
1313
"\n",
1414
"**Xarray**: *Friendly, Interactive, and Scalable Scientific Data Analysis*\n",
1515
"\n",
16-
"July 8, 13:30–17:30 (US/Pacific), Room 317\n",
16+
"July 8, 13:30–17:30 (US/Pacific), Tacoma Convention Center Ballroom B/C\n",
1717
"\n",
1818
"This *4-hour* workshop will explore content from [the Xarray tutorial](https://tutorial.xarray.dev), which contains a comprehensive collection of hands-on tutorial Jupyter Notebooks. We will review a curated set of examples that will prepare you for increasingly complex real-world data analysis tasks!\n",
1919
"\n",
2020
":::{admonition} Learning Goals\n",
2121
"- Orient yourself to Xarray resources to continue on your Xarray journey!\n",
2222
"- Effectively use Xarray’s multidimensional indexing and computational patterns\n",
23-
"- Understand how Xarray can wrap other array types in the scientific Python ecosystem\n",
23+
"- Understand how Xarray integrates with other libraries in the scientific Python ecosystem\n",
2424
"- Learn how to leverage Xarray’s powerful backend and extension capabilities to customize workflows and open a variety of scientific datasets\n",
2525
":::\n",
2626
"\n",
@@ -33,13 +33,13 @@
3333
"| Topic | Time | Notebook Links | \n",
3434
"| :- | - | - | \n",
3535
"| Introduction and Setup | 1:30 (10 min) | --- | \n",
36-
"| Xarray Data Model, Backends, Extensions | 1:40 (40 min) | [Quick Introduction to Indexing](../../fundamentals/02.1_indexing_Basic.ipynb) <br> [Boolean Indexing & Masking](../../intermediate/indexing/boolean-masking-indexing.ipynb) | \n",
36+
"| The Xarray Data Model | 1:40 (40 min) | [Data structures](../../fundamentals/01_data_structures.md) <br> [Basic Indexing](../../fundamentals/02.1_indexing_Basic.ipynb) | \n",
3737
"| *10 minute Break* \n",
38-
"| Computational Patterns | 2:30 (50 min) | [Advanced Indexing](../../intermediate/indexing/advanced-indexing.ipynb) <br> [Computation Patterns](../../intermediate/01-high-level-computation-patterns.ipynb) <br> | \n",
38+
"| Indexing & Computational Patterns | 2:30 (50 min) | [Advanced Indexing](../../intermediate/indexing/indexing.md) <br> [Computational Patterns](../../intermediate/01-high-level-computation-patterns.ipynb) <br> | \n",
3939
"| *10 minute Break* | \n",
40-
"| Wrapping other arrays | 3:30 (50 min) | [The Xarray Ecosystem](../../intermediate/xarray_ecosystem.ipynb) <br> [Accessors](../../advanced/accessors/01_accessor_examples.ipynb) <br> [Backends](../../advanced/backends/1.Backend_without_Lazy_Loading.ipynb) <br> | \n",
40+
"| Xarray Integrations and Extensions | 3:30 (50 min) | [The Xarray Ecosystem](../../intermediate/xarray_ecosystem.ipynb) | \n",
4141
"| *10 minute Break* | \n",
42-
"| Synthesis, Explore your data! | 4:30 (50 min) <br> | [Data Tidying](../../intermediate/data_cleaning/05.1_intro.md) <br> |\n",
42+
"| Backends & Remote data| 4:30 (50 min) | [Remote Data](../../intermediate/remote_data/remote-data.ipynb) |\n",
4343
"| | End 5:30 | |\n",
4444
"\n",
4545
"\n",
@@ -66,6 +66,9 @@
6666
"- Once you see a url to click within the terminal, simply `cmd + click` the given url.\n",
6767
"- This will open up another tab in your browser, leading to a [Jupyter Lab](https://jupyterlab.readthedocs.io/en/latest/) Interface.\n",
6868
"\n",
69+
":::{warning}\n",
70+
"Consider Codespaces as ephemeral environments. You may lose your connection and any edits you make.\n",
71+
":::\n",
6972
"\n",
7073
"\n",
7174
"## Thanks for attending!\n",
