netCDF and Xarray

netCDF and xarray
Ezequiel Cimadevilla Álvarez

ezequiel.cimadevilla@unican.es
Santander Meteorology Group

ETSI Caminos, Department of Applied Mathematics and Computer Sciences
University of Cantabria
Avenida de los Castros s/n
39005 Santander, Spain
http://www.meteo.unican.es
Máster Data Science/Ciencia de Datos - 2020/2021

Agenda
● Multidimensional arrays in climate science
● netCDF
○ Library
○ Data model
● xarray
○ Labelled multidimensional arrays
Multidimensional arrays in climate science
● Climate data is the output of climate models and/or observations
● Climate data is multidimensional (e.g. tas.npy)
● IPCC Sixth Assessment Report (AR6)
● X (longitude), Y (latitude), T (time) - 3D
○ precipitation or mean sea level pressure
● X (longitude), Y (latitude), Z (level), T (time) - 4D
○ temperature at isobaric levels or wind speed profile
● X (longitude), Y (latitude), Z (level), T (time), E (realization) - 5D
○ multi-model, initialization and parameter ensembles
● Climate data is distributed by modelling institutions
○ ESGF (Earth System Grid Federation)
○ University of Cantabria owns a tier 2 data node ;)
● To avoid huge file sizes, datasets are often split by time and by variable
[Figure: a dataset and the files in the dataset]
● Metadata is a crucial requirement for scientific data
● What can you tell from the tas.npy dataset?
○ Global metadata - Institution? Model? Experiment?
○ Time - Units? Calendar (365 or 360 days)? Leap years?
○ Temperature - Units? Missing values? Mean, median, max/min? Point or cell?
● “Information about a document rather than document content” (Raggett, Le Hors
& Jacobs, 1999)
○ Units, coordinate systems, model and institution information...
● File name can provide metadata
● Open tas.npy, what can you say from it?
● We can use NumPy ‘np.save’ and ‘np.load’ to share our
multidimensional arrays but...
○ no metadata is available
○ no coordinates are available
○ disk representation is not optimal
■ C-contiguous or F-contiguous
■ chunked
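A minimal sketch of the np.save / np.load round trip described above. The array values, its shape and the file names (tas_demo.npy, tas_demo_f.npy) are made up for illustration; this is not the tas.npy file used in the course.

```python
import os
import tempfile

import numpy as np

# Hypothetical (time, lat, lon) temperature array with synthetic values.
rng = np.random.default_rng(0)
tas = rng.normal(288.0, 5.0, size=(12, 3, 4)).astype("float32")

# Save and reload the array in NumPy's .npy format.
path = os.path.join(tempfile.mkdtemp(), "tas_demo.npy")
np.save(path, tas)
loaded = np.load(path)

# Shape, dtype and values come back unchanged, but nothing in the file
# says which axis is time, what the units are or which calendar applies.
print(loaded.shape, loaded.dtype)

# The only layout choice is the memory order: C (row-major) or
# Fortran (column-major). There is no chunked option.
path_f = os.path.join(tempfile.mkdtemp(), "tas_demo_f.npy")
np.save(path_f, np.asfortranarray(tas))
loaded_f = np.load(path_f)
print(loaded.flags["C_CONTIGUOUS"], loaded_f.flags["F_CONTIGUOUS"])
```

Anyone receiving such a file has to learn the axis order, units and coordinates from somewhere else, such as the file name or an accompanying document.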
We want two different things
● Data model and disk storage format for multiple chunked multidimensional
arrays with metadata (netCDF)
● Library for semantic data analysis (xarray)
netCDF
netCDF
Network Common Data Form
● Developed by Unidata
● “a set of software libraries
and machine-independent
data formats that support
the creation, access, and
sharing of array-oriented
scientific data”
● C, Python and Java APIs
netCDF
Network Common Data Form
● Data Model
○ Groups, Variables, Dimensions,
Attributes, Data types
● Classic vs version 4 data models
○ Since version 4, netCDF files are
valid HDF5 files
○ Alternative backends can be
implemented (NCZarr)
● netCDF files contain multiple multidimensional arrays (variables) and metadata
○ See the source of tas.npy
● netCDF variables can be stored on disk using a contiguous or a chunked layout
● netCDF files can be opened using the netCDF4-python library
netCDF
● Well integrated within the Python ecosystem
netCDF
● Format used for international climate research projects (CMIP6, CORDEX)
○ ESGF (Earth System Grid Federation)
● To avoid huge file sizes, datasets are often split by time and by variable
[Figure: a dataset and the files in the dataset]
netCDF
CF Conventions
● Metadata that provide a definitive description of what the data in each variable
represents, and the spatial and temporal properties of the data.
● Interoperability between applications that are “CF compliant”
● Standard table of variable standard names
● See CMIP6_ScenarioMIP_CSIRO_ACCESS-ESM1-5_ssp585_r1i1p1f1_Amon
again
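As a rough illustration of why CF metadata matters, the sketch below lets xarray interpret CF-style attributes (units, calendar, _FillValue) on a small in-memory dataset. The values and attribute choices are invented for the example and are not taken from the CMIP6 file mentioned above.

```python
import numpy as np
import xarray as xr

# Raw values as they might sit in a file: time as numbers, missing data as 1e20.
raw = xr.Dataset(
    {
        "tas": (
            ("time",),
            np.array([280.0, 281.0, 1e20]),
            {
                "units": "K",
                "standard_name": "air_temperature",
                "_FillValue": 1e20,
            },
        )
    },
    coords={
        "time": (
            "time",
            np.array([0, 1, 2]),
            {"units": "days since 2000-01-01", "calendar": "standard"},
        )
    },
)

# decode_cf applies the conventions: numbers become dates, fill values become NaN.
decoded = xr.decode_cf(raw)

print(decoded["time"].values)
print(decoded["tas"].values)
```

Any application that understands the same conventions can make this interpretation without extra information from the data provider.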
xarray
xarray
● Xarray introduces labels in the form of dimensions, coordinates and attributes
on top of raw NumPy-like arrays
● More intuitive, more concise, and less error-prone developer experience
● Real-world datasets are usually more than just raw numbers
○ Labels which encode information about how the array values map to locations in
space, time, etc.
xarray
● Apply operations over dimensions by name: x.sum('time')
● Select values by label (or logical location) instead of integer location:
x.loc['2014-01-01'] or x.sel(time='2014-01-01')
● Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array
broadcasting) based on dimension names, not shape.
● Easily use the split-apply-combine paradigm with groupby:
x.groupby('time.dayofyear').mean()
● Database-like alignment based on coordinate labels that smoothly handles
missing values: x, y = xr.align(x, y, join='outer')
● Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs
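The operations listed above can be tried on a small synthetic array. This is a sketch only: the variable names, coordinates and values are invented for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(8.0).reshape(4, 2),
    dims=("time", "lat"),
    coords={"time": time, "lat": [40.0, 45.0]},
    attrs={"units": "K"},
    name="tas",
)

# Reduce over a dimension by name instead of by axis number.
total = x.sum("time")

# Select by label instead of by integer position.
first_day = x.sel(time="2014-01-01")

# Broadcasting is driven by dimension names: y only has a 'lat' dimension.
y = xr.DataArray([10.0, 20.0], dims="lat", coords={"lat": [40.0, 45.0]})
diff = x - y

# Split-apply-combine: one group per day of the year.
daily_mean = x.groupby("time.dayofyear").mean("time")

# Outer alignment: the union of time labels, with NaN where data is missing.
a, b = xr.align(x.isel(time=slice(0, 3)), x.isel(time=slice(1, 4)), join="outer")

# Arbitrary metadata travels with the array.
print(total.values, diff.dims, daily_mean.sizes, a.sizes, x.attrs)
```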
xarray
Two core data structures, which build upon and extend the core strengths of NumPy
and pandas. Both data structures are fundamentally N-dimensional:
● DataArray is the implementation of a labeled, N-dimensional array. It is an N-D
generalization of a pandas.Series. The name DataArray itself is borrowed from
Fernando Perez’s datarray project, which prototyped a similar data structure.
● Dataset is a multi-dimensional, in-memory array database. It is a dict-like
container of DataArray objects aligned along any number of shared dimensions,
and serves a similar purpose in xarray to the pandas.DataFrame.
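A small sketch of how the two structures relate, using invented variables (tas, pr) and coordinates: two DataArray objects that share dimensions are collected into one Dataset, and both convert to their pandas counterparts.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2020-01-01", periods=3, freq="D")
coords = {"time": time, "lat": [40.0, 45.0]}

# Two labeled arrays on the same (time, lat) grid; values are synthetic.
tas = xr.DataArray(
    np.full((3, 2), 285.0), dims=("time", "lat"), coords=coords,
    name="tas", attrs={"units": "K"},
)
pr = xr.DataArray(
    np.zeros((3, 2)), dims=("time", "lat"), coords=coords,
    name="pr", attrs={"units": "kg m-2 s-1"},
)

# A Dataset is a dict-like container of DataArrays with shared dimensions.
ds = xr.Dataset({"tas": tas, "pr": pr}, attrs={"title": "Hypothetical example"})
var_names = sorted(ds.data_vars)
tas_again = ds["tas"]

# pandas analogues: DataArray -> Series, Dataset -> DataFrame.
series = tas.isel(lat=0).to_series()
frame = ds.to_dataframe()

print(ds)
print(frame.head())
```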
xarray
● Heavily inspired by the netCDF data model
○ However, it does not model groups, just Dataset and DataArray
○ Multiple backends: netcdf4-python, zarr
● Support for DataArrays that do not fit into memory via Dask
● Support for remote data analysis via DAP (Data Access Protocol)
○ Because it uses netcdf4-python as a backend
○ Only requested data is sent over the network