netCDF and Xarray


Ezequiel Cimadevilla Álvarez


ezequiel.cimadevilla@unican.es

Santander Meteorology Group


ETSI Caminos, Department of Applied Mathematics and Computer Sciences
University of Cantabria
Avenida de los Castros s/n
39005 Santander, Spain
http://www.meteo.unican.es

Máster Data Science/Ciencia de Datos - 2020/2021


Agenda
● Multidimensional arrays in climate science
● netCDF
○ Library
○ Data model
● xarray
○ Labelled multidimensional arrays
Multidimensional arrays in climate science
● Climate data is the output of climate models and/or observations
● Climate data is multidimensional (e.g. tas.npy)
● IPCC Sixth Assessment Report (AR6)
Multidimensional arrays in climate science
● X (longitude), Y (latitude), T (time) - 3D
○ precipitation or mean sea level pressure
● X (longitude), Y (latitude), Z (level), T (time) - 4D
○ temperature at isobaric levels or wind speed profile
● X (longitude), Y (latitude), Z (level), T (time), E (realization) - 5D
○ multi-model, initialization and parameter ensembles
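As a rough illustration of the 3D, 4D and 5D cases above, the sketch below builds plain NumPy arrays with those shapes. The grid sizes and the axis order are assumptions made for the example, not values taken from any real dataset.

```python
import numpy as np

# Hypothetical grid sizes, chosen only for illustration.
n_time, n_level, n_lat, n_lon, n_member = 12, 5, 18, 36, 3

# 3D: T, Y, X (e.g. precipitation)
precipitation = np.zeros((n_time, n_lat, n_lon))
# 4D: T, Z, Y, X (e.g. temperature at isobaric levels)
temperature = np.zeros((n_time, n_level, n_lat, n_lon))
# 5D: E, T, Z, Y, X (e.g. an ensemble of realizations)
ensemble = np.zeros((n_member, n_time, n_level, n_lat, n_lon))

for name, arr in [("precipitation", precipitation),
                  ("temperature", temperature),
                  ("ensemble", ensemble)]:
    print(name, arr.ndim, arr.shape)

# With raw arrays, the meaning of each axis lives only in our heads:
# we must remember that axis 0 of `precipitation` is time.
time_mean = precipitation.mean(axis=0)
print(time_mean.shape)
```

Note that nothing in the arrays themselves records which axis is which; that is the gap the rest of the deck addresses.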
Multidimensional arrays in climate science
● Climate data is distributed by modelling institutions
○ ESGF (Earth System Grid Federation)
○ University of Cantabria owns a tier 2 data node ;)
● To avoid huge file sizes, datasets are often split by time and by variable

[Figure: a dataset and the files in the dataset]
Multidimensional arrays in climate science
● Metadata is a crucial requirement for scientific data
● What can you tell from the tas.npy dataset?
○ Global metadata - Institution? Model? Experiment?
○ Time - Units? Calendar (365 or 360 days)? Leap years?
○ Temperature - Units? Missing values? Mean, median, max/min? Point or cell?
Multidimensional arrays in climate science
● “Information about a document rather than document content” (Raggett, Le Hors
& Jacobs, 1999)
○ Units, coordinate systems, model and institution information...
● File name can provide metadata
Multidimensional arrays in climate science
● Open tas.npy, what can you say from it?
● We can use NumPy ‘np.save’ and ‘np.load’ to share our
multidimensional arrays but...
○ no metadata is available
○ no coordinates are available
○ on-disk representation is not optimal
■ C-contiguous or Fortran-contiguous
■ chunked
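The limitations listed above can be seen in a small round trip. This is a minimal sketch: the array is a made-up stand-in for tas.npy, and the file name tas_example.npy is hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for tas.npy, assumed to be ordered (time, lat, lon).
tas = np.arange(2 * 3 * 4, dtype="float32").reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tas_example.npy")
    np.save(path, tas)
    loaded = np.load(path)

# Only shape, dtype and memory order survive the round trip.
print(loaded.shape, loaded.dtype)

# There are no units, no coordinates and no dimension names to inspect.
print(hasattr(loaded, "attrs"))

# Memory order is either C (row-major) or Fortran (column-major).
tas_f = np.asfortranarray(tas)
print(loaded.flags["C_CONTIGUOUS"], tas_f.flags["F_CONTIGUOUS"])
```

The values come back intact, but anyone receiving the file still has to be told separately what the axes and units are.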
Multidimensional arrays in climate science
We want two different things

● Data model and disk storage format for multiple chunked multidimensional
arrays with metadata (netCDF)
● Library for semantic data analysis (xarray)
netCDF
Network Common Data Form

● Developed by Unidata
● “a set of software libraries and machine-independent data formats that
support the creation, access, and sharing of array-oriented scientific data”
● C, Python and Java APIs
netCDF
Network Common Data Form

● Data Model
○ Groups, Variables, Dimensions, Attributes, Data types
● Classic vs. version 4 data models
○ Files written in the netCDF-4 format are valid HDF5 files
○ Alternative backends can be implemented (NCZarr)
Multidimensional arrays in climate science
● netCDF files contain multiple multidimensional arrays (variables) and metadata
○ See the source of tas.npy
● netCDF variables can be stored on disk using a contiguous or a chunked layout
● netCDF files can be opened using the netCDF4-python library
netCDF
● Well integrated within the Python ecosystem
netCDF
● Format used for international climate research projects (CMIP6, CORDEX)
○ ESGF (Earth System Grid Federation)
● To avoid huge file sizes, datasets are often split by time and by variable

[Figure: a dataset and the files in the dataset]
netCDF
CF Conventions

● Metadata that provide a definitive description of what the data in each variable
represents, and the spatial and temporal properties of the data.
● Interoperability between applications that are “CF compliant”
● Standard table of variable standard names
● See CMIP6_ScenarioMIP_CSIRO_ACCESS-ESM1-5_ssp585_r1i1p1f1_Amon again
xarray
● Xarray introduces labels in the form of dimensions, coordinates and attributes
on top of raw NumPy-like arrays
● More intuitive, more concise, and less error-prone developer experience
● Real-world datasets are usually more than just raw numbers
○ Labels which encode information about how the array values map to locations in
space, time, etc.
xarray
● Apply operations over dimensions by name: x.sum('time')
● Select values by label (or logical location) instead of integer location:
x.loc['2014-01-01'] or x.sel(time='2014-01-01')
● Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array
broadcasting) based on dimension names, not shape.
● Easily use the split-apply-combine paradigm with groupby:
x.groupby('time.dayofyear').mean()
● Database-like alignment based on coordinate labels that smoothly handles
missing values: x, y = xr.align(x, y, join='outer')
● Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs
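The operations listed above can be tried on a tiny in-memory array. This is a sketch under stated assumptions: it needs xarray, pandas and NumPy installed, and the data, dates and coordinate values are invented for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 4-day, 3-latitude array with labels and metadata.
time = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "lat"),
    coords={"time": time, "lat": [-45.0, 0.0, 45.0]},
    attrs={"units": "K"},
)

# Reduce over a dimension by name instead of by axis number.
total = x.sum("time")

# Select by label instead of by integer position.
first_day = x.sel(time="2014-01-01")

# Broadcasting follows dimension names, not shapes:
# (time, lat) times (lon) gives (time, lat, lon).
lon_factor = xr.DataArray([1.0, 2.0], dims="lon")
product = x * lon_factor

# Split-apply-combine with groupby.
daily = x.groupby("time.dayofyear").mean()

# Outer alignment fills non-overlapping labels with NaN.
a = x.isel(time=slice(0, 3))
b = x.isel(time=slice(1, 4))
a2, b2 = xr.align(a, b, join="outer")

print(total.values)
print(product.dims)
print(a2.sizes["time"], x.attrs)
```

Each call refers to "time" or "lat" by name, so the code stays valid even if the axis order of the underlying array changes.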
xarray

Two core data structures, which build upon and extend the core strengths of NumPy
and pandas. Both data structures are fundamentally N-dimensional:

● DataArray is the implementation of a labeled, N-dimensional array. It is an N-D
generalization of a pandas.Series. The name DataArray itself is borrowed from
Fernando Perez’s datarray project, which prototyped a similar data structure.
● Dataset is a multi-dimensional, in-memory array database. It is a dict-like
container of DataArray objects aligned along any number of shared dimensions,
and serves a similar purpose in xarray to the pandas.DataFrame.
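To make the two structures concrete, the sketch below builds two DataArray objects and combines them into a Dataset. It assumes xarray, pandas and NumPy are installed; the variable names, values and attributes are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2020-01-01", periods=2, freq="D")
lat = [-45.0, 0.0, 45.0]
coords = {"time": time, "lat": lat}

# Two labelled arrays sharing the same dimensions and coordinates.
tas = xr.DataArray(np.full((2, 3), 288.0), dims=("time", "lat"),
                   coords=coords, name="tas", attrs={"units": "K"})
pr = xr.DataArray(np.zeros((2, 3)), dims=("time", "lat"),
                  coords=coords, name="pr", attrs={"units": "kg m-2 s-1"})

# A Dataset is a dict-like container of aligned DataArray objects.
ds = xr.Dataset({"tas": tas, "pr": pr},
                attrs={"title": "hypothetical example"})

print(list(ds.data_vars))
print(dict(ds.sizes))

# The pandas analogies: a 1-D DataArray converts to a Series,
# and a Dataset converts to a DataFrame.
series = tas.sel(lat=0.0).to_series()
frame = ds.to_dataframe()
print(len(series), frame.shape)
```

The DataFrame has one row per (time, lat) pair and one column per variable, which shows how the Dataset plays the DataFrame role for N-dimensional data.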
xarray
● Heavily inspired by the netCDF data model
○ However, it does not model groups, just Dataset and DataArray
○ Multiple backends: netcdf4-python, zarr
● Support for DataArrays that do not fit into memory via Dask
● Support for remote data analysis via DAP (Data Access Protocol)
○ Because it uses netcdf4-python as backend
○ Only requested data is sent over the network