# Data assimilation in coastal ocean models

This presentation is on-line at http://tinyurl.com/ocean-assim

Scripts are available at http://modb.oce.ulg.ac.be/mediawiki/index.php/Ocean_Assimilation_Kit

Hit the space bar for next slide

# Data assimilation (basic concepts)

Data assimilation aims to combine model results and observation in clever way.

• state vector \mathbf x_n: model results at a given time t_n
• model f_n: model results at a given time t_n
• \mathbf x_{n+1} = f_n(\mathbf x_n)
• observation vector \mathbf y^o_n: all observations at a given time t_n

# Ways to represent uncertainty

• Neither the model nor the observations are perfect.
• Both have errors (uncertainty).
• error = systematic error (bias) + random error
• How can we represent uncertainty?
• error bars (for scalar variables) (mean, standard deviation)
• ellipsoid for more than one variables (vectors) (mean, error covariance)
• ensemble of possible values
• probability density function, cost function: what kind error is probable and what kind of error is not.

Errors in your model might be due to

• errors in initial conditions
• errors in open ocean boundary conditions
• errors in atmospheric fields (wind, air temperature, ...)
• errors in bathymetry
• inappropriate parameterizations
• discretization error
• ...

Errors in your observations might be due to

• instrumental error (bias, drift, limited accuracy and precision)

Observations might not represent exactly the same as the model variables

• mismatch in resolved scale
• mismatch in resolved processes
• ...

# Errors covariance

• For Gaussian-distributed errors.
• Example: \mathbf x = \left(T_1,T_2\right) \mathbf P = \left( \begin{array}{cc} \color{red}{P_{11}} & P_{12} \\ P_{12} & \color{red}{P_{22}} \end{array} \right)
• Graphical representation: ellipsoid ( \mathbf x \mathbf P^{-1} \mathbf x = 1 )
• Error modes (EOF: empirical orthogonal functions)
• Ensemble

# Data assimilation in coastal ocean models

• Kalman filter
• Invented for spacecraft and missiles
• Known since 1960 → so problem solved?
• Assumptions of Kalman filter
• Error covariances known, no bias, linear model,...
• None are really applicable to realistic ocean models

# Ensemble simulation

1. Perturbation of uncertain model inputs (initial conditions, boundary conditions, parameters) within their range of uncertainty
• Homogeneous and isotropic perturbations
• EOF perturbations
2. Model run with perturbed forcing
3. Analysis
• global scheme
• local scheme

# Ensemble perturbation

• Homogeneous and isotropic perturbations
• time series: Matlab/Octave code randnvec (use help randnvec in Octave/Matlab)
• 2D-field: Matlab/Octave code randomfield
Generate 500 random 2D-fields representing an area of 300 km x 300 km and a spatial correlation length scale of 20 km with adequate resolution. Compute the covariance of the central grid point with all other grid points. The ensemble covariance between two random variables x and y (with zero mean) is computed as:
\mbox{cov}(x,y) = \frac{1}{K} \sum_{k=1}^{K} x_k y_k
where k is the ensemble member index. Repeat with K = 100. Describe the structures you are seeing and the differences between the case K = 100 and K = 500.

# Solution

m = 100;
n = 100;
K = 500;
q = randomfield(m,n,K,3e3,3e3,20e3);

covar = zeros(m,n);

for j = 1:n
for i = 1:m
covar(i,j) = sum(q(i,j,:) .* q(m/2,n/2,:));
end
end
covar = covar/K;

% or with without loops
covar2 = sum(q .* repmat(q(m/2,n/2,:),[m n 1]),3)/K;


# Ensemble perturbation using EOFs

Create wind perturbations using multivariate EOFs using the test data in wind.nc. The matrix \mathbf W containing the zonal u^a and meridional v^a wind components is defined as:
\begin{eqnarray} \mathbf W_{ij} &=& u^a_{ij} \\ \mathbf W_{i+m,j} &=& v^a_{ij} \end{eqnarray}
where i is the spatial index (1 \le i \le m) and j is the temporal index (1 \le j \le n) and m and n are the number of spatial grid points and time steps respectively. The size of the matrix \mathbf W is thus 2m \times n. Wind fields anomalies:
\begin{eqnarray} \overline{\mathbf{W}}_{i} &=& \frac{1}{n} \sum_{j=1}^n W_{ij} \\ \mathbf W'_{ij} &=& \mathbf W_{ij} - \overline{\mathbf{W}}_{j} \end{eqnarray}
for all i \le 2m.

# Ensemble perturbation using EOFs

A SVD decomposition, limited to the r=10 dominant eigenvectors (EOFs):
$$\mathbf W' \sim \mathbf U \mathbf \Sigma \mathbf V^T$$
where \mathbf U^T \mathbf U = \mathbf I, \mathbf V \mathbf V^T = \mathbf I and \mathbf \Sigma is a r \times r diagonal matrix. The EOFs \mathbf U are used to produce multivariate, spatial perturbations of the wind field:
$$\mathbf W^{(k)} = \mathbf W + \mathbf U \mathbf \Sigma \mathbf A^{(k)}$$
where \mathbf A^{(k)} is a r \times n matrix containing random time series and where the superscript k represents the ensemble member. The random time series (rows of \mathbf A^{(k)}) have a standard deviation of \frac{1}{3} and a temporal correlation length scale of L=7 days.

# Ensemble perturbation using EOFs

Useful functions:
• statevector_init, statevector_pack, statevector_unpack to combine different variables in a multivariate state vector using a mask.
• % a mask parameter for every variable
% create a multivariate state vector
W = statevector_pack(sv,u,v);
% clever algorithm to create Wp...
% reconstitute individual variables
[up,vp] = statevector_unpack(sv,Wp);

• svds to make a SVD decomposition limited to a reduced number of modes.

# Analysis

• Description of test case
• What is a twin experiment?
• How to define the state vector \mathbf x?
• How to define the observation vector \mathbf y^o?

# Twin experiment

• In realistic assimilation experiments we do not know the truth \mathbf x^t.
• The analyzed model solution is closer to the assimilated observations.
• But what about the part of the model state which is not observed?
• → Twin experiment

Scheme of a twin experiment:

• Run your model with initial conditions (IC), boundary conditions (BC), forcing fields (e.g. winds, air temperature,...) that you assume to be the "true" solution.
• Extract pseudo-observations from the simulation.
• Perturb your IC, BC and forcing fields.
• Based on those perturbed fields and the extracted pseudo-observation determine if you can recover the "true" solution.

# Test case

• ROMS nested in Mediterranean Ocean Forecasting System
• 1/60 degree resolution and 32 vertical levels
• Atmospheric forcings come from the limited-area model COSMO (hourly at 2.8 km resolution)
• 100 ensemble members representing a 24 h simulation
• Number of ensemble members will be limited by memory
• Model domain is defined by netCDF files Common/LigurianSea.nc

# Ocean Assimilation Kit (OAK)

• Reduced rank square root analysis
• Global and local algorithm
• Modular Fortran 90 program
• Flexible definition of state vector
• Supports arbitrary curvilinear grid
• Local algorithm parallelized with OpenMP and MPI
• NetCDF or Fortran binary files as input

# Compilation

• Edit config.mk
• OS: operating system (e.g. Linux, AIX, ...)
• FORT: Fortran compiler (e.g. gfortran, ifort, pgi, ...)
• library location (NetCDF, LAPACK, BLAS). For NetCDF:
• NETCDF_INCDIR: directory which should contain include files (netcdf.inc and netcdf.mod
• NETCDF_LIBDIR: directory which should contain the library (e.g. libnetcdf.a
• NETCDF_LIB: name of the Fortran NetCDF library and dependence (e.g. -lnetcdff -lnetcdf)
# Default compilation (using all variables in config.mk)
make
# Override some variables in config.mk
make OPENMP=on

• Always use make clean if compiler settings are changed.

# Concepts for the assimilation module

• A netCDF file can contain multiple variables (as arrays)
• A specific array in a netCDF file is referenced by:
• NetCDF_filename#NetCDF_variable
• Files can be compressed with gzip. They are decompressed when the file is accessed.
• model_result.nc.gz#temperature

• Subset of a file can be extracted using the Matlab/Octave syntax: e.g.
• model_result.nc#temperature(1:10,1:20,:)
model_result.nc#temperature(1:10,:,end)
• This is implemented in ufileformat.F90

# NetCDF files from Octave/Matlab

•     SST = gread('file.nc#temp(:,:,end)');

• Write a netCDF variable:
•     gwrite('file.nc#temp(:,:,end)',SST);

• Order of dimensions for gread, gwrite and the OAK is: longitude, latitude, depth and time (netCDF files following the CF-convention)
• List all variables in a netCDF file (shell command)
•     ncdump -h file.nc

• Order of dimensions for ncdump: time, depth, latitude and longitude (netCDF files following the CF-convention)

# Initialization file

• See template assim.init.template
• Each line: name = value
• If value is a string (such as file name), it must be in single quotes:
• Example:
• ! this is a comment
runtype = 2
Geoflow.maxU = 0.3
logfile = 'assim.log'

• List of values in brackets
• Model.variables = ['ETA','TEM','SAL']
Zones.corrLength.const = [30e3,30000,30000.]
• Lines can become very long. Better disable wrapping of long lines in your text editor (Emacs: Alt-x toggle-truncate-line; gedit: Edit > Preferences > View Tab > Text Wrapping).

# State vector

• All variables that should be corrected by assimilation
• May contain diagnostic variables
• Model.variables: vector of unique names
• Model.gridX,Y,Z: file names of coordinates
• Model.mask: land-sea mask of model grid (land: _FillValue, sea: any other value)
• Model.path: Path to be added to all file names of this section
Model.variables = [                  'zeta',         'temp',         'salt']
Model.gridX     = ['domain.nc#lon(:,:,end)','domain.nc#lon','domain.nc#lon']
Model.gridY     = ['domain.nc#lat(:,:,end)','domain.nc#lat','domain.nc#lat']
Model.gridZ     = [  'domain.nc#z(:,:,end)',  'domain.nc#z',  'domain.nc#z']
Model.mask      = [  'domain.nc#z(:,:,end)',  'domain.nc#z',  'domain.nc#z']
Model.path      = '/home/user/Data/'


Note: The coordinate arrays gridX, gridY and gridZ should have the same size than the mask. If some dimensions are missing in grid array, then the values are repeated along this dimension. For example if mask is 25x20x10, but gridX is 25x20, then the values of gridX are repeated 10 times.

# Observation vector

• All observations (usually at a given time)
• 001: is the time index.
• *: can be used to designate any time instance. For example if all observations are in the same folder then Obs*.path can be used to avoid repetitions.
• For some complicated observation types, the observation operator H can be specified as a sparse matrix.
Obs001.time      = '2010-07-06T00:30:00.00'  ! time as YYYY-MM-DDThh:mm:ss[.ss]
Obs001.path      = 'Obs/'                    ! where the file can be found
Obs001.variables  = [               'TEM']   ! name as in Model.variables
Obs001.names      = [      'temp_profile']   ! descriptive name
Obs001.gridX      = [       'obs1.nc#lon']   ! longitude
Obs001.gridY      = [       'obs1.nc#lat']   ! latitude
Obs001.gridZ      = [         'obs1.nc#z']   ! depth
Obs001.value      = [      'obs1.nc#temp']   ! value of the observations
Obs001.rmse       = [ 'obs1.nc#temp_rmse']   ! root mean square error


# Reduced order analysis

The best linear unbiased estimator (BLUE) of the model's state vector given the model forecast \mathbf x^f with error covariance \mathbf P^f and the observation \mathbf y^o with error covariance \mathbf R is given by \mathbf x^a:
\begin{eqnarray} \mathbf x^a &=& \mathbf x^f + \mathbf K \left(\mathbf y^o - \mathbf H \mathbf x^f \right) \\ \mathbf K &=& \mathbf P^f \mathbf H^T \left( \mathbf H \mathbf P^f \mathbf H^T + \mathbf R \right)^{-1} \\ \mathbf P^a &=& \mathbf P^f - \mathbf K \mathbf H \mathbf P^f \end{eqnarray}
where \mathbf H is the observation operator extracting the observed part of the state vector and \mathbf P^a the error covariance of the analysis \mathbf x^a. As many other assimilation schemes (SEEK, RRSQRT, ESSE, EnKF) we decompose \mathbf P^f in:
\mathbf P^f = \mathbf S^f {\mathbf S^f}^T
Where \mathbf S^f is a n \times r matrix. The reduced order implementation is only effective if r is small (r << n).
For an ensemble {\mathbf x^f}^{(k)} \; (k=1,\dots,r), \mathbf x^f is the ensemble average and the columns of \mathbf S^f are constructed by:
(\mathbf S^f)^{(k)} = \frac{1}{\sqrt{r-1}} \left({\mathbf x^f}^{(k)} - {\mathbf x^f} \right)
In practice, the following eigenvalue decomposition is made,
\left(\mathbf H \mathbf S^f\right)^T \mathbf R^{-1} \mathbf H \mathbf S^f = \mathbf U \mathbf \Lambda \mathbf U^T
where \mathbf U^T \mathbf U = \mathbf I and where \Lambda is diagonal. The Kalman gain \mathbf K and \mathbf S^a can be computed by:
\begin{eqnarray} \mathbf K &=& \mathbf S^f \mathbf U (1+ \mathbf \Lambda)^{-1} \mathbf U^T (\mathbf H \mathbf S^f)^T \mathbf R^{-1} \\ \mathbf S^a &=& \mathbf S^f \mathbf U (1+ \mathbf \Lambda)^{-1/2} \mathbf U^T \mathbf \Omega \end{eqnarray}
\mathbf \Omega is an orthogonal matrix chosen such that average of all columns is zero. \mathbf S^a is the square root of \mathbf P^a:
\mathbf P^a = \mathbf S^a {\mathbf S^a}^T
Based on \mathbf x^a and \mathbf S^a, an ensemble can be reconstructed:
{\mathbf x^a}^{(k)} = \mathbf x^a + {\sqrt{r-1}} \; {\mathbf S^a}^{(k)}

# Error space

• Vector space containing the error of the state vector \mathbf x^{(k)}
• Can be defined through:
1. EOF modes (columns of \mathbf S^f),i.e. mean is already subtracted
2. An ensemble of states ({\mathbf x^a}^{(k)})
ErrorSpace.dimension = 50
ErrorSpace.path      = './'
ErrorSpace.init      = ['("Ens",I3.3,"/ic.nc#zeta")','("Ens",I3.3,"/ic.nc#u")']
ErrorSpace.type      = 2 ! 1: modes; 2: ensemble

The strings in ErrorSpace.init are Fortran formats. For example for the first ensemble member, ("Ens",I3.3,"/ic.nc#zeta") will be expanded as Ens001/ic.nc#zeta

# Diagnostics

• stddevxf,stddevxa: standard deviation of the model error before and after assimilation.
• xa-xf: analysis increment.
• Hxf,Hxa: analysis/forecast at the observation location.
• yo, invsqrtR: observations used by the assimilation and their weight (\mathbf R^{-1/2}).
! same elements as state vector
Diag*.stddevxf  =  ['stddevxf.nc#zeta','stddevxf.nc#temp','stddevxf.nc#salt']
Diag*.stddevxa  =  ['stddevxa.nc#zeta','stddevxa.nc#temp','stddevxa.nc#salt']
Diag*.xf        =  ['forecast.nc#zeta','forecast.nc#temp','forecast.nc#salt']
Diag*.xa        =  [      'ic.nc#zeta',      'ic.nc#temp',      'ic.nc#salt']
Diag*.xa-xf     =  [     'inc.nc#zeta',     'inc.nc#temp',     'inc.nc#salt']

! same element as observation vector
Diag001.Hxf       = [        'xf.nc#sst']
Diag001.Hxa       = [        'xa.nc#sst']
Diag001.yo        = [        'yo.nc#sst']
Diag001.invsqrtR  = [  'invsqrtR.nc#sst']
Diag001.stddevHxf = [ 'stddevHxf.nc#sst']
Diag001.stddevHxa = [ 'stddevHxa.nc#sst']
Diag001.path   = 'Analysis001/'


# Assimilation of a point measurement

• Correction is proportional to a column of \mathbf P^f
• With a finite-size ensemble: spurious long-range correlation
Using the ensemble of the Ligurian Sea, assimilate a point measurement located at (lon = 9.5, lat = 43.2) and 3 meters depth. Choose sensible values for temperature, and root-mean square error. Repeat this with a point at (lon = 9, lat = 44.2). Describe the differences of the increments \mathbf x^a - \mathbf x^f and error standard deviation of the forecast and the analysis. Useful files: assim.init.template and LigurianSea.nc. A model output can be used as mask.
assim-gfortran-single assimTS.init 1


# Global versus local assimilation

• Long-range correlation: error in large-scale processes air sea exchange
• Short-range correlation: small-scale dynamical processes (like eddies)
• With a finite-size ensemble: some times spurious long-range correlation

# Error reduction

• Error reduction of standard deviation error as measured by \mbox{stddev}(\mathbf x^f_i)/\mbox{stddev}(\mathbf x^a_i)

# Local assimilation

• State vector is partitioned into zones (e.g. water columns)
• Assimilation is performed in each zone independently using a sub-set of observations
• Weight of observation (\mathbf R^{-1}) is artificially increased for observations far away from the considered zone.
• since every zone is independent: easily done in parallel (OpenMP or MPI)
! possible values of "schemetype" are:
! 0 = global assimilation (default)
! 1 = local assimilation

schemetype = 0

Definition of zones:
!Model.variables =       [                  'ETA',         'TEM',         'SAL']
Zones.partition        = ['grid.nc#part(:,:,end)','grid.nc#part','grid.nc#part']
Zones.corrLength.const = [                   30e3,          30e3,          30e3]
Zones.maxLength.const  = [                 2000e3,        2000e3,        2000e3]
Zones.path = 'Common/'

All elements of those arrays are integer values. Variables are in the same zone if the same integer is associated to them.

# Local assimilation

Repeat the point measurement assimilation with local scheme and describe the differences of the increments \mathbf x^a - \mathbf x^f. Use the script gen_part to create the partition file from the mask.
assim-gfortran-single-openmp assimTS.init 1


# SST Assimilation

• Use a state vector composed by temperature and salinity (ic.nc#temp, ic.nc#salt)
• Extract SST from ensemble member 100
• Add noise (spatially uncorrelated noise) with standard deviation of 0.5 °C
• Assimilate this extracted observation (but without using the ensemble member 100)
• Assess the impact of the assimilation on temperature and other variables (script twinexp_compareTS).
• Repeat with spatially correlated noise to be added to the observations.
Describe the differences of the increments \mathbf x^a - \mathbf x^f.
Variable RMS(\mathbf x^f,\mathbf x^t) RMS(\mathbf x^a,\mathbf x^t)
Temperature 0.080 0.066
Salinity 0.0063 0.0052

Note: RMS is here a volume average

# SST Assimilation

• In this case, forecast already quite close to observations.
• Global analysis brings model closer to observations but some structures are not changed (especially if number of ensemble members is low).
• Local analysis allows the model to be closer to the observations.
• What could be the drawback of local analysis?

# 4D Extension

Including the time dimension:
• State vector \mathbf x include n model states at different times:
\mathbf x := \left(\mathbf x(1),\mathbf x(2),\dots,\mathbf x(n) \right)^T
• \mathbf x may now be called a model "trajectory".
• Similarly, the observation vector \mathbf y^o includes all observations within the corresponding time interval.
• Conceptually, the assimilation scheme becomes a smoother.
Including the forcing fields:
• The forcing fields (wind, boundary conditions, ...) can also be included in the 4D vector:
\mathbf x := \left(\mathbf x(1),\mathbf x(2),\dots,\mathbf x(n),\mbox{wind},\dots \right)^T
• The model can be re-run to give a model result consistent with the updated forcings.

# Assimilation of HF radar surface currents

• Radial currents are extracted by:
u_{\mathrm {HF}} = \frac{k_b}{1 - \exp(-k_b h)} \int_{-h}^0 \mathbf u(z) \cdot \mathbf e_r \exp(k_b z) dz
• k_b = \frac{2 \pi}{\lambda_b}
• \mathbf e_r is the unit vector pointing in the direction opposite to the location of the HF radar site
• positive values: current away from the system
• essentially represent an average over the upper meters.
• Smoothed in the azimuthal direction by a diffusion operator to filter scales smaller than 6 degrees

# Assimilation of HF radar surface currents

Radial currents on 2010-07-06 18:30 relative to the Palmaria site: WERA (left) and ROMS without assimilation (right).

# Surface current assimilation (exercise)

• Use surface current from ensemble member 100 (in folder HFRadar)
• Add noise (spatially uncorrelated noise) with standard deviation of 0.05 m/s
• Use initialization file assim4D.init
• Assimilate this extracted observation (but without using the ensemble member 100)
• Assess the impact of the assimilation (script twinexp_compare).
• Repeat with spatially correlated noise to be added to the observations.
Describe the differences of the increments \mathbf x^a - \mathbf x^f.

# Surface current assimilation (exercise)

Variable RMS(\mathbf x^f,\mathbf x^t) RMS(\mathbf x^a,\mathbf x^t)
Temperature 0.080 0.067
Salinity 0.0063 0.0057
u-velocity 0.010 0.0097
v-velocity 0.010 0.0095
u-wind 0.61 0.40
v-wind 0.60 0.54

# Challenges for data assimilation

• Incorporate new data types from remote sensing and in situ platforms (HF radars, gliders, SSS, coastal altimetry, ...).
• Which processes are resolved, how to handle high temporal resolution? (formulation of the observation operator).
• How does the error variance vary spatially and how is it correlated in space?) and bias estimation (estimation of observation error covariance)
• Also potential for improvement for classical'' observation types.
• Adapt assimilation schemes for complex coupled models (hydrodynamic models + biogeochemical models + sea-ice models ...).
• Non-linear and non-Gaussian systems
• Representation of error covariance (or pdf) between variables of different sub-models.

# Challenges for data assimilation (cont)

• Ensemble simulations:
• identification of the origin of model errors (e.g. uncertain forcing fields, parameters,...),
• parametrization those errors,
• potential to provide meaningful error estimates on model products (and derived products).
• What do we learn form data assimilation to:
• improve our models (reducing systematic and random errors),
• identify errors in our forcing fields (atmospheric parameters, open-ocean boundary conditions,...), and
• enhance the observational network?
• Develop robust and efficient methods suitable for operational use