Alexander Barth,
<a.barth@ulg.ac.be>

This presentation is on-line at http://tinyurl.com/ocean-assim

Scripts are available at http://modb.oce.ulg.ac.be/mediawiki/index.php/Ocean_Assimilation_Kit

Data assimilation aims to combine *model results* and *observation* in clever way.

*state vector*\mathbf x_n: model results at a given time t_n*model*f_n: model results at a given time t_n*observation vector*\mathbf y^o_n: all observations at a given time t_n

\mathbf x_{n+1} = f_n(\mathbf x_n)

- Neither the model nor the observations are perfect.
- Both have
*errors*(uncertainty). - error = systematic error (bias) + random error
- How can we represent uncertainty?
- error bars (for scalar variables) (mean, standard deviation)
- ellipsoid for more than one variables (vectors) (mean, error covariance)
- ensemble of possible values
- probability density function, cost function: what kind error is probable and what kind of error is not.

- error bars (for scalar variables) (mean, standard deviation)

Errors in your model might be due to

- errors in initial conditions
- errors in open ocean boundary conditions
- errors in atmospheric fields (wind, air temperature, ...)
- errors in bathymetry
- inappropriate parameterizations
- discretization error
- ...

Errors in your observations might be due to

- instrumental error (bias, drift, limited accuracy and precision)

Observations might not represent exactly the same as the model variables

- mismatch in resolved scale
- mismatch in resolved processes
- ...

- For Gaussian-distributed errors.
- Example: \mathbf x = \left(T_1,T_2\right) \mathbf P = \left( \begin{array}{cc} \color{red}{P_{11}} & P_{12} \\ P_{12} & \color{red}{P_{22}} \end{array} \right)
- Graphical representation: ellipsoid ( \mathbf x \mathbf P^{-1} \mathbf x = 1 )
- Error modes (EOF: empirical orthogonal functions)
- Ensemble

- Kalman filter
- Invented for spacecraft and missiles
- Known since 1960 → so problem solved?

- Assumptions of Kalman filter
- Error covariances known, no bias, linear model,...
- None are really applicable to realistic ocean models

- Perturbation of uncertain model inputs (initial conditions, boundary conditions, parameters) within their range of uncertainty
- Homogeneous and isotropic perturbations
- EOF perturbations

- Model run with perturbed forcing
- Analysis
- global scheme
- local scheme

- Homogeneous and isotropic perturbations
- time series: Matlab/Octave code
`randnvec`

(use`help randnvec`

in Octave/Matlab) - 2D-field: Matlab/Octave code
`randomfield`

- time series: Matlab/Octave code

Generate 500 random 2D-fields representing an area of 300 km x 300 km and a spatial correlation length scale of 20 km with adequate resolution. Compute the covariance of the central grid point with all other grid points. The ensemble covariance between two random variables x and y (with zero mean) is computed as:

\mbox{cov}(x,y) = \frac{1}{K} \sum_{k=1}^{K} x_k y_k

where k is the ensemble member index. Repeat with K = 100. Describe the structures you are seeing and the differences between the case K = 100 and K = 500.
m = 100; n = 100; K = 500; q = randomfield(m,n,K,3e3,3e3,20e3); covar = zeros(m,n); for j = 1:n for i = 1:m covar(i,j) = sum(q(i,j,:) .* q(m/2,n/2,:)); end end covar = covar/K; % or with without loops covar2 = sum(q .* repmat(q(m/2,n/2,:),[m n 1]),3)/K;

Create wind perturbations using multivariate EOFs using the test data in

`wind.nc`

.
The matrix \mathbf W containing the zonal u^a and meridional v^a wind components is defined as:
\begin{eqnarray}
\mathbf W_{ij} &=& u^a_{ij} \\
\mathbf W_{i+m,j} &=& v^a_{ij}
\end{eqnarray}

where i is the spatial index (1 \le i \le m) and j is the temporal index (1 \le j \le n) and m and n are the number of spatial grid points and time steps respectively. The size of the matrix \mathbf W is thus 2m \times n.
Wind fields anomalies:
\begin{eqnarray}
\overline{\mathbf{W}}_{i} &=& \frac{1}{n} \sum_{j=1}^n W_{ij} \\
\mathbf W'_{ij} &=& \mathbf W_{ij} - \overline{\mathbf{W}}_{j}
\end{eqnarray}

for all i \le 2m.
A SVD decomposition, limited to the r=10 dominant eigenvectors (EOFs):

\begin{equation}
\mathbf W' \sim \mathbf U \mathbf \Sigma \mathbf V^T
\end{equation}

where \mathbf U^T \mathbf U = \mathbf I, \mathbf V \mathbf V^T = \mathbf I and \mathbf \Sigma is a r \times r diagonal matrix. The EOFs \mathbf U are used to produce multivariate, spatial perturbations of the wind field:
\begin{equation}
\mathbf W^{(k)} = \mathbf W + \mathbf U \mathbf \Sigma \mathbf A^{(k)}
\end{equation}

where \mathbf A^{(k)} is a r \times n matrix containing random time series and where the superscript k represents the ensemble member.
The random time series (rows of \mathbf A^{(k)}) have a standard deviation of \frac{1}{3} and a temporal correlation length scale of L=7 days.
`statevector_init`

,`statevector_pack`

,`statevector_unpack`

to combine different variables in a multivariate state vector using a mask.

% a mask parameter for every variable sv = statevector_init(mask,mask); % create a multivariate state vector W = statevector_pack(sv,u,v); % clever algorithm to create Wp... % reconstitute individual variables [up,vp] = statevector_unpack(sv,Wp);

`svds`

to make a SVD decomposition limited to a reduced number of modes.- Description of test case
- What is a twin experiment?
- How to define the state vector \mathbf x?
- How to define the observation vector \mathbf y^o?

- In realistic assimilation experiments we do not know the truth \mathbf x^t.
- The analyzed model solution is closer to the assimilated observations.
- But what about the part of the model state which is not observed?
- → Twin experiment

Scheme of a twin experiment:

- Run your model with
*initial conditions*(IC),*boundary conditions*(BC),*forcing fields*(e.g. winds, air temperature,...) that you assume to be the "true" solution. - Extract
*pseudo-observations*from the simulation. - Perturb your IC, BC and forcing fields.
- Based on those
*perturbed fields*and the*extracted pseudo-observation*determine if you can recover the "true" solution.

- ROMS nested in Mediterranean Ocean Forecasting System
- 1/60 degree resolution and 32 vertical levels
- Atmospheric forcings come from the limited-area model COSMO (hourly at 2.8 km resolution)
- 100 ensemble members representing a 24 h simulation
- Number of ensemble members will be limited by memory
- Model domain is defined by netCDF files
`Common/LigurianSea.nc`

- Reduced rank square root analysis
- Global and local algorithm
- Modular Fortran 90 program
- Flexible definition of state vector
- Supports arbitrary curvilinear grid
- Local algorithm parallelized with OpenMP and MPI
- NetCDF or Fortran binary files as input

- Edit
`config.mk`

- Adapt:
`OS`

: operating system (e.g.*Linux*, AIX, ...)`FORT`

: Fortran compiler (e.g.*gfortran*, ifort, pgi, ...)- library location (NetCDF, LAPACK, BLAS). For NetCDF:
`NETCDF_INCDIR`

: directory which should contain include files (`netcdf.inc`

and`netcdf.mod`

`NETCDF_LIBDIR`

: directory which should contain the library (e.g.`libnetcdf.a`

`NETCDF_LIB`

: name of the Fortran NetCDF library and dependence (e.g.`-lnetcdff -lnetcdf`

)

# Default compilation (using all variables in config.mk) make # Override some variables in config.mk make OPENMP=on

- Always use
`make clean`

if compiler settings are changed.

- A netCDF file can contain multiple variables (as arrays)
- A specific array in a netCDF file is referenced by:

NetCDF_filename#NetCDF_variable

model_result.nc.gz#temperature

model_result.nc#temperature(1:10,1:20,:) model_result.nc#temperature(1:10,:,end)

`ufileformat.F90`

- Read a netCDF variable:

SST = gread('file.nc#temp(:,:,end)');

gwrite('file.nc#temp(:,:,end)',SST);

ncdump -h file.nc

- See template
`assim.init.template`

- Each line:
`name = value`

- If value is a string (such as file name), it must be in single quotes:
- Example:
- List of values in brackets
- Lines can become very long. Better disable wrapping of long lines in your text editor (Emacs: Alt-x toggle-truncate-line; gedit: Edit > Preferences > View Tab > Text Wrapping).

! this is a comment runtype = 2 Geoflow.maxU = 0.3 logfile = 'assim.log'

Model.variables = ['ETA','TEM','SAL'] Zones.corrLength.const = [30e3,30000,30000.]

- All variables that should be corrected by assimilation
- May contain diagnostic variables

`Model.variables`

: vector of unique names`Model.gridX,Y,Z`

: file names of coordinates`Model.mask`

: land-sea mask of model grid (land: _FillValue, sea: any other value)`Model.path`

: Path to be added to all file names of this section

Model.variables = [ 'zeta', 'temp', 'salt'] Model.gridX = ['domain.nc#lon(:,:,end)','domain.nc#lon','domain.nc#lon'] Model.gridY = ['domain.nc#lat(:,:,end)','domain.nc#lat','domain.nc#lat'] Model.gridZ = [ 'domain.nc#z(:,:,end)', 'domain.nc#z', 'domain.nc#z'] Model.mask = [ 'domain.nc#z(:,:,end)', 'domain.nc#z', 'domain.nc#z'] Model.path = '/home/user/Data/'

Note: The coordinate arrays gridX, gridY and gridZ should have the same size than the mask. If some dimensions are missing in grid array, then the values are repeated along this dimension. For example if mask is 25x20x10, but gridX is 25x20, then the values of gridX are repeated 10 times.

- All observations (usually at a given time)
`001`

: is the time index.`*`

: can be used to designate any time instance. For example if all observations are in the same folder then`Obs*.path`

can be used to avoid repetitions.- For some complicated observation types, the observation operator H can be specified as a sparse matrix.

Obs001.time = '2010-07-06T00:30:00.00' ! time as YYYY-MM-DDThh:mm:ss[.ss] Obs001.path = 'Obs/' ! where the file can be found Obs001.variables = [ 'TEM'] ! name as in Model.variables Obs001.names = [ 'temp_profile'] ! descriptive name Obs001.gridX = [ 'obs1.nc#lon'] ! longitude Obs001.gridY = [ 'obs1.nc#lat'] ! latitude Obs001.gridZ = [ 'obs1.nc#z'] ! depth Obs001.value = [ 'obs1.nc#temp'] ! value of the observations Obs001.mask = [ 'obs1.nc#temp'] ! mask of the observations (FillValue_) Obs001.rmse = [ 'obs1.nc#temp_rmse'] ! root mean square error

\begin{eqnarray}
\mathbf x^a &=& \mathbf x^f + \mathbf K
\left(\mathbf y^o - \mathbf H \mathbf x^f \right) \\
\mathbf K &=& \mathbf P^f \mathbf H^T
\left( \mathbf H \mathbf P^f \mathbf H^T + \mathbf R \right)^{-1}
\\
\mathbf P^a &=&
\mathbf P^f - \mathbf K \mathbf H \mathbf P^f
\end{eqnarray}

where \mathbf H is the observation operator extracting the observed
part of the state vector and \mathbf P^a the error covariance of the
analysis \mathbf x^a. As many other assimilation schemes (SEEK,
RRSQRT, ESSE, EnKF) we decompose \mathbf P^f in:
\mathbf P^f = \mathbf S^f {\mathbf S^f}^T

Where \mathbf S^f is a n \times r matrix. The reduced order
implementation is only effective if r is small (r << n).
For an ensemble {\mathbf x^f}^{(k)} \; (k=1,\dots,r), \mathbf x^f is the ensemble average and the columns of \mathbf S^f are constructed by:

(\mathbf S^f)^{(k)} = \frac{1}{\sqrt{r-1}} \left({\mathbf x^f}^{(k)} - {\mathbf x^f} \right)

In practice, the following eigenvalue decomposition is made,
\left(\mathbf H \mathbf S^f\right)^T \mathbf R^{-1} \mathbf H \mathbf S^f
= \mathbf U \mathbf \Lambda \mathbf U^T

where \mathbf U^T \mathbf U = \mathbf I and where \Lambda is
diagonal. The Kalman gain \mathbf K and \mathbf S^a can be computed by:
\begin{eqnarray}
\mathbf K &=& \mathbf S^f \mathbf U (1+ \mathbf \Lambda)^{-1} \mathbf U^T (\mathbf H \mathbf S^f)^T \mathbf R^{-1} \\
\mathbf S^a &=& \mathbf S^f \mathbf U (1+ \mathbf \Lambda)^{-1/2} \mathbf U^T \mathbf \Omega
\end{eqnarray}

\mathbf \Omega is an orthogonal matrix chosen such that average of all columns is zero.
\mathbf S^a is the square root of \mathbf P^a:
\mathbf P^a = \mathbf S^a {\mathbf S^a}^T

Based on \mathbf x^a and \mathbf S^a, an ensemble can be reconstructed:
{\mathbf x^a}^{(k)} = \mathbf x^a + {\sqrt{r-1}} \; {\mathbf S^a}^{(k)}

- Vector space containing the error of the state vector \mathbf x^{(k)}
- Can be defined through:
- EOF modes (columns of \mathbf S^f),i.e. mean is already subtracted
- An ensemble of states ({\mathbf x^a}^{(k)})

ErrorSpace.dimension = 50 ErrorSpace.path = './' ErrorSpace.init = ['("Ens",I3.3,"/ic.nc#zeta")','("Ens",I3.3,"/ic.nc#u")'] ErrorSpace.type = 2 ! 1: modes; 2: ensemble

`ErrorSpace.init`

are Fortran formats. For example for the first ensemble member, `("Ens",I3.3,"/ic.nc#zeta")`

will be expanded as `Ens001/ic.nc#zeta`

`stddevxf,stddevxa`

: standard deviation of the model error before and after assimilation.`xa-xf`

: analysis increment.`Hxf,Hxa`

: analysis/forecast at the observation location.`yo, invsqrtR`

: observations used by the assimilation and their weight (\mathbf R^{-1/2}).

! same elements as state vector Diag*.stddevxf = ['stddevxf.nc#zeta','stddevxf.nc#temp','stddevxf.nc#salt'] Diag*.stddevxa = ['stddevxa.nc#zeta','stddevxa.nc#temp','stddevxa.nc#salt'] Diag*.xf = ['forecast.nc#zeta','forecast.nc#temp','forecast.nc#salt'] Diag*.xa = [ 'ic.nc#zeta', 'ic.nc#temp', 'ic.nc#salt'] Diag*.xa-xf = [ 'inc.nc#zeta', 'inc.nc#temp', 'inc.nc#salt'] ! same element as observation vector Diag001.Hxf = [ 'xf.nc#sst'] Diag001.Hxa = [ 'xa.nc#sst'] Diag001.yo = [ 'yo.nc#sst'] Diag001.invsqrtR = [ 'invsqrtR.nc#sst'] Diag001.stddevHxf = [ 'stddevHxf.nc#sst'] Diag001.stddevHxa = [ 'stddevHxa.nc#sst'] Diag001.path = 'Analysis001/'

- Correction is proportional to a column of \mathbf P^f
- With a finite-size ensemble: spurious long-range correlation

Using the ensemble of the Ligurian Sea, assimilate a point measurement located at (lon = 9.5, lat = 43.2) and 3 meters depth. Choose sensible values for temperature, and root-mean square error.
Repeat this with a point at (lon = 9, lat = 44.2).
Describe the differences of the increments \mathbf x^a - \mathbf x^f and error standard deviation of the forecast and the analysis. Useful files:

`assim.init.template`

and `LigurianSea.nc`

. A model output can be used as mask.
assim-gfortran-single assimTS.init 1

- Long-range correlation: error in large-scale processes air sea exchange
- Short-range correlation: small-scale dynamical processes (like eddies)
- With a finite-size ensemble: some times spurious long-range correlation

- Error reduction of standard deviation error as measured by \mbox{stddev}(\mathbf x^f_i)/\mbox{stddev}(\mathbf x^a_i)

- State vector is partitioned into zones (e.g. water columns)
- Assimilation is performed in each zone independently using a sub-set of observations
- Weight of observation (\mathbf R^{-1}) is artificially increased for observations far away from the considered zone.
- since every zone is independent: easily done in parallel (OpenMP or MPI)

! possible values of "schemetype" are: ! 0 = global assimilation (default) ! 1 = local assimilation schemetype = 0

!Model.variables = [ 'ETA', 'TEM', 'SAL'] Zones.partition = ['grid.nc#part(:,:,end)','grid.nc#part','grid.nc#part'] Zones.corrLength.const = [ 30e3, 30e3, 30e3] Zones.maxLength.const = [ 2000e3, 2000e3, 2000e3] Zones.path = 'Common/'

Repeat the point measurement assimilation with local scheme and
describe the differences of the increments \mathbf x^a - \mathbf x^f.
Use the script

`gen_part`

to create the partition file from the mask.
assim-gfortran-single-openmp assimTS.init 1

- Use a state vector composed by temperature and salinity (
`ic.nc#temp`

,`ic.nc#salt`

) - Extract SST from ensemble member 100
- Add noise (spatially uncorrelated noise) with standard deviation of 0.5 °C
- Assimilate this extracted observation (but without using the ensemble member 100)
- Assess the impact of the assimilation on temperature and other variables (script
`twinexp_compareTS`

). - Repeat with spatially correlated noise to be added to the observations.

Variable | RMS(\mathbf x^f,\mathbf x^t) | RMS(\mathbf x^a,\mathbf x^t) |
---|---|---|

Temperature | 0.080 | 0.066 |

Salinity | 0.0063 | 0.0052 |

Note: RMS is here a volume average

- In this case, forecast already quite close to observations.
- Global analysis brings model closer to observations but some structures are not changed (especially if number of ensemble members is low).
- Local analysis allows the model to be closer to the observations.
- What could be the drawback of local analysis?

- State vector \mathbf x include n model states at different times:
\mathbf x := \left(\mathbf x(1),\mathbf x(2),\dots,\mathbf x(n) \right)^T
- \mathbf x may now be called a model "trajectory".
- Similarly, the observation vector \mathbf y^o includes all observations within the corresponding time interval.
- Conceptually, the assimilation scheme becomes a smoother.

- The forcing fields (wind, boundary conditions, ...) can also be included in the 4D vector:
\mathbf x := \left(\mathbf x(1),\mathbf x(2),\dots,\mathbf x(n),\mbox{wind},\dots \right)^T
- The model can be re-run to give a model result consistent with the updated forcings.

- Radial currents are extracted by:
- k_b = \frac{2 \pi}{\lambda_b}
- \mathbf e_r is the unit vector pointing in the direction opposite to the location of the HF radar site
- positive values: current away from the system
- essentially represent an average over the upper meters.

u_{\mathrm {HF}} = \frac{k_b}{1 - \exp(-k_b h)} \int_{-h}^0 \mathbf u(z) \cdot \mathbf e_r \exp(k_b z) dz - Smoothed in the azimuthal direction by a diffusion operator to filter scales smaller than 6 degrees

Radial currents on 2010-07-06 18:30 relative to the Palmaria site: WERA (left) and ROMS without assimilation (right).

- Use surface current from ensemble member 100 (in folder
`HFRadar`

) - Add noise (spatially uncorrelated noise) with standard deviation of 0.05 m/s
- Use initialization file
`assim4D.init`

- Assimilate this extracted observation (but without using the ensemble member 100)
- Assess the impact of the assimilation (script
`twinexp_compare`

). - Repeat with spatially correlated noise to be added to the observations.

Variable | RMS(\mathbf x^f,\mathbf x^t) | RMS(\mathbf x^a,\mathbf x^t) |
---|---|---|

Temperature | 0.080 | 0.067 |

Salinity | 0.0063 | 0.0057 |

u-velocity | 0.010 | 0.0097 |

v-velocity | 0.010 | 0.0095 |

u-wind | 0.61 | 0.40 |

v-wind | 0.60 | 0.54 |

- Incorporate
*new data types*from remote sensing and in situ platforms (HF radars, gliders, SSS, coastal altimetry, ...). - Which processes are resolved, how to handle high temporal resolution? (formulation of the observation operator).
- How does the error variance vary spatially and how is it correlated in space?) and bias estimation (estimation of observation error covariance)
- Also potential for improvement for ``classical'' observation types.
- Adapt assimilation schemes for complex
*coupled models*(hydrodynamic models + biogeochemical models + sea-ice models ...). - Non-linear and non-Gaussian systems
- Representation of error covariance (or pdf) between variables of different sub-models.

- Ensemble simulations:
- identification of the origin of model errors (e.g. uncertain forcing fields, parameters,...),
- parametrization those errors,
- potential to provide meaningful
*error estimates*on model products (and derived products).

- What do we learn form data assimilation to:
*improve our models*(reducing systematic and random errors),- identify
*errors in our forcing fields*(atmospheric parameters, open-ocean boundary conditions,...), and - enhance the
*observational network*?

- Develop
*robust*and*efficient*methods suitable for operational use