Published by De Gruyter, July 7, 2018

Tucker Tensor Analysis of Matérn Functions in Spatial Statistics

  • Alexander Litvinenko, David Keyes, Venera Khoromskaia, Boris N. Khoromskij and Hermann G. Matthies

Abstract

In this work, we describe advanced numerical tools for working with multivariate functions and for the analysis of large data sets. These tools drastically reduce the required computing time and storage cost, and therefore allow us to consider much larger data sets or finer meshes. Covariance matrices are crucial in spatio-temporal statistical tasks, but are often very expensive to compute and store, especially in three dimensions. Therefore, we approximate covariance functions by cheap surrogates in a low-rank tensor format. We apply the Tucker and canonical tensor decompositions to a family of Matérn- and Slater-type functions with varying parameters and demonstrate numerically that their approximations exhibit exponentially fast convergence. We prove the exponential convergence of the Tucker and canonical approximations in the tensor rank parameters. Several statistical operations are performed in this low-rank tensor format, including evaluating the conditional covariance matrix, spatially averaged estimation variance, computing a quadratic form, determinant, trace, log-likelihood, inverse, and Cholesky decomposition of a large covariance matrix. Low-rank tensor approximations reduce the computing and storage costs substantially. For example, the storage cost is reduced from an exponential O(n^d) to a linear scaling O(drn), where d is the spatial dimension, n is the number of mesh points in one direction, and r is the tensor rank. Prerequisites for applicability of the proposed techniques are the assumptions that the data, locations, and measurements lie on a tensor (axes-parallel) grid and that the covariance function depends only on the distance ‖x − y‖.

MSC 2010: 60H15; 60H35; 65N25

1 Introduction

Nowadays it is very common to work with large spatial data sets [64, 15, 62, 44, 61, 52], for instance, satellite data collected over a very large area (e.g., the data collected by the National Center for Atmospheric Research, USA, https://www.earthsystemgrid.org/). Such data can also come from a computer simulation code as the solution of a multiparametric equation (e.g., the Weather Research and Forecasting model, https://www.mmm.ucar.edu/weather-research-and-forecasting-model), or from sensor data gathered from multiple sources. Typical operations in spatial statistics, such as evaluating the spatially averaged estimation variance, computing quadratic forms of the conditional covariance matrix, or maximizing the likelihood function [62], require high computing power and time. Our motivation for using low-rank tensor techniques is that operations on advanced matrix formats, such as hierarchical, low-rank, and sparse matrices, are limited by their high computational costs, especially in three dimensions and for a large number of observations.

A tensor can be simply defined as a high-order matrix, where multi-indices are used instead of indices (see Section 3 and equation (3.1) for a rigorous definition). One way to obtain a tensor from a vector or matrix is to reshape it. For example, assume that v ∈ ℝ^{10^6} is a vector. We can reshape it into a matrix of size 10^3 × 10^3, a tensor of order 3 of size 10^2 × 10^2 × 10^2, or a tensor of order 6 of size 10 × ⋯ × 10 (6 times). Each element of such a six-dimensional hypercube is addressed by the multi-index α = (α_1,…,α_6). The obtained tensors contain not only rows and columns, but also slices and fibers [40, 41, 10]. These slices and fibers can be analyzed for linear dependences, super symmetry, or sparsity and may result in strong data compression. Another difference between tensors and matrices is that a matrix (obtained, for instance, after the discretization of a kernel c(x, y) = c(|x − y|)) separates a point x = (x_1,…,x_d) ∈ ℝ^d from a point y = (y_1,…,y_d) ∈ ℝ^d, whereas the corresponding tensor (depending on the tensor format) separates x_1 − y_1 from x_2 − y_2, x_3 − y_3, …, x_d − y_d. This implies that tensors may have not just one rank, like a matrix, but many. Therefore, we speak about tensor ranks rather than a matrix rank. In this work, we consider two very common tensor formats: the canonical (denoted CP) and the Tucker format (see Section 3).
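To make the reshaping example concrete, here is a minimal NumPy sketch (an illustrative example, not the authors' code) that reshapes a vector of length 10^6 into a matrix, an order-3 tensor, and an order-6 tensor; the entries are unchanged, only the indexing changes.

```python
import numpy as np

# Reshape a length-10^6 vector into higher-order arrays; only the indexing changes.
v = np.arange(10**6, dtype=float)
M  = v.reshape(10**3, 10**3)            # a 10^3 x 10^3 matrix
T3 = v.reshape(10**2, 10**2, 10**2)     # an order-3 tensor of size 10^2 x 10^2 x 10^2
T6 = v.reshape((10,) * 6)               # an order-6 tensor of size 10 x ... x 10
alpha = (3, 1, 4, 1, 5, 9)              # a multi-index addressing one element of T6
print(M.shape, T3.shape, T6.shape, T6[alpha])
```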

Low-rank tensor methods can be gainfully combined with other data-compression techniques in low dimensions. For example, a three-dimensional function can be approximated as a sum of tensor products of one-dimensional functions. The usual matrix techniques can then be applied to those one-dimensional functions.

To be more concrete, we consider a relatively wide class of Matérn covariance functions. We demonstrate how to approximate Matérn covariance matrices in a low-rank tensor format, and then how to perform typical Kriging and spatial statistics operations in this tensor format. Matérn covariance matrices typically depend on three to five unknown hyper-parameters, such as the smoothness, three covariance lengths (in a three-dimensional anisotropic case), and the variance. We study the dependence of the tensor ranks and approximation errors on these parameters. Splitting the spatial variables via low-rank techniques reduces the computing cost of a matrix-vector product from O(N^2) to O(dr^2 n^2) FLOPs, where d is the spatial dimension, r is the tensor rank, and n is the number of mesh points along the longest edge of the computational domain. For simplicity, we assume that N = n^d (e.g., d = 4 for a time-space problem in three dimensions). Other motivating factors for applying low-rank tensor techniques include the following:

  1. The storage cost is reduced from O(n^d) to O(drn) or, depending on the tensor format, to O(drn + r^d), where d > 1.

  2. The low-rank tensor technique allows us to compute not only the matrix-vector product, but also the inverse C^{-1}, the square root C^{1/2}, the matrix exponential exp(C), trace(C), det(C), and the likelihood function.

  3. The low-rank tensor approximation is relatively new, but already a well-studied technique with free software libraries available.

  4. The approximation accuracy is fully controlled by the tensor rank. The full rank gives an exact representation.

  5. Low-rank techniques are either faster than the Fourier transform (O(drn) vs. O(n^d log n^d)) or can be efficiently combined with it [51, 13].

General limitations of the tensor technique are the following:

  1. It could be time consuming to compute a low-rank tensor decomposition.

  2. It requires an axes-parallel mesh.

  3. Theoretical estimates exist only for functions depending on ‖x − y‖ (although more general functions have low-rank representations in practice).

During the last few years, there has been great interest in numerical methods for representing and approximating large covariance matrices [44, 54, 56, 51, 1, 2, 43]. Low-rank tensors were previously applied to accelerate Kriging and spatial design by orders of magnitude [51]. The covariance matrix under consideration was assumed to be circulant, and its first column had a low-rank decomposition. Therefore, the d-dimensional Fourier transform could be applied, drastically reducing the storage and computing costs.

The maximum likelihood estimator was computed for parameter fitting given Gaussian observations with a Matérn covariance matrix [47]. The presented framework for unstructured observations in two spatial dimensions allowed for an evaluation of the log-likelihood and its gradient with computational complexity O(n^{3/2}). The ℋ-matrix techniques [19, 22, 21] provide an efficient data-sparse approximation of differential and integral operators in ℝ^d, d = 1, 2, 3. ℋ-matrices are very robust for approximating the covariance matrix [38, 56, 2, 26, 1, 4], its inverse [1], and its Cholesky decomposition [38, 44, 43], but can also be expensive, especially for large n in three dimensions. Namely, the complexity in three dimensions is C k^{d-1} N log N, where N = n^d, d = 3, k ≪ n is the rank, and C is a large constant that scales exponentially in the dimension d, see [22]. Thus, the ℋ-matrix techniques scale exponentially in the dimension. Therefore, more efficient methods for fast matrix linear algebra operations are still needed.

The key idea is to compute a low-rank decomposition not of the covariance function (which could be hard), but of its analytically known spectral density (which could be a much easier object), and then apply the inverse Fourier transform to the obtained low-rank components. The Fourier transform of the Matérn covariance function is known analytically as a Hilbert tensor. This Hilbert tensor can be decomposed numerically in a low-rank tensor format. Both the d-dimensional Fourier transform and its inverse have canonical (CP) Kronecker rank 1. Therefore, the inverse Fourier transform does not change the tensor rank of its argument. By applying the inverse Fourier transform to the low-rank tensor, we obtain a low-rank approximation of the initial covariance matrix, which can be further used in the Kalman filter update, Karhunen–Loève expansion, Bayesian update, and Kriging.

The structure of the paper is as follows. In Section 2, we list typical tasks from statistics that motivate us to use low-rank tensor techniques and define the Matérn covariance functions and their Fourier transforms. Section 3 is devoted to low-rank tensor decompositions. Sections 3.4, 3.5 and 3.6 contain the main theoretical contribution of this work: we present low-rank tensor techniques, separate radial basis functions using the Laplace transform and the sinc quadrature, give estimates of the approximation error, convergence rate, and tensor rank, and prove the existence of a low-rank approximation of a Matérn function. Section 4 contains another important contribution of this work, namely, the solutions to typical statistical tasks in the low-rank tensor format.

2 Motivation

2.1 Problem Settings in Spatial Statistics

Below, we formulate five tasks. These computational tasks are very common and important in statistics. A fast and efficient solution of these tasks will help to solve many real-world problems, such as weather prediction, moisture modeling, and optimal design in geostatistics.

Task 1: Approximate a Matérn covariance function in a low-rank tensor format.

The covariance function c(x, y), x = (x_1,…,x_d), y = (y_1,…,y_d), is discretized on a tensor grid with N mesh points, N = n^d, d ≥ 1, and ε > 0. The task is to find the following decomposition into one-dimensional functions:

\Big\| c(\mathbf{x},\mathbf{y}) - \sum_{i=1}^{r}\prod_{\mu=1}^{d} c_{i\mu}(x_\mu,y_\mu) \Big\| \le \varepsilon

for some given ε>0. Alternatively, we may look for factors 𝐂iμ such that

\Big\| \mathbf{C} - \sum_{i=1}^{r}\bigotimes_{\mu=1}^{d} \mathbf{C}_{i\mu} \Big\| \le \varepsilon.

Here, the matrices C_{iμ} correspond to the one-dimensional covariance functions c_{iμ}(x_μ, y_μ) in the direction μ.

Task 2: Computing the square root of C.

The square root 𝐂12 of the covariance matrix 𝐂 is needed in order to generate random fields and processes. It is also used in the Kalman filter update.

Task 3: Kriging.

Spatial statistics and Kriging [39] are used to model the distribution of ore grades and to forecast rainfall intensities, moisture, temperatures, or contaminant concentrations. The missing values are interpolated from the known measurements by Kriging [46, 30]. When considering space-time Kriging on fine meshes [66, 14, 28, 9], Kriging may easily exceed the computational power of modern computers. Estimating the Kriging variance and solving geostatistical optimal design problems are especially numerically intensive [48, 50, 60].

Kriging can be defined as follows. Let ŝ be the N × 1 vector of values to be estimated, with zero expectation and covariance matrix C_ss. Let y be the m × 1, m ≪ N, vector of measurements. The corresponding cross- and auto-covariance matrices are denoted by C_sy and C_yy, of size N × m and m × m, respectively. If the measurements are subject to error, an error covariance matrix R is included in C_yy. Using this notation, the Kriging estimate ŝ is given by ŝ = C_sy C_yy^{-1} y.

Task 4: Geostatistical design.

The goal of geostatistical optimal design is to optimize the sampling patterns from which the data values in 𝐲 will be obtained. The objective function that will be minimized is typically a scalar measure of either the conditional covariance matrix or the estimation variance (4.4). The two most common measures for geostatistical optimal design are φA and φC:

(2.1)\quad \varphi_A = N^{-1}\operatorname{trace}[\mathbf{C}_{ss|y}] \quad\text{and}\quad \varphi_C = \mathbf{z}^T \mathbf{C}_{ss|y}\,\mathbf{z},

where C_{ss|y} := C_{ss} − C_{sy} C_{yy}^{-1} C_{ys}, see [48, 50].

Task 5: Computing the joint Gaussian log-likelihood function.

We assume that z ∈ ℝ^N is an available vector of measurements, and θ is an unknown vector of the parameters of a covariance matrix C. The task is to compute the maximum likelihood estimation (MLE), where the log-likelihood function is as follows:

\mathcal{L}(\boldsymbol{\theta}) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log\det\{\mathbf{C}(\boldsymbol{\theta})\} - \frac{1}{2}\,\mathbf{z}^T\mathbf{C}(\boldsymbol{\theta})^{-1}\mathbf{z}.

The difficulty here is that each iteration step of a maximization procedure requires the solution of a linear system 𝐋𝐯=𝐳, the Cholesky decomposition, and the determinant.
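For reference, the following minimal NumPy sketch (an illustration, not the authors' implementation) evaluates this log-likelihood densely via a Cholesky factorization; its O(N^3) cost is exactly what the low-rank tensor techniques below aim to avoid. The exponential test covariance is chosen only for the example.

```python
import numpy as np

def gaussian_loglik(C, z):
    """Dense log-likelihood: Cholesky gives both log det C and the solve C^{-1} z."""
    N = len(z)
    L = np.linalg.cholesky(C)                    # C = L L^T, O(N^3) cost
    logdet = 2.0 * np.sum(np.log(np.diag(L)))    # log det C = 2 * sum(log diag L)
    v = np.linalg.solve(L, z)                    # solve L v = z
    quad = v @ v                                 # z^T C^{-1} z = ||L^{-1} z||^2
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

# Tiny example with an exponential covariance on a 1D grid.
x = np.linspace(0.0, 1.0, 200)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)
z = np.random.default_rng(0).standard_normal(200)
print(gaussian_loglik(C, z))
```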

Detailed solutions are given in Section 4. A rigorous definition of tensors is given in Section 3.

2.2 Matérn Covariance and Its Fourier Transform

A low-rank approximation of the covariance function is a key component of the tasks formulated above. Among the many covariance models available, the Matérn family [45, 25] is widely used in spatial statistics, geostatistics [7], machine learning [4], image analysis, weather forecasting, moisture modeling, and as the correlation for temperature fields [49]. The work [25] introduced the Matérn form of spatial correlations into statistics as a flexible parametric class with one parameter determining the smoothness of the underlying spatial random field.

The main idea of this low-rank approximation is shown in Figure 1 and explained in detail in Section 3.3. Figure 1 demonstrates two possible ways to find a low-rank tensor approximation of the Matérn covariance function. The first way (marked with "?") is not trivial; the second, via the fast Fourier transform (FFT), a low-rank decomposition, and the inverse FFT (IFFT), is easier. We use here the fact that the Fourier transform of the Matérn covariance is analytically known and has a known low-rank approximation. The IFFT can be computed numerically and does not change the tensor ranks.

Figure 1: Two possible ways to find a low-rank tensor approximation of the Matérn covariance matrix C_{ν,ℓ}(r).

The Matérn covariance function is defined as

(2.2)\quad C_{\nu,\ell}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)}\Big(\frac{\sqrt{2\nu}\,r}{\ell}\Big)^{\nu}\mathcal{K}_{\nu}\Big(\frac{\sqrt{2\nu}\,r}{\ell}\Big),

where r := ‖x − y‖ is the distance between two points x, y ∈ ℝ^d, ν > 0 defines the smoothness of the random field, and 𝒦_ν denotes the modified Bessel function of the second kind of order ν. The larger ν, the smoother the random field. The parameter ℓ > 0 is a spatial range parameter that measures how quickly the correlation of the random field decays with distance; a larger ℓ corresponds to a slower decay (keeping ν fixed). When ν = 1/2 (see [55]), the Matérn covariance function reduces to the exponential covariance model and describes a rough field. The value ν = ∞ corresponds to a Gaussian covariance model, which describes a very smooth field that is infinitely differentiable. Random fields with a Matérn covariance function are ⌈ν⌉ − 1 times mean-square differentiable. Thus, the hyper-parameter ν controls the degree of smoothness.
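As a small illustration (not the authors' code), formula (2.2) can be evaluated directly with SciPy's modified Bessel function K_ν; for ν = 1/2 it reproduces the exponential model mentioned above.

```python
import numpy as np
from scipy.special import gamma, kv   # kv = modified Bessel function K_nu

def matern(r, nu=0.5, ell=1.0):
    """Matérn covariance (2.2); C(0) = 1 is used by continuity."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    s = np.sqrt(2.0 * nu) * r / ell
    c = np.ones_like(s)
    nz = s > 0
    c[nz] = 2.0 ** (1.0 - nu) / gamma(nu) * s[nz] ** nu * kv(nu, s[nz])
    return c

r = np.linspace(0.0, 3.0, 7)
# nu = 1/2, ell = 1 should match exp(-r) up to rounding errors.
print(np.max(np.abs(matern(r, nu=0.5, ell=1.0) - np.exp(-r))))
```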

The d-dimensional Fourier transform F_d(C(r, ν)) of the Matérn kernel defined in equation (2.2) is given in ℝ^d by [45]

(2.3)\quad \boldsymbol{U}(\xi) := \boldsymbol{F}_d\big(C(r,\nu)\big) = \beta\Big(1 + \frac{\ell^2}{2\nu}\,|\xi|^2\Big)^{-\nu-\frac{d}{2}},

where β = β(ν, ℓ, n) is a constant and |ξ| is the Euclidean norm in ℝ^d. The following tensor approach also applies to the case of an anisotropic distance, where r^2 = ⟨A(x − y), (x − y)⟩ and A is a positive diagonal d × d matrix.

3 Low-Rank Tensor Decompositions

In this section, we review the definitions of the CP and Tucker tensor formats. Then we provide an analytic sinc-based proof of the existence of low-rank tensor approximations of Matérn functions. We investigate numerically the behavior of the Tucker and CP ranks across a wide range of parameters specific to the family of Matérn kernels in equation (2.3). The Tucker tensor format is used in this work for additional rank compression of the CP factors. There are no reliable algorithms for computing the CP decomposition, which can be difficult to obtain, but such algorithms do exist for the Tucker decomposition. The Tucker decomposition is only limited by the available memory, since the term r^d in O(drn + r^d) grows exponentially with d.

3.1 General Definitions

CP and Tucker rank-structured tensor formats have long been applied to the quantitative analysis of correlations in multidimensional experimental data in chemometrics and signal processing [59, 8]. The Tucker tensor format was introduced in 1966 for the tensor decomposition of multidimensional arrays in chemometrics [65]. Though the canonical representation of multivariate functions was introduced as early as 1927 [29], only the Tucker tensor format provides a stable algorithm for the decomposition of full-size tensors. A mathematical justification of the Tucker decomposition algorithm was presented in papers on the higher-order singular value decomposition (HOSVD) and the Tucker ALS algorithm for orthogonal Tucker approximation of higher-order tensors [10]. For higher dimensions, the so-called matrix product states (MPS) (see the survey paper [57]) or the tensor train (TT) [53] decompositions can be applied. However, for three-dimensional applications, the Tucker and CP tensor formats remain the best choices. The fast convergence of the Tucker decomposition was proved and demonstrated numerically for higher-order tensors arising from the discretization of linear operators and functions in ℝ^d; for a class of function-related tensors, and Green's kernels in particular, the approximation error of the Tucker decomposition was found to decay exponentially in the Tucker rank [33, 36].

These results inspired the canonical-to-Tucker (C2T) and Tucker-to-canonical (T2C) decompositions for function-related tensors in the case of large input ranks, as well as the multigrid Tucker approximation [37].

A tensor of order d in a full format is defined as a multidimensional array over a d-tuple index set:

(3.1)\quad \mathbf{A} = [a_{i_1,\dots,i_d}] \equiv [a(i_1,\dots,i_d)] \in \mathbb{R}^{n_1\times\cdots\times n_d} \quad\text{with}\quad i_\ell\in I_\ell := \{1,\dots,n_\ell\}.

Here, A is an element of the linear space

\mathbb{V}_n = \bigotimes_{\ell=1}^{d}\mathbb{V}_\ell, \quad \mathbb{V}_\ell = \mathbb{R}^{I_\ell},

equipped with the Euclidean scalar product ⟨·,·⟩ : V_n × V_n → ℝ, defined as

\langle\mathbf{A},\mathbf{B}\rangle := \sum_{(i_1,\dots,i_d)\in\mathcal{I}} a_{i_1\dots i_d}\, b_{i_1\dots i_d} \quad\text{for } \mathbf{A},\mathbf{B}\in\mathbb{V}_n.

Tensors with all dimensions having equal size n_ℓ = n, ℓ = 1,…,d, are called n^{⊗d} tensors. The required storage size scales exponentially with the dimension, n^d, which results in the so-called "curse of dimensionality".

To avoid exponential scaling in the dimension, the rank-structured separable representations (approximations) of the multidimensional tensors can be used. The simplest separable element is given by the rank-1 tensor

\mathbf{U} = \mathbf{u}^{(1)}\otimes\cdots\otimes\mathbf{u}^{(d)} \in \mathbb{R}^{n_1\times\cdots\times n_d},

with entries u_{i_1,…,i_d} = u^{(1)}_{i_1} ⋯ u^{(d)}_{i_d}, which requires only n_1 + ⋯ + n_d numbers for storage.

The rank-1 canonical tensor is a discrete counterpart of the separable d-variate function, which can be represented as the product of univariate functions

f(x_1, x_2, \dots, x_d) = f_1(x_1)\, f_2(x_2)\cdots f_d(x_d).

An example of a separable d-variate function is f(x_1, x_2, x_3) = e^{x_1 + x_2 + x_3}. By discretizing this multivariate function on a tensor grid in a computational box, we obtain a canonical rank-1 tensor.
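The following NumPy sketch (an illustrative example) discretizes this separable function on a tensor grid and checks that the resulting order-3 array equals the outer product of its one-dimensional factors, i.e., a canonical rank-1 tensor.

```python
import numpy as np

# Discretize f(x1,x2,x3) = exp(x1 + x2 + x3) on a tensor grid.
n = 50
x = np.linspace(0.0, 1.0, n)
u = np.exp(x)                                   # the same 1D factor in every direction
full  = np.exp(x[:, None, None] + x[None, :, None] + x[None, None, :])
rank1 = np.einsum("i,j,k->ijk", u, u, u)        # outer product u (x) u (x) u
print(np.allclose(full, rank1))                 # True: 3n numbers instead of n^3
```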

A tensor in the R-term canonical format is defined by a finite sum of rank-1 tensors (Figure 2, left)

(3.2)\quad \mathbf{A}_c = \sum_{k=1}^{R}\xi_k\, \mathbf{u}_k^{(1)}\otimes\cdots\otimes\mathbf{u}_k^{(d)}, \quad \xi_k\in\mathbb{R},

where u_k^{(ℓ)} ∈ ℝ^{n_ℓ} are normalized vectors and R is the canonical rank. The storage cost of this parametrization is bounded by dRn. An element a(i_1,…,i_d) of the tensor A = \sum_{i=1}^{R}\bigotimes_{\nu=1}^{d}\mathbf{u}_i^{(\nu)} can be computed as

a(i_1,\dots,i_d) = \sum_{\alpha=1}^{R} u^{(1)}(i_1,\alpha)\, u^{(2)}(i_2,\alpha)\cdots u^{(d)}(i_d,\alpha).

An alternative (contracted product) notation is used in the computer science community:

(3.3)\quad \mathbf{A} = \mathbf{C}\times_1 U^{(1)}\times_2 U^{(2)}\times_3\cdots\times_d U^{(d)},

where C = diag{c_1,…,c_R} is the diagonal core tensor and U^{(ℓ)} = [u_1^{(ℓ)} … u_R^{(ℓ)}] ∈ ℝ^{n_ℓ×R}. An analogous multivariate function can be represented by a sum of products of univariate functions,

f(x_1, x_2, \dots, x_d) = \sum_{k=1}^{R} f_{1,k}(x_1)\, f_{2,k}(x_2)\cdots f_{d,k}(x_d).

For d ≥ 3, there are no stable algorithms for computing the canonical rank of a tensor A, that is, the minimal number R in representation (3.2), and the respective decomposition with polynomial cost in d; i.e., the computation of the canonical decomposition is an NP-hard problem [27].

Figure 2: Canonical (left) and Tucker (right) decompositions of three-dimensional tensors.

The Tucker tensor format (Figure 2, right) is suitable for stable numerical decompositions with a fixed truncation threshold. We say that the tensor A is represented in the rank-r orthogonal Tucker format with the rank parameter r = (r_1,…,r_d) if

\mathbf{A} = \sum_{\nu_1=1}^{r_1}\cdots\sum_{\nu_d=1}^{r_d}\beta_{\nu_1,\dots,\nu_d}\,\mathbf{v}^{(1)}_{\nu_1}\otimes\cdots\otimes\mathbf{v}^{(d)}_{\nu_d},

where \{\mathbf{v}^{(\ell)}_{\nu_\ell}\}_{\nu_\ell=1}^{r_\ell}\subset\mathbb{R}^{n_\ell} is a set of orthonormal vectors for ℓ = 1,…,d, and B = [β_{ν_1,…,ν_d}] ∈ ℝ^{r_1×⋯×r_d} is the Tucker core tensor. The storage cost for the Tucker tensor is bounded by drn + r^d, with r = |r| := max_ℓ r_ℓ. Using the orthogonal side matrices V^{(ℓ)} = [v_1^{(ℓ)} … v_{r_ℓ}^{(ℓ)}] and contracted products, the Tucker tensor decomposition may be presented in the alternative notation

\mathbf{A}_{(\mathbf{r})} = \mathbf{B}\times_1 V^{(1)}\times_2 V^{(2)}\times_3\cdots\times_d V^{(d)}.

In the case d=2, the orthogonal Tucker decomposition is equivalent to the singular value decomposition (SVD) of a rectangular matrix.

3.2 Tucker Decomposition of Full Format Tensors

We use the following algorithm to compute the Tucker decomposition of a full format tensor. The most time-consuming part of the Tucker algorithm is the higher-order singular value decomposition (HOSVD), i.e., the computation of the initial guess for the matrices V^{(ℓ)} using the SVD of the matrix unfolding A_{(ℓ)}, ℓ = 1, 2, 3 (Figure 3), of the original tensor along each mode [10]. Figure 3 illustrates the matrix unfolding of the full format tensor A (see (3.1)) along the index set I_1 = {1,…,n_1}.

Figure 3: Unfolding of a three-dimensional tensor along the mode I_ℓ with ℓ = 1.

The second part of the algorithm is the ALS procedure. For every tensor mode, a "single-hole" tensor of reduced size is constructed by mapping all modes of the original tensor except one into the subspaces V^{(ℓ)}. Then the subspace V^{(ℓ)} for the current mode is updated by the SVD of the unfolding of the "single-hole" tensor for this mode. This alternates over all modes of the tensor, which are updated at the current ALS iteration. The final step of the algorithm is the computation of the core tensor using the final mapping matrices from the ALS.

The numerical cost of the Tucker decomposition for full-size tensors is dominated by the initial guess, which is estimated as O(n^{d+1}) when all n_ℓ = n are equal, or O(n^4) for our three-dimensional case. This step restricts the available size of the tensor to be decomposed since, on conventional computers, the three-dimensional case n > 10^2 is the limiting case for the SVD.
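The sketch below (a minimal NumPy illustration, not the authors' implementation) performs the truncated HOSVD step just described — one SVD per mode unfolding, followed by contraction with the transposed side matrices to obtain the core — for a sampled Slater kernel; the subsequent ALS refinement is omitted.

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` matrix unfolding of a tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Mode-`mode` product: apply the matrix M to the `mode`-th axis of T."""
    Tm = np.tensordot(T, M, axes=(mode, 1))   # contract with the columns of M
    return np.moveaxis(Tm, -1, mode)          # move the new axis back into place

def hosvd(T, ranks):
    """Truncated HOSVD: one SVD per mode unfolding gives the side matrices V^(l);
    the core is B = T x_1 V1^T x_2 V2^T x_3 V3^T (initial guess only, no ALS)."""
    V = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        V.append(U[:, :r])
    B = T
    for mode, Vl in enumerate(V):
        B = mode_product(B, Vl.T, mode)
    return B, V

# Sample the Slater kernel exp(-||x||) on a 64^3 grid and compress it.
n, r = 64, 8
x = np.linspace(-1.0, 1.0, n)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
T = np.exp(-np.sqrt(X**2 + Y**2 + Z**2))

B, V = hosvd(T, (r, r, r))
T_r = B
for mode, Vl in enumerate(V):
    T_r = mode_product(T_r, Vl, mode)         # reconstruct B x_1 V1 x_2 V2 x_3 V3
print("relative Frobenius error:", np.linalg.norm(T - T_r) / np.linalg.norm(T))
```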

The multigrid Tucker algorithm for full-size tensors allows the computational complexity to be linear in the full size of the tensor, O(n^d), i.e., O(n^3) for three-dimensional tensors [37]. It is computed on a sequence of dyadically refined grids and is based on implementing the HOSVD only at the coarsest grid level, n_0 ≪ n. The initial guess for the ALS procedure is computed at each refined level by interpolation of the dominating Tucker subspaces obtained from the previous coarser grid. In this way, at fine three-dimensional Cartesian grids, we need only O(n^3) storage (to represent the initial full format tensor) to contract with the Tucker side matrices obtained by the Tucker approximation via ALS on the previous grids.

3.3 Illustration of the Low-Rank Approximation Idea

In this subsection, we describe a possible way to find a low-rank tensor approximation of the Matérn covariance matrix C(r, ν) (Figure 1). Let F_d = ⊗_{ν=1}^{d} F_ν be the d-dimensional Fourier transform, let F^{-d} = ⊗_{ν=1}^{d} F_ν^{-1} be its inverse, and let ⊗ denote the Kronecker product. We assume that U(ξ) = F_d(C(r, ν)) is known analytically and has a low-rank tensor approximation U = Σ_{i=1}^{r} ⊗_{ν=1}^{d} u_{iν}. Since the Fourier and inverse Fourier transforms do not change the Kronecker tensor rank of the argument [51], applying the inverse Fourier transform yields a low-rank representation of the covariance function:

\boldsymbol{F}^{-d}(\mathbf{U}) = \Big(\bigotimes_{\nu=1}^{d}\boldsymbol{F}_{\nu}^{-1}\Big)\sum_{i=1}^{r}\Big(\bigotimes_{\nu=1}^{d}\mathbf{u}_{i\nu}\Big) = \sum_{i=1}^{r}\bigotimes_{\nu=1}^{d}\big(\boldsymbol{F}_{\nu}^{-1}(\mathbf{u}_{i\nu})\big) = \sum_{i=1}^{r}\bigotimes_{\nu=1}^{d}\tilde{\mathbf{u}}_{i\nu} =: C(r,\nu),

where \tilde{\mathbf{u}}_{i\nu} := \boldsymbol{F}_{\nu}^{-1}(\mathbf{u}_{i\nu}).

3.4 Sinc Approximation of the Matérn Function

The sinc method provides a constructive approximation of multivariate functions in the form of a low-rank canonical representation. It can also be used for theoretical proofs and for rank estimation. Methods for the separable approximation of the three-dimensional Newton kernel and many other spherically symmetric functions using Gaussian sums have been developed since the initial studies in the chemical [5] and mathematical [63, 6, 23, 17] literature. Here, we use a tensor-decomposition approach for lattice-structured interaction potentials [32]. We also recall the grid-based method for a low-rank canonical representation of a spherically symmetric kernel function q(‖x‖), where x ∈ ℝ^d, d = 2, 3,…, by its projection onto a set of piecewise constant basis functions; see [3] for the case of the Newton and Yukawa kernels for x ∈ ℝ^3.

Following the standard schemes, we introduce the uniform n × n × n rectangular Cartesian grid Ω_n with mesh size h = 2b/n (we assume that n is even) in the computational domain Ω = [−b, b]^3. Let {ψ_i} be a set of tensor-product piecewise constant basis functions,

\psi_{\mathbf{i}}(\mathbf{x}) = \prod_{\ell=1}^{3}\psi_{i_\ell}^{(\ell)}(x_\ell),

for the 3-tuple index i = (i_1, i_2, i_3) ∈ ℐ = I_1 × I_2 × I_3, with i_ℓ ∈ I_ℓ = {1,…,n}, where ℓ = 1, 2, 3. The generating kernel q(‖x‖) is discretized by its projection onto the basis set {ψ_i} in the form of a third-order tensor of size n × n × n, defined entry-wise as

(3.4)\quad \mathbf{Q} := [q_{\mathbf{i}}]\in\mathbb{R}^{n\times n\times n}, \qquad q_{\mathbf{i}} = \int_{\mathbb{R}^3}\psi_{\mathbf{i}}(\mathbf{x})\,q(\|\mathbf{x}\|)\,d\mathbf{x}.

The low-rank canonical decomposition of the third-order tensor Q is based on applying exponentially convergent sinc quadratures to an integral representation of the function q(p), p ∈ ℝ, of the form

q(p) = \int_{\mathbb{R}} a_1(t)\, e^{-p^2 a_2(t)}\,dt,

specified by the weights a_1(t), a_2(t) > 0. Figure 4 illustrates the scheme of the proof of existence of the canonical low-rank tensor approximation. It can be easier to apply the Laplace transform to the Fourier transform of a Matérn covariance matrix than to the Matérn covariance itself. To approximate the resulting Laplace integral, we apply the sinc quadrature. The number 2M + 1 of terms in the approximate sum (3.5) is the canonical tensor rank.

Figure 4: Scheme of the proof of existence of a low-rank tensor approximation, r = 2M + 1.

In particular, the sinc-quadrature for the Laplace–Gauss transform

(3.5)\quad q(p) = \int_{-\infty}^{+\infty} a(t)\, e^{-t^2 p^2}\,dt \approx \sum_{k=-M}^{M} a_k\, e^{-t_k^2 p^2} \quad\text{for } |p|>0,\ p\in\mathbb{R},

can be applied, where the quadrature points t_k and weights a_k are given by

(3.6)\quad t_k = k\,\mathfrak{h}_M \quad\text{and}\quad a_k = a(t_k)\,\mathfrak{h}_M, \quad\text{where } \mathfrak{h}_M = C_0\frac{\log(M)}{M},\ C_0>0.

Under the assumption 0 < a_0 ≤ |p| < ∞, this quadrature can be proven to provide an exponential convergence rate in M (uniformly in p) for a class of functions a(z) that are analytic in a certain strip |ℑz| ≤ D of the complex plane, such that the functions a_1(t) e^{-p^2 a_2(t)} decay polynomially or exponentially on the real axis. The exponential convergence of the sinc approximation in the number of terms (i.e., in the canonical rank R = 2M + 1) was analyzed elsewhere [63, 6, 23].

We assume that a representation similar to (3.5) exists for any fixed x = (x_1, x_2, x_3) ∈ ℝ^3 such that ‖x‖ > a_0 > 0. Then we apply the sinc-quadrature approximation (3.5)–(3.6) to obtain the separable expansion

(3.7)\quad q(\|x\|) = \int_{-\infty}^{+\infty} a(t)\, e^{-t^2\|x\|^2}\,dt \approx \sum_{k=-M}^{M} a_k\, e^{-t_k^2\|x\|^2} = \sum_{k=-M}^{M} a_k\prod_{\ell=1}^{3} e^{-t_k^2 x_\ell^2},

providing an exponential convergence rate in M:

\Big|q(\|x\|) - \sum_{k=-M}^{M} a_k\, e^{-t_k^2\|x\|^2}\Big| \le C\, e^{-\beta M} \quad\text{with some } C,\beta>0.

By combining (3.4) and (3.7), and taking into account the separability of the Gaussian basis functions, we arrive at a low-rank approximation of each entry of the tensor Q:

q_{\mathbf{i}} \approx \sum_{k=-M}^{M} a_k\int_{\mathbb{R}^3}\psi_{\mathbf{i}}(\mathbf{x})\, e^{-t_k^2\|\mathbf{x}\|^2}\,d\mathbf{x} = \sum_{k=-M}^{M} a_k\prod_{\ell=1}^{3}\int_{\mathbb{R}}\psi_{i_\ell}^{(\ell)}(x_\ell)\, e^{-t_k^2 x_\ell^2}\,dx_\ell.

Recalling that a_k > 0, we define the vectors q_k^{(ℓ)} as

\mathbf{q}_k^{(\ell)} = a_k^{1/3}\,\big[b_i^{(\ell)}(t_k)\big]_{i=1}^{n}\in\mathbb{R}^{n} \quad\text{with}\quad b_i^{(\ell)}(t_k) = \int_{\mathbb{R}}\psi_i^{(\ell)}(x_\ell)\, e^{-t_k^2 x_\ell^2}\,dx_\ell.

Then the third-order tensor Q can be approximated by the R-term (R = 2M + 1) canonical representation

(3.8)\quad \mathbf{Q}\approx\mathbf{Q}_R = \sum_{k=-M}^{M} a_k\bigotimes_{\ell=1}^{3}\mathbf{b}^{(\ell)}(t_k) = \sum_{k=-M}^{M}\mathbf{q}_k^{(1)}\otimes\mathbf{q}_k^{(2)}\otimes\mathbf{q}_k^{(3)}\in\mathbb{R}^{n\times n\times n},

where q_k^{(ℓ)} ∈ ℝ^n. Given a threshold ε > 0, M can be chosen as the minimal number such that, in the max-norm,

\|\mathbf{Q}-\mathbf{Q}_R\|\le\varepsilon\,\|\mathbf{Q}\|.

The skeleton vectors can be re-indexed via k ↦ k' = k + M + 1, q_k^{(ℓ)} ↦ q_{k'}^{(ℓ)} (k' = 1,…,R = 2M + 1), ℓ = 1, 2, 3. The symmetric canonical tensor Q_R ∈ ℝ^{n×n×n} in (3.8) approximates the three-dimensional symmetric kernel function q(‖x‖) (x ∈ Ω), centered at the origin, such that q_k^{(1)} = q_k^{(2)} = q_k^{(3)} (k = 1,…,R).

In some applications, the tensor may be given in the canonical tensor format, but with a large rank R and discretized on large n × n × ⋯ × n grids; the computation of the initial guess in the Tucker-ALS decomposition algorithm then becomes intractable. This situation may arise when composing the tensor approximation of complicated kernel functions from simple radial functions that can be represented in the low-rank CP format.

For such cases, the canonical-to-Tucker decomposition algorithm was introduced [37]. It is based on a minimization ALS procedure, similar to the Tucker algorithm described in Section 3.2 for full-size tensors, but the initial guess is computed by the SVD of the side matrices U^{(ℓ)} = [u_1^{(ℓ)} … u_R^{(ℓ)}] ∈ ℝ^{n_ℓ×R}, ℓ = 1, 2, 3; see (3.3). This scheme is the reduced HOSVD (RHOSVD), which does not require unfolding of the full tensor.

Another efficient rank-structured representation of the multidimensional tensors is the mixed-tensor format [31], which combines either the canonical-to-Tucker decomposition with the Tucker-to-canonical decomposition, or standard Tucker decomposition with the canonical-to-Tucker decomposition, in order to produce a canonical tensor from a full-size tensor.

3.5 Laplace Transform of the Covariance Matrix

Integral representations like (3.5) can be derived by applying the Laplace transform either directly to the Matérn covariance function or to its spectral density (2.3). For example, in the case of the Newton kernel, q(p) = 1/p, the Laplace–Gauss transform representation takes the form

\frac{1}{p} = \frac{2}{\sqrt{\pi}}\int_{0}^{+\infty} e^{-p^2 t^2}\,dt, \quad\text{where } p = \|x\| = \sqrt{x_1^2+x_2^2+x_3^2}.

In this case, q_k^{(ℓ)} = q_{-k}^{(ℓ)}, and the sum in (3.8) reduces to k = 0, 1,…,M, implying that R = M + 1. Similarly, the Laplace transform representation of the Slater function q(p) = e^{-2\sqrt{\alpha p}} (i.e., the exponential covariance) with p = ‖x‖^2 can be written as

q(p) = e^{-2\sqrt{\alpha p}} = \frac{\sqrt{\alpha}}{\sqrt{\pi}}\int_{0}^{+\infty} t^{-3/2}\exp\Big(-\frac{\alpha}{t} - p t\Big)\,dt.
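As a quick numerical illustration of (3.5)–(3.6) (a sketch, not the authors' code), the snippet below applies the sinc quadrature to the Gauss integral of the Newton kernel above, taking the integral over the whole real axis with the constant weight a(t) = 1/√π and the choice C_0 = 1; both choices are assumptions made only for this demonstration.

```python
import numpy as np

def newton_sinc(p, M, C0=1.0):
    """Sinc-quadrature approximation of 1/p = (1/sqrt(pi)) * int_R exp(-p^2 t^2) dt."""
    h = C0 * np.log(M) / M                      # step size h_M as in (3.6)
    t = np.arange(-M, M + 1) * h                # quadrature points t_k = k * h_M
    a = h / np.sqrt(np.pi)                      # weights a_k = a(t_k) * h_M (a(t) constant here)
    return np.sum(a * np.exp(-np.outer(p**2, t**2)), axis=1)

p = np.linspace(1.0, 10.0, 20)
for M in (10, 20, 40):                          # R = 2M + 1 terms
    print(M, np.max(np.abs(newton_sinc(p, M) * p - 1.0)))   # relative error decays with M
```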

When the Matérn spectral density in (2.3) has an even dimension parameter d = 2d_1, d_1 = 1, 2, and ν = 0, 1, 2,…, the Laplace transform

\frac{\eta!}{(p+a)^{\eta+1}} = \int_{0}^{+\infty} t^{\eta}\, e^{-a t}\, e^{-p t}\,dt

can be applied after substituting p = |ξ|^2 in

q(p) = \beta\Big(1 + \frac{\ell^2}{2\nu}\,p\Big)^{\eta}, \quad \eta = -\nu - d_1.

If −η = ν + d_1 = 1/2, 3/2, 5/2,…,(2k+1)/2,…, then the Laplace transform is

\frac{(2\eta)!\,\sqrt{\pi}}{\eta!\,4^{\eta}}\,\frac{1}{(p+a)^{\eta}\sqrt{p+a}} = \int_{0}^{+\infty}\frac{t^{\eta}}{\sqrt{t}}\, e^{-a t}\, e^{-p t}\,dt, \quad \eta\in\mathbb{N}.

3.6 Covariance Matrix in Rank-Structured Tensor Format

In what follows, we consider the CP approximation of the radial function q(r) in the positive sector, i.e., on the domain [0, b]^3 (by symmetry, the canonical tensor can be extended to the whole computational domain [−b, b]^3). Let the covariance function q(r) = C_{ν,ℓ}(r) in (2.2) be represented by a rank-R symmetric CP tensor on the n × n × n tensor grid Ω_n ⊂ Ω = [0, b]^3 described in the previous sections,

(3.9)\quad q(r)\mapsto\mathbf{Q}_R = \sum_{k=1}^{R}\mathbf{q}_k^{(1)}\otimes\mathbf{q}_k^{(2)}\otimes\mathbf{q}_k^{(3)}\in\mathbb{R}^{n\times n\times n},

with the same skeleton vectors q_k^{(ℓ)} ∈ ℝ^n for ℓ = 1, 2, 3.

We define the covariance matrix C = [c_{i,j}] entry-wise by

c_{\mathbf{i},\mathbf{j}} = C_{\nu,\ell}\big(\|x_{\mathbf{i}} - y_{\mathbf{j}}\|\big), \quad \mathbf{i},\mathbf{j}\in\mathcal{I}.

Using the tensor representation (3.9), we represent the large n^3 × n^3 matrix C in the rank-R Kronecker (tensor) format as

(3.10)\quad \mathbf{C}\approx\mathbf{C}_R = \sum_{k=1}^{R}\mathbf{Q}_k^{(1)}\otimes\mathbf{Q}_k^{(2)}\otimes\mathbf{Q}_k^{(3)},

where the symmetric Toeplitz matrix Q_k^{(ℓ)} = Toepl[q_k^{(ℓ)}] ∈ ℝ^{n×n}, ℓ = 1, 2, 3, is defined by its first column, which is specified by the skeleton vectors q_k^{(1)} = q_k^{(2)} = q_k^{(3)} in the decomposition (3.9).

Figure 5 illustrates eight selected canonical generating vectors from q_k^{(1)}, k = 1,…,R, R = 34, on a grid of size n = 2049 for the Slater function e^{-‖x‖^p}; these vectors define the corresponding Toeplitz matrices.

Figure 5: Selected eight canonical vectors from the full set q_k^{(1)}, k = 1,…,R; see (3.9).

A Toeplitz matrix can be multiplied by a vector in O(n log n) operations by embedding it into a circulant matrix. In general, the inverse of a Toeplitz matrix cannot be calculated in closed form, unlike circulant matrices, which can be diagonalized by the Fourier transform.
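The circulant embedding just mentioned takes only a few lines of NumPy (an illustrative sketch, not the authors' implementation): the symmetric Toeplitz matrix given by its first column is embedded into a circulant matrix of size 2n, which the FFT diagonalizes.

```python
import numpy as np

def toeplitz_matvec(c, x):
    """Multiply the symmetric Toeplitz matrix with first column c by x in O(n log n)."""
    n = len(c)
    # First column of the 2n x 2n circulant embedding: [c_0,...,c_{n-1}, 0, c_{n-1},...,c_1].
    circ = np.concatenate([c, [0.0], c[-1:0:-1]])
    y = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

# Sanity check against the dense Toeplitz matrix for a small n.
n = 256
c = np.exp(-np.arange(n) / 10.0)                # e.g. a sampled exponential covariance column
T = c[np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])]
x = np.random.default_rng(0).standard_normal(n)
print(np.allclose(T @ x, toeplitz_matvec(c, x)))
```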

In the rest of this subsection, we introduce a numerical scheme, based on certain specific properties of the skeleton matrices in the symmetric rank-R decomposition (3.10), that is capable of rank-structured calculations of analytic matrix functions ℱ(C_R). We discuss the most interesting examples, ℱ_1(C) = C^{-1} and ℱ_2(C) = C^{1/2}, where C = C_R.

Given a symmetric positive definite matrix A such that ‖A‖ = q < 1, the matrix-valued function ℱ(A) is given by the exponentially fast converging series

(3.11)\quad \mathcal{F}(\mathbf{A}) = \mathbf{E} + a_1\mathbf{A} + a_2\mathbf{A}^2 + \cdots,

where the matrix A, acting on the multidimensional index set, allows a low-rank Kronecker tensor decomposition. The low-rank tensor approximation of ℱ(A) can then be computed by the "add-and-compress" scheme, where each term in the series (3.11) is summed using the rank-truncation algorithm in the corresponding format.

To fit the rank-structured evaluation of ℱ_1(C) into the described framework, we propose a special rank-structured additive splitting of the covariance matrix C with an easily invertible dominating part. To that end, we construct the diagonal matrix Q_0^{(1)} by assembling all of the diagonal sub-matrices in Q_k^{(1)}, k = 1,…,R (in the following, we simplify the notation by omitting the upper index (1)):

\mathbf{Q}_0 := \sum_{k=1}^{R}\operatorname{diag}(\mathbf{Q}_k),

and modify each Toeplitz matrix Q_k by subtracting its diagonal part,

\mathbf{Q}_k\mapsto\widehat{\mathbf{Q}}_k := \mathbf{Q}_k - \operatorname{diag}(\mathbf{Q}_k), \quad k = 1,\dots,R.

Using the matrices defined above, we introduce the rank-(R+1) additive splitting of C, which is defined by the skeleton matrices Q_0 and Q̂_k since C is Kronecker symmetric,

\mathbf{C} = \mathbf{Q}_0\otimes\mathbf{Q}_0\otimes\mathbf{Q}_0 + \sum_{k=1}^{R}\widehat{\mathbf{Q}}_k^{(1)}\otimes\widehat{\mathbf{Q}}_k^{(2)}\otimes\widehat{\mathbf{Q}}_k^{(3)}.

Hence, we have

\mathbf{C}^{-1} = \big(\mathbf{Q}_0^{-1}\otimes\mathbf{Q}_0^{-1}\otimes\mathbf{Q}_0^{-1}\big)\Big(\mathbf{E} + \sum_{k=1}^{R}\mathbf{Q}_0^{-1}\widehat{\mathbf{Q}}_k^{(1)}\otimes\mathbf{Q}_0^{-1}\widehat{\mathbf{Q}}_k^{(2)}\otimes\mathbf{Q}_0^{-1}\widehat{\mathbf{Q}}_k^{(3)}\Big)^{-1}.

Likewise, since Q_0 ⊗ Q_0 ⊗ Q_0 is a scaled identity, we obtain

(3.12)\quad \mathbf{C}^{1/2} = \big(\mathbf{Q}_0^{1/2}\otimes\mathbf{Q}_0^{1/2}\otimes\mathbf{Q}_0^{1/2}\big)\Big(\mathbf{E} + \sum_{k=1}^{R}\mathbf{Q}_0^{-1}\widehat{\mathbf{Q}}_k^{(1)}\otimes\mathbf{Q}_0^{-1}\widehat{\mathbf{Q}}_k^{(2)}\otimes\mathbf{Q}_0^{-1}\widehat{\mathbf{Q}}_k^{(3)}\Big)^{1/2}.

We assume that ‖Q_0^{-1} Q̂_k^{(1)}‖ < 1 for k = 1,…,R in some norm; thus, we can apply the "add-and-compress" scheme described above.

We illustrate the "add-and-compress" computational scheme in the following example. We consider the covariance matrix C_R obtained by a rank-R sinc approximation of the Slater function e^{-‖x‖^p} with R = 40 on a grid with n = 1025 sampling points. Figure 6 demonstrates the decay of the matrix norms ‖Q_k^{(1)}‖ (left) and of the scaled, preconditioned matrices ‖Q_0^{-1} Q̂_k^{(1)}‖, k = 1,…,R (right). We use a scaling factor of 1/n. The right plot indicates that analytic matrix functions ℱ(C_R) can be evaluated using an exponentially fast convergent power series supported by the "add-and-compress" strategy to control the tensor rank.

Figure 6: Scaled norms ‖Q_k^{(1)}‖ (left) and ‖Q_0^{-1} Q̂_k^{(1)}‖ (right) vs. k = 1,…,R.

In the Kriging calculations (Task 3 above), the low-rank tensor structure of the covariance matrix C_R can be directly utilized if the sampling points in the Kriging algorithm form a smaller m_1 × m_2 × m_3 tensor sub-grid of the initial n × n × n tensor grid Ω_n with m_ℓ < n. The same argument also applies to the evaluation of the conditional covariance. In the general case of "non-tensor" locations of the sampling points, some mixed tensor factorizations could be applied.

3.7 Numerical Illustrations

In what follows, we check some examples of the low-rank Tucker tensor approximation of the p-Slater function C(x) = e^{-‖x‖^p} and of Matérn kernels with a full-grid tensor representation. We demonstrate the fast exponential convergence of the tensor approximation in the Tucker rank. The functions were sampled on an n_1 × n_2 × n_3 three-dimensional Cartesian grid with n_ℓ = 100, ℓ = 1, 2, 3.

For a continuous function q : Ω → ℝ, where Ω := [−b, b]^d ⊂ ℝ^d and 0 < b < ∞, we introduce the collocation-type function-related tensor of order d:

\mathbf{Q}\equiv\mathbf{Q}(q) := [q_{i_1\dots i_d}]\in\mathbb{R}^{I_1\times\cdots\times I_d} \quad\text{with}\quad q_{i_1\dots i_d} := q\big(x_{i_1}^{(1)},\dots,x_{i_d}^{(d)}\big),

where (x_{i_1}^{(1)},…,x_{i_d}^{(d)}) ∈ ℝ^d are grid collocation points, indexed by ℐ = I_1 × ⋯ × I_d,

x_{i_\ell}^{(\ell)} = -b + (i_\ell - 1)\,h, \quad i_\ell = 1, 2,\dots,n_\ell,\ \ell = 1,\dots,d,

which are the nodes of equally spaced subintervals with mesh size h = 2b/(n_ℓ − 1).

We test the convergence of the error in the relative Frobenius norm with respect to the Tucker rank for p-Slater functions with p = 0.1, 0.2, 1.9, 2.0. The relative error in the Frobenius norm is computed as

(3.13)\quad E_{FN} = \frac{\|\mathbf{Q}-\mathbf{Q}_{(r)}\|_F}{\|\mathbf{Q}\|_F},

where Q_{(r)} is the tensor reconstructed from the rank-r Tucker decomposition of Q.

Figure 7: Convergence of the Frobenius error (3.13) with respect to the Tucker rank for the function (3.14) with p = 0.1, 0.2, 1.9, 2.0 (left); decay of the singular values of the weighted Slater function (right).

Figure 8: Cross section of the three-dimensional radial function (3.14) with p = 0.1 (left) and p = 1.9 (right) at level z = 0.

Figure 9: Multigrid Tucker: convergence with respect to the Tucker rank of a Slater function with p = 1 on a sequence of grids.

Figure 10: Convergence with respect to the Tucker rank of the three-dimensional spectral density of the Matérn covariance (3.15) with α = 0.1 (left) and α = 100 (right).

Figure 7 shows the convergence of the Frobenius error with respect to the Tucker rank for the p-Slater function discretized on the three-dimensional Cartesian grid,

(3.14)\quad C(\mathbf{x},\mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^{p}},

for different values of the parameter p. These functions are illustrated in Figure 8 for p = 0.1 and p = 1.9. Figure 9 shows, for a Slater function with p = 1, the dependence of the Tucker decomposition error in the Frobenius norm (3.13) on the Tucker rank for increasing grid parameter n.

Figure 10 shows the convergence with respect to the Tucker rank for the spectral density of the Matérn covariance,

(3.15)\quad f_{\alpha,\nu}(\rho) := \frac{C}{(\alpha^2+\rho^2)^{\nu+\frac{d}{2}}},

where α ∈ [0.1, 100] and d = 1, 2, 3. The Tucker decomposition rank depends strongly on the parameter α and only weakly on the parameter ν. The three-dimensional Matérn spectral densities with parameters ν = 0.4, α = 0.1 (left) and ν = 0.4, α = 100 (right) are presented in Figure 11, showing the function at z = 0.

These numerical experiments demonstrate the good algebraic separability of typical multidimensional functions used in spatial statistics, which leads us to apply low-rank tensor decomposition methods to multidimensional problems of statistical data analysis.

Figure 11: The shape of the three-dimensional spectral density of the Matérn covariance (3.15) with α = 0.1 (left) and α = 100 (right).

4 Solutions to Typical Tasks in Low-Rank Tensor Format

In this section, we walk through the solutions to the statistical tasks raised in the motivation above, Section 2. We add some lemmas to summarize the new computing and storage costs. Let N = n^d, let the measurement vector z ∈ ℝ^m, and let C_ss ∈ ℝ^{N×N}, C_zz ∈ ℝ^{m×m}, and C_sz ∈ ℝ^{N×m}. We also introduce the restriction operator P, which consists only of ones and zeros and picks the sub-indices {i_1,…,i_m} from the whole index set {i_1,…,i_N}. This operator P has tensor rank-1 structure, i.e., P = ⊗_{ν=1}^{d} P_ν. Applying this restriction tensor does not change tensor ranks.

Computing Matrix-Vector Product.

Let C = Σ_{i=1}^{r} ⊗_{μ=1}^{d} C_{iμ}. If z is separable, i.e., ‖z − Σ_{j=1}^{r_z} ⊗_{ν=1}^{d} z_{jν}‖ ≤ ε, then

\mathbf{C}\mathbf{z} = \sum_{i=1}^{r}\sum_{j=1}^{r_z}\bigotimes_{\mu=1}^{d}\mathbf{C}_{i\mu}\mathbf{z}_{j\mu}.

If 𝐳 is non-separable, then low-rank tensor properties cannot be employed and either the FFT idea [51] or the hierarchical matrix technique should be applied instead [38, 21, 19, 22].

Lemma 4.1.

The computing cost of the product Cz is reduced from O(N^2) to O(r r_z d n^2), where N = n^d, d ≥ 1.
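The following NumPy sketch (an illustration of the formula above, not the authors' code) computes the matrix-vector product in the Kronecker format by multiplying only the small n × n factors and checks the result against the dense product for a tiny example.

```python
import numpy as np
from functools import reduce

def kron_matvec(C_factors, z_factors):
    """Factors of C z for C = sum_i kron_mu C[i][mu] and z = sum_j kron_mu z[j][mu];
    only d small n x n products per term, i.e. O(r * r_z * d * n^2) work."""
    return [[Ci[mu] @ zj[mu] for mu in range(len(Ci))]
            for Ci in C_factors for zj in z_factors]

rng = np.random.default_rng(0)
d, n, r, rz = 3, 4, 2, 2
C_factors = [[rng.standard_normal((n, n)) for _ in range(d)] for _ in range(r)]
z_factors = [[rng.standard_normal(n) for _ in range(d)] for _ in range(rz)]

dense = lambda factors: sum(reduce(np.kron, f) for f in factors)   # for verification only
print(np.allclose(dense(C_factors) @ dense(z_factors),
                  dense(kron_matvec(C_factors, z_factors))))
```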

Trace and Diagonal of 𝐂.

Let C ≈ C̃ = Σ_{i=1}^{r} ⊗_{μ=1}^{d} C_{iμ}. Then

(4.1)\quad \operatorname{diag}(\tilde{\mathbf{C}}) = \operatorname{diag}\Big(\sum_{i=1}^{r}\bigotimes_{\mu=1}^{d}\mathbf{C}_{i\mu}\Big) = \sum_{i=1}^{r}\bigotimes_{\mu=1}^{d}\operatorname{diag}(\mathbf{C}_{i\mu}),
(4.2)\quad \operatorname{trace}(\tilde{\mathbf{C}}) = \operatorname{trace}\Big(\sum_{i=1}^{r}\bigotimes_{\mu=1}^{d}\mathbf{C}_{i\mu}\Big) = \sum_{i=1}^{r}\prod_{\mu=1}^{d}\operatorname{trace}(\mathbf{C}_{i\mu}).

The proof follows from the properties of the Kronecker tensors.
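A short NumPy check of identity (4.2) (an illustrative sketch) for a small random example, comparing the factored trace with the trace of the explicitly assembled Kronecker sum:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(1)
d, n, r = 3, 5, 4
C_factors = [[rng.standard_normal((n, n)) for _ in range(d)] for _ in range(r)]

trace_lowrank = sum(np.prod([np.trace(M) for M in Ci]) for Ci in C_factors)  # O(r d n)
trace_dense = np.trace(sum(reduce(np.kron, Ci) for Ci in C_factors))         # needs n^d storage
print(np.isclose(trace_lowrank, trace_dense))
```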

Lemma 4.2.

The cost of computing the right-hand sides in (4.1)–(4.2) is O(rdn), where C_{iμ} ∈ ℝ^{n×n}.

For simplicity, we assume that n_1 = n_2 = ⋯ = n_d = n and Σ_{i=1}^{d} n_i = dn.

Lemma 4.3.

The computing cost of diag(C) and trace(C) is reduced from O(N) to O(rdn). The cost of det(C) is reduced from O(N^3) to O(dn^3).

Example 4.1.

A simple MATLAB test for computing trace(C) on a workstation with 128 GB RAM produces the computing times in seconds shown in Table 1.

Table 1

Computing time (in seconds) to set up and compute the trace of C̃ = Σ_{j=1}^{r} ⊗_{ν=1}^{d} C_{jν}, r = 10, C̃ ∈ ℝ^{N×N}, where N = n^d, d = 1000, and n ∈ {100, 500, 1000}. A modern desktop computer with 128 GB RAM was used.

d \ n    100    500    1000
1000     3.7    67     491

In what follows, we discuss the computation of the trace and diagonal in the case of a Tucker representation of the generating Matérn function on a (2k+1)^d grid and present the corresponding numerical example for d = 3. We note that, due to the Toeplitz structure of the skeleton matrices of size (2k+1) × (2k+1), the diagonal of the covariance matrix C is the weighted Tucker sum of Kronecker products of scaled identity matrices in ℝ^{2k+1}. The scaling factor is determined by the value of the generating Tucker vector v_ν^{(ℓ)}(k+1) corresponding to the origin of the computational box [−b, b]^d. Let the matrix C be composed by using the Tucker tensor A approximating the generating Matérn function. Then, as a straightforward consequence of the above remark, we derive the simple representations

\operatorname{diag}(\mathbf{C}) = \mathbf{A}(x_0)\,\mathbf{E}^{(d)}, \qquad \operatorname{trace}(\mathbf{C}) = \mathbf{A}(x_0)\,(2k+1)^d,

where x_0 corresponds to the origin x = 0 in [−b, b]^d and E^{(d)} is the identity matrix in the full tensor space. The grid coordinate of x_0 is determined by the multi-index (k+1,…,k+1).

Table 2 shows the values of A(x_0) computed by the Tucker approximation of the three-dimensional Slater function e^{-‖x‖} on (2k+1)^3 grids with n = 2k + 1 = 129, 257, 513, and for Tucker rank parameters r = 1, 2,…,10.

Table 2

The error of the Tucker approximation to the value A(x_0) at the origin (the exact value is equal to 1) versus the Tucker rank r and the grid size n = 2k + 1.

n \ r    1      2     3     4      5      6      7      8       9       10
129      0.386  0.20  0.12  0.07   0.04   0.017  0.002  1.2e-4  8.4e-6  7.5e-6
257      0.386  0.20  0.12  0.073  0.046  0.029  0.017  0.007   8.0e-4  1.4e-5
513      0.386  0.20  0.12  0.073  0.047  0.031  0.020  0.0138  0.008   0.0035

Given the rank-r Tucker tensor A, the complexity of evaluating A(x_0) is estimated as O(r^d).

Computing the Square Root C^{1/2}.

Observe that C^{1/2} can be computed as in (3.12). An iterative method for computing C^{1/2} is presented in [16].

Linear Solvers in a Low-Rank Tensor Format.

Fortunately, there is already a well-developed theory for solving linear systems Cw = z with a symmetric positive definite matrix C in a tensor format. We refer to the overview works [34, 35, 18, 20]. Some particular linear solvers are developed in [13, 2, 24, 36, 12, 11]. We also recommend the Tensor Toolbox [41], which contains routines for the CP and Tucker tensor formats.

4.1 Computing z^T C^{-1} z

Lemma 4.4.

Let ‖z − Σ_{j=1}^{r_z} ⊗_{μ=1}^{d} z_{jμ}‖ ≤ ε. We assume that there is an iterative method that can solve the linear system Cw = z in a low-rank tensor format and find the solution in the form w = Σ_{i=1}^{r} ⊗_{μ=1}^{d} w_{iμ}. Then the quadratic form z^T C^{-1} z is given by the following sum of scalar products:

\mathbf{z}^T\mathbf{C}^{-1}\mathbf{z} = \sum_{i=1}^{r}\sum_{j=1}^{r_z}\prod_{\mu=1}^{d}(\mathbf{w}_{i\mu},\mathbf{z}_{j\mu}).

The proof follows from the definition and properties of the tensor and scalar products. If z is non-separable, then the low-rank tensor properties cannot be exploited, and the FFT idea [51] or the hierarchical matrix technique [38] should be used instead.

Lemma 4.5.

The computing cost of the quadratic form z^T C^{-1} z is the product of the number of required iterations and the cost of one iteration, which is O(r r_z d m^2) (assuming that the iterative method requires only matrix-vector products).

The proof follows from the definitions and properties of the tensor and scalar products.
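A minimal NumPy check of the scalar-product formula in Lemma 4.4 (an illustrative sketch), assuming the factors of w = C^{-1} z have already been produced by some low-rank solver:

```python
import numpy as np
from functools import reduce

def quad_form(w_factors, z_factors):
    """z^T C^{-1} z = sum_i sum_j prod_mu (w_{i,mu}, z_{j,mu}) from the factors of w and z."""
    return sum(np.prod([wi_mu @ zj_mu for wi_mu, zj_mu in zip(wi, zj)])
               for wi in w_factors for zj in z_factors)

rng = np.random.default_rng(2)
d, n = 3, 4
w_factors = [[rng.standard_normal(n) for _ in range(d)] for _ in range(2)]
z_factors = [[rng.standard_normal(n) for _ in range(d)] for _ in range(3)]
w = sum(reduce(np.kron, wi) for wi in w_factors)    # dense w, for verification only
z = sum(reduce(np.kron, zj) for zj in z_factors)    # dense z
print(np.isclose(w @ z, quad_form(w_factors, z_factors)))
```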

4.2 Interpolation by Simple Kriging

The three most computationally demanding tasks in Kriging are:

  1. solving an M×M system of equations to obtain the Kriging weights,

  2. obtaining the N×1 Kriging estimate by superposing the Kriging weights with the N×M cross-covariance matrix between the measurements and the unknowns,

  3. evaluating the N×1 estimation variance as the diagonal of an N×N conditional covariance matrix [51].

Here, M refers to the number of measured data values, and N refers to the number of estimation points. When optimizing the design of the sampling patterns, the challenge is to evaluate the scalar measures of the N×N conditional covariance matrix (see φA and φC in equation (2.1) and Task 4) repeatedly within a high-dimensional and non-linear optimization procedure (e.g., [42, 58]).

The following Kriging formula is well known [51]:

\hat{\mathbf{s}} = \mathbf{C}_{sz}\mathbf{C}_{zz}^{-1}\mathbf{z}.

Lemma 4.6.

If ‖C_{sz} − Σ_{i=1}^{r_C} ⊗_{μ=1}^{d} C_{iμ}‖ ≤ ε for some small ε ≥ 0 and Lemma 4.4 holds, then

(4.3)\quad \mathbf{C}_{sz}\mathbf{C}_{zz}^{-1}\mathbf{z}\approx\sum_{i=1}^{r_z}\sum_{j=1}^{r_C}\bigotimes_{\nu=1}^{d}\mathbf{C}_{j\nu}\mathbf{w}_{i\nu},

where w = Σ_{i=1}^{r_z} ⊗_{ν=1}^{d} w_{iν} is the low-rank solution of C_{zz} w = z.

The proof follows from the definitions and the properties of the tensor and scalar products.

Lemma 4.7.

The computing cost of computing C_{zz}^{-1} z is O(#iters · r_z r d m^2). The computation of the Kriging coefficients by equation (4.3) costs O(r_z r_C d n m) + O(#iters · r_z r d m^2).

If 𝐳 is non-separable, then the low-rank tensor properties cannot be employed and either the FFT idea [51] or the hierarchical matrix technique [38] should be applied.

4.3 Computing Conditional Covariance

Let y ∈ ℝ^m be the vector of measurements. The conditional covariance matrix is

\mathbf{C}_{ss|y} = \mathbf{C}_{ss} - \mathbf{C}_{sy}\mathbf{C}_{yy}^{-1}\mathbf{C}_{ys}.

The associated estimation variance σ̂_s is the diagonal of the N × N conditional covariance matrix C_{ss|y}:

(4.4)\quad \hat{\boldsymbol{\sigma}}_{\mathbf{s}} = \operatorname{diag}(\mathbf{C}_{ss|y}) = \operatorname{diag}\big(\mathbf{C}_{ss} - \mathbf{C}_{sy}\mathbf{C}_{yy}^{-1}\mathbf{C}_{ys}\big).

Let us assume that the measurements are taken at locations that form a subset 𝒥 of the total set of nodes ℐ = {0,…,N−1}, i.e., 𝒥 = {i_1,…,i_m}. We also assume that these nodes belong to a tensor mesh, i.e., if ℐ = ×_{ν=1}^{d} I_ν and 𝒥 = ×_{ν=1}^{d} Î_ν, then Î_ν ⊂ I_ν.

Let

\mathbf{C}_{yy} = \sum_{k=1}^{r}\bigotimes_{\mu=1}^{d}\mathbf{C}_{k\mu}.

Again, we use low-rank tensor solvers, this time to solve the matrix system C_{yy} W = C_{ys}. We obtain the solution

\mathbf{W} = \mathbf{C}_{yy}^{-1}\mathbf{C}_{ys} = \sum_{j=1}^{r_w}\bigotimes_{\mu=1}^{d}\mathbf{C}_{j\mu}.

Assuming that C_{sy} ≈ Σ_{i=1}^{r_C} ⊗_{μ=1}^{d} C_{iμ}, we obtain

\mathbf{C}_{sy}\mathbf{W} = \mathbf{C}_{sy}\mathbf{C}_{yy}^{-1}\mathbf{C}_{ys} = \sum_{i=1}^{r_C}\bigotimes_{\nu=1}^{d}\mathbf{C}_{i\nu}\,\sum_{j=1}^{r_w}\bigotimes_{\mu=1}^{d}\mathbf{C}_{j\mu} = \sum_{i=1}^{r_C}\sum_{j=1}^{r_w}\bigotimes_{\mu=1}^{d}\mathbf{C}_{i\mu}\mathbf{C}_{j\mu} \approx \sum_{j=1}^{r_0}\bigotimes_{\mu=1}^{d}\tilde{\mathbf{C}}_{j\mu},

where r_0 > 0 is the new rank after a rank-truncation procedure and C̃_{jμ} are the new factors. Let

\mathbf{C}_{ss} = \sum_{i=1}^{r_s}\bigotimes_{\mu=1}^{d}\mathbf{C}_{i\mu}.

The conditional covariance is then

\mathbf{C}_{ss|y} = \sum_{i=1}^{r_s}\bigotimes_{\mu=1}^{d}\mathbf{C}_{i\mu} - \sum_{i=1}^{r_0}\bigotimes_{\mu=1}^{d}\tilde{\mathbf{C}}_{i\mu} = \sum_{i=1}^{r_s+r_0}\bigotimes_{\mu=1}^{d}\widehat{\mathbf{C}}_{i\mu},

where Ĉ_{iμ} = C_{iμ} for 1 ≤ i ≤ r_s and Ĉ_{iμ} = −C̃_{(i−r_s)μ} for r_s < i ≤ r_s + r_0.

4.4 Example: Separable Covariance Matrices

Let

\operatorname{cov}(\mathbf{x},\mathbf{y}) = \exp\big(-|\mathbf{x}-\mathbf{y}|^2\big)

be the Gaussian covariance function, where x = (x_1,…,x_d) and y = (y_1,…,y_d) ∈ 𝒟 ⊂ ℝ^d. The function cov(x, y) can be written as a tensor product of one-dimensional functions:

\operatorname{cov}(\mathbf{x},\mathbf{y}) = \exp\big(-|x_1-y_1|^2\big)\cdots\exp\big(-|x_d-y_d|^2\big).

After discretization of cov(x, y), we obtain C as a rank-1 Kronecker product of one-dimensional covariance matrices, i.e.,

(4.5)\quad \mathbf{C} = \mathbf{C}_1\otimes\cdots\otimes\mathbf{C}_d.

We note that arbitrary discretization (anisotropy) can occur in any direction.

Lemma 4.8.

If the d Cholesky decompositions exist, i.e., C_i = L_i L_i^T for i = 1,…,d, then

\mathbf{C}_1\otimes\cdots\otimes\mathbf{C}_d = (\mathbf{L}_1\mathbf{L}_1^T)\otimes\cdots\otimes(\mathbf{L}_d\mathbf{L}_d^T) = (\mathbf{L}_1\otimes\cdots\otimes\mathbf{L}_d)(\mathbf{L}_1^T\otimes\cdots\otimes\mathbf{L}_d^T) =: \mathbf{L}\mathbf{L}^T,

where L := L_1 ⊗ ⋯ ⊗ L_d and L^T := L_1^T ⊗ ⋯ ⊗ L_d^T are lower- and upper-triangular matrices, respectively.

Lemma 4.8 shows that

  1. the Gaussian covariance function in dimensions d > 1 can be written as the tensor product of one-dimensional covariance functions,

  2. its Cholesky factor can be computed via Cholesky factors computed from one-dimensional covariances.

The computational complexity drops from O(N log N), N = n^d, to O(d n log n), where n is the number of mesh points in a one-dimensional problem. Further research is required for non-Gaussian covariance functions.
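A small NumPy check of Lemma 4.8 for d = 2 (an illustrative sketch): the Cholesky factor of a Kronecker product is the Kronecker product of the Cholesky factors, so only the small factors ever need to be factorized.

```python
import numpy as np

rng = np.random.default_rng(3)
def spd(n):
    """A random well-conditioned symmetric positive definite matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

C1, C2 = spd(4), spd(5)
L1, L2 = np.linalg.cholesky(C1), np.linalg.cholesky(C2)
print(np.allclose(np.linalg.cholesky(np.kron(C1, C2)), np.kron(L1, L2)))   # True
```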

Lemma 4.9.

Let C = C_1 ⊗ ⋯ ⊗ C_d. If the inverse matrices C_i^{-1}, i = 1,…,d, exist, then

(\mathbf{C}_1\otimes\cdots\otimes\mathbf{C}_d)^{-1} = \mathbf{C}_1^{-1}\otimes\cdots\otimes\mathbf{C}_d^{-1}.

The computational complexity drops from O(N log N), N = n^d, to O(d n log n), where n is the number of mesh points in a one-dimensional problem.

Remark 4.10.

We assume here that we have an efficient method to invert C (e.g., FFT or hierarchical matrices) with a cost of O(N log N). If not, then the complexity drops from O(N^3) to O(dn^3) (the usual Gaussian elimination algorithm).

Lemma 4.11.

If C_i, i = 1,…,d, are covariance matrices, then we can compute log det C, where C is a separable rank-1 d-dimensional covariance matrix, as

(4.6)\quad \log\det(\mathbf{C}_1\otimes\cdots\otimes\mathbf{C}_d) = \sum_{j=1}^{d}\Big(\log\det\mathbf{C}_j\prod_{i=1,\,i\neq j}^{d} n_i\Big).

Proof.

We check for d = 2 that

\det(\mathbf{C}_1\otimes\mathbf{C}_2) = \det(\mathbf{C}_1)^{\,n_2}\det(\mathbf{C}_2)^{\,n_1}

and then apply mathematical induction. ∎

The computational cost again drops from O(N log N), N = n^d, to O(d n log n). An assumption similar to Remark 4.10 for computing det(C) also holds here.
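A short NumPy check of equation (4.6) for d = 2 (an illustrative sketch), with the log-determinants evaluated via Cholesky factors of the small matrices only:

```python
import numpy as np

rng = np.random.default_rng(4)
def spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

n1, n2 = 4, 6
C1, C2 = spd(n1), spd(n2)
logdet = lambda C: 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(C))))
lhs = logdet(np.kron(C1, C2))                 # dense check, O((n1*n2)^3)
rhs = n2 * logdet(C1) + n1 * logdet(C2)       # formula (4.6), O(n1^3 + n2^3)
print(np.isclose(lhs, rhs))                   # True
```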

Example 4.2.

Let n = 6000, d = 3, and N = 6000^3. Using MATLAB on a MacBook Pro with 16 GB RAM, the time required to set up the matrices C_1, C_2, and C_3 is 11 seconds; it takes 4 seconds to compute L_1, L_2, and L_3. The large matrices C and L are never constructed (i.e., the Kronecker product is never calculated).

Example 4.3.

In previous work [43] we used the hierarchical matrix technique to approximate C_i and its Cholesky factor L_i for n = 2·10^6 in 2 minutes. Here, we combine the hierarchical matrix technique and the Kronecker tensor product. Assuming C = C_1 ⊗ ⋯ ⊗ C_d, we approximate C of size N = (2·10^6)^d in 2d minutes.

Lemma 4.12.

Let C be the same as in equation (4.5), and let Lemma 4.4 hold. Then we apply the property in equation (4.6) to obtain a tensor approximation of the log-likelihood:

(4.7)\quad \tilde{\mathcal{L}} = -\prod_{\nu=1}^{d} n_\nu\,\log(2\pi) - \sum_{j=1}^{d}\Big(\log\det\mathbf{C}_j\prod_{i=1,\,i\neq j}^{d} n_i\Big) - \sum_{i=1}^{r}\sum_{j=1}^{r}\prod_{\nu=1}^{d}\big(\mathbf{u}_{i,\nu}^T\mathbf{u}_{j,\nu}\big).

Equation (4.7) shows one disadvantage of the Gaussian log-likelihood function in high dimensions: the log-likelihood grows exponentially with d, like n^d.

5 Conclusion

In this work, we demonstrated that the basic functions and operators used in spatial statistics may be represented in rank-structured tensor formats and that the error of this representation exhibits exponential decay with respect to the tensor rank. We applied the Tucker and canonical tensor decompositions to a family of Matérn-type and Slater-type functions with varying parameters and demonstrated numerically that their approximations exhibit exponentially fast convergence. A low-rank tensor approximation of the Matérn covariance function and of its Fourier transform was considered. We separated the radial basis functions using the Laplace transform to prove the existence of such low-rank approximations, and applied the sinc quadrature method to estimate the tensor ranks and accuracy.

We also demonstrated how to compute diag(C), trace(C), the matrix-vector product, Kriging operations, and the geostatistical optimal design in a low-rank tensor format at a linear cost. For a matrix C of size N × N, N = 6000^3, and of tensor rank 1, we were able to compute the Cholesky factorization in 15 seconds. We also computed the Tucker approximation of the three-dimensional Slater function e^{-‖x‖} on a grid with 513^3 points and Tucker ranks r = 1, 2,…,10. Furthermore, we demonstrated how to compute trace(C) for N = n^d = 1000^{1000}. This might be useful in machine learning. In this paper, operations such as computing the Cholesky factorization, inverse, and determinant have been implemented for rank-1 tensors (e.g., the Gaussian covariance has tensor rank 1). These formulas could be useful for developing successive rank-1 updates in greedy algorithms. Further investigations are needed to represent these quantities for ranks higher than one.

Additionally, in Section 3.7 we studied the influence of the parameters of the Matérn covariance function on the tensor ranks (Figures 7 and 9). We observed (see Figure 10) that the dependence of the tensor ranks on the parameters of the Matérn covariance function is very weak, and the ranks grow slowly. In this paper, we also highlighted that big-data statistical problems can be treated effectively by using special low-rank tensor techniques.

Funding statement: The research reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST).

References

[1] S. Ambikasaran, J. Y. Li, P. K. Kitanidis and E. Darve, Large-scale stochastic linear inversion using hierarchical matrices, Comput. Geosci. 17 (2013), no. 6, 913–927. 10.1007/s10596-013-9364-0Search in Google Scholar

[2] J. Ballani and D. Kressner, Sparse inverse covariance estimation with hierarchical matrices, preprint (2015), http://sma.epfl.ch/~anchpcommon/publications/quic_ballani_kressner_2014.pdf. Search in Google Scholar

[3] C. Bertoglio and B. N. Khoromskij, Low-rank quadrature-based tensor approximation of the Galerkin projected Newton/Yukawa kernels, Comput. Phys. Commun. 183 (2012), no. 4, 904–912. 10.1016/j.cpc.2011.12.016Search in Google Scholar

[4] S. Börm and J. Garcke, Approximating gaussian processes with H2-matrices, Proceedings of 18th European Conference on Machine Learning—ECML 2007, Lecture Notes in Artificial Intelligence 4701, Springer, Berlin (2007), 42–53. 10.1007/978-3-540-74958-5_8Search in Google Scholar

[5] S. F. Boys, G. B. Cook, C. M. Reeves and I. Shavitt, Automatic fundamental calculations of molecular structure, Nature 178 (1956), 1207–1209. 10.1038/1781207a0Search in Google Scholar

[6] D. Braess, Nonlinear Approximation Theory, Springer Ser. Comput. Math. 7, Springer, Berlin, 1986. 10.1007/978-3-642-61609-9Search in Google Scholar

[7] J.-P. Chilès and P. Delfiner, Geostatistics, Wiley Ser. Probab. Stat., John Wiley & Sons, New York, 1999. 10.1002/9780470316993Search in Google Scholar

[8] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, Wiley, New York, 2002. 10.1002/0470845899Search in Google Scholar

[9] S. De Iaco, S. Maggio, M. Palma and D. Posa, Toward an automatic procedure for modeling multivariate space-time data, Comput. Geosci. 41 (2011), 10.1016/j.cageo.2011.08.008. 10.1016/j.cageo.2011.08.008Search in Google Scholar

[10] L. De Lathauwer, B. De Moor and J. Vandewalle, A multilinear singular value decomposition, SIAM J. Matrix Anal. Appl. 21 (2000), no. 4, 1253–1278. DOI: 10.1137/S0895479896305696.

[11] S. Dolgov, B. N. Khoromskij, A. Litvinenko and H. G. Matthies, Computation of the response surface in the tensor train data format, preprint (2014), https://arxiv.org/abs/1406.2816.

[12] S. Dolgov, B. N. Khoromskij, A. Litvinenko and H. G. Matthies, Polynomial chaos expansion of random coefficients and the solution of stochastic partial differential equations in the tensor train format, SIAM/ASA J. Uncertain. Quantif. 3 (2015), no. 1, 1109–1135. DOI: 10.1137/140972536.

[13] S. Dolgov, B. N. Khoromskij and D. Savostyanov, Superfast Fourier transform using QTT approximation, J. Fourier Anal. Appl. 18 (2012), no. 5, 915–953. DOI: 10.1007/s00041-012-9227-4.

[14] P. A. Finke, D. J. Brus, M. F. P. Bierkens, T. Hoogland, M. Knotters and F. De Vries, Mapping groundwater dynamics using multiple sources of exhaustive high resolution data, Geoderma 123 (2004), no. 1, 23–39. DOI: 10.1016/j.geoderma.2004.01.025.

[15] R. Furrer and M. G. Genton, Aggregation-cokriging for highly multivariate spatial data, Biometrika 98 (2011), no. 3, 615–631. DOI: 10.1093/biomet/asr029.

[16] I. P. Gavrilyuk, W. Hackbusch and B. N. Khoromskij, Data-sparse approximation to a class of operator-valued functions, Math. Comp. 74 (2005), no. 250, 681–708. DOI: 10.1090/S0025-5718-04-01703-X.

[17] I. P. Gavrilyuk, W. Hackbusch and B. N. Khoromskij, Hierarchical tensor-product approximation to the inverse and related operators for high-dimensional elliptic problems, Computing 74 (2005), no. 2, 131–157. DOI: 10.1007/s00607-004-0086-y.

[18] L. Grasedyck, D. Kressner and C. Tobler, A literature survey of low-rank tensor approximation techniques, GAMM-Mitt. 36 (2013), no. 1, 53–78. DOI: 10.1002/gamm.201310004.

[19] W. Hackbusch, A sparse matrix arithmetic based on ℋ-matrices. I. Introduction to ℋ-matrices, Computing 62 (1999), no. 2, 89–108. DOI: 10.1007/s006070050015.

[20] W. Hackbusch, Tensor Spaces and Numerical Tensor Calculus, Springer Ser. Comput. Math. 42, Springer, Heidelberg, 2012. DOI: 10.1007/978-3-642-28027-6.

[21] W. Hackbusch, Hierarchical Matrices: Algorithms and Analysis, Springer Ser. Comput. Math. 49, Springer, Heidelberg, 2015. DOI: 10.1007/978-3-662-47324-5.

[22] W. Hackbusch and B. N. Khoromskij, A sparse ℋ-matrix arithmetic. II. Application to multi-dimensional problems, Computing 64 (2000), no. 1, 21–47. DOI: 10.1007/PL00021408.

[23] W. Hackbusch and B. N. Khoromskij, Low-rank Kronecker-product approximation to multi-dimensional nonlocal operators. I. Separable approximation of multi-variate functions, Computing 76 (2006), no. 3–4, 177–202. DOI: 10.1007/s00607-005-0144-0.

[24] W. Hackbusch and B. N. Khoromskij, Low-rank Kronecker-product approximation to multi-dimensional nonlocal operators. II. HKT representation of certain operators, Computing 76 (2006), no. 3–4, 203–225. DOI: 10.1007/s00607-005-0145-z.

[25] M. S. Handcock and M. L. Stein, A Bayesian analysis of Kriging, Technometrics 35 (1993), 403–410. DOI: 10.1080/00401706.1993.10485354.

[26] H. Harbrecht, M. Peters and M. Siebenmorgen, Efficient approximation of random fields for numerical applications, Numer. Linear Algebra Appl. 22 (2015), no. 4, 596–617. DOI: 10.1002/nla.1976.

[27] J. Håstad, Tensor rank is NP-complete, J. Algorithms 11 (1990), no. 4, 644–654. DOI: 10.1007/BFb0035776.

[28] M. R. Haylock, N. Hofstra, A. M. Klein Tank, E. J. Klok, P. D. Jones and M. New, A European daily high-resolution gridded data set of surface temperature and precipitation for 1950–2006, J. Geophys. Res. 113 (2008). DOI: 10.1029/2008JD010201.

[29] F. L. Hitchcock, The expression of a tensor or a polyadic as a sum of products, J. Math. Phys. 6 (1927), 164–189. DOI: 10.1002/sapm192761164.

[30] A. G. Journel and C. J. Huijbregts, Mining Geostatistics, Academic Press, New York, 1978.

[31] V. Khoromskaia, Computation of the Hartree–Fock exchange by the tensor-structured methods, Comput. Methods Appl. Math. 10 (2010), no. 2, 204–218. DOI: 10.2478/cmam-2010-0012.

[32] V. Khoromskaia and B. N. Khoromskij, Fast tensor method for summation of long-range potentials on 3D lattices with defects, Numer. Linear Algebra Appl. 23 (2016), no. 2, 249–271. DOI: 10.1002/nla.2023.

[33] B. N. Khoromskij, Structured rank-(R1,…,RD) decomposition of function-related tensors in ℝ^D, Comput. Methods Appl. Math. 6 (2006), no. 2, 194–220. DOI: 10.2478/cmam-2006-0010.

[34] B. N. Khoromskij, Tensors-structured numerical methods in scientific computing: Survey on recent advances, Chemometr. Intell. Lab. Syst. 110 (2011), no. 1, 1–19. DOI: 10.1016/j.chemolab.2011.09.001.

[35] B. N. Khoromskij, Tensor numerical methods for multidimensional PDEs: Theoretical analysis and initial applications, CEMRACS 2013—Modelling and Simulation of Complex Systems: Stochastic and Deterministic Approaches, ESAIM Proc. Surveys 48, EDP Sci., Les Ulis (2015), 1–28. DOI: 10.1051/proc/201448001.

[36] B. N. Khoromskij and V. Khoromskaia, Low rank Tucker-type tensor approximation to classical potentials, Cent. Eur. J. Math. 5 (2007), no. 3, 523–550. DOI: 10.2478/s11533-007-0018-0.

[37] B. N. Khoromskij and V. Khoromskaia, Multigrid accelerated tensor approximation of function related multidimensional arrays, SIAM J. Sci. Comput. 31 (2009), no. 4, 3002–3026. DOI: 10.1137/080730408.

[38] B. N. Khoromskij, A. Litvinenko and H. G. Matthies, Application of hierarchical matrices for computing the Karhunen–Loève expansion, Computing 84 (2009), no. 1–2, 49–67. DOI: 10.1007/s00607-008-0018-3.

[39] P. K. Kitanidis, Introduction to Geostatistics, Cambridge University Press, Cambridge, 1997. DOI: 10.1017/CBO9780511626166.

[40] T. G. Kolda, Orthogonal tensor decompositions, SIAM J. Matrix Anal. Appl. 23 (2001), no. 1, 243–255. DOI: 10.1137/S0895479800368354.

[41] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Rev. 51 (2009), no. 3, 455–500. DOI: 10.1137/07070111X.

[42] J. B. Kollat, P. M. Reed and J. R. Kasprzyk, A new epsilon-dominance hierarchical Bayesian optimization algorithm for large multiobjective monitoring network design problems, Adv. Water Res. 31 (2008), no. 5, 828–845. DOI: 10.1016/j.advwatres.2008.01.017.

[43] A. Litvinenko, HLIBCov: Parallel hierarchical matrix approximation of large covariance matrices and likelihoods with applications in parameter identification, preprint (2017), https://arxiv.org/abs/1709.08625. DOI: 10.1016/j.mex.2019.07.001.

[44] A. Litvinenko, Y. Sun, M. G. Genton and D. Keyes, Likelihood approximation with hierarchical matrices for large spatial datasets, preprint (2017), https://arxiv.org/abs/1709.04419. DOI: 10.1016/j.csda.2019.02.002.

[45] B. Matérn, Spatial Variation, 2nd ed., Lecture Notes in Statist. 36, Springer, Berlin, 1986. DOI: 10.1007/978-1-4615-7892-5.

[46] G. Matheron, The Theory of Regionalized Variables and its Applications, École des Mines, Fontainebleau, 1971.

[47] V. Minden, A. Damle, K. L. Ho and L. Ying, Fast spatial Gaussian process maximum likelihood estimation via skeletonization factorizations, Multiscale Model. Simul. 15 (2017), no. 4, 1584–1611. DOI: 10.1137/17M1116477.

[48] W. G. Müller, Collecting Spatial Data. Optimum Design of Experiments for Random Fields, 3rd ed., Contrib. Statist., Springer, Berlin, 2007.

[49] G. R. North, J. Wang and M. G. Genton, Correlation models for temperature fields, J. Climate 24 (2011), 5850–5862. DOI: 10.1175/2011JCLI4199.1.

[50] W. Nowak, Measures of parameter uncertainty in geostatistical estimation and geostatistical optimal design, Math. Geosci. 42 (2010), no. 2, 199–221. DOI: 10.1007/s11004-009-9245-1.

[51] W. Nowak and A. Litvinenko, Kriging and spatial design accelerated by orders of magnitude: Combining low-rank covariance approximations with FFT-techniques, Math. Geosci. 45 (2013), no. 4, 411–435. DOI: 10.1007/s11004-013-9453-6.

[52] D. Nychka, S. Bandyopadhyay, D. Hammerling, F. Lindgren and S. Sain, A multiresolution Gaussian process model for the analysis of large spatial datasets, J. Comput. Graph. Statist. 24 (2015), no. 2, 579–599. DOI: 10.1080/10618600.2014.914946.

[53] I. V. Oseledets, Tensor-train decomposition, SIAM J. Sci. Comput. 33 (2011), no. 5, 2295–2317. DOI: 10.1137/090752286.

[54] J. Quiñonero Candela and C. E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, J. Mach. Learn. Res. 6 (2005), 1939–1959.

[55] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, Adapt. Comput. Mach. Learn., MIT Press, Cambridge, 2006. DOI: 10.7551/mitpress/3206.001.0001.

[56] A. K. Saibaba, S. Ambikasaran, J. Yue Li, P. K. Kitanidis and E. F. Darve, Application of hierarchical matrices to linear inverse problems in geostatistics, Oil Gas Sci. Technol. Rev. IFP Energ. Nouv. 67 (2012), no. 5, 857–875. DOI: 10.2516/ogst/2012064.

[57] U. Schollwöck, The density-matrix renormalization group in the age of matrix product states, Ann. Physics 326 (2011), no. 1, 96–192. DOI: 10.1016/j.aop.2010.09.012.

[58] R. Shah and P. Reed, Comparative analysis of multiobjective evolutionary algorithms for random and correlated instances of multiobjective d-dimensional knapsack problems, European J. Oper. Res. 211 (2011), no. 3, 466–479. DOI: 10.1016/j.ejor.2011.01.030.

[59] A. K. Smilde, R. Bro and P. Geladi, Multi-Way Analysis with Applications in the Chemical Sciences, Wiley, New York, 2004. DOI: 10.1002/0470012110.

[60] G. Spöck and J. Pilz, Spatial sampling design and covariance-robust minimax prediction based on convex design ideas, Stoch. Environ. Res. Risk Assess. 24 (2010), 463–482. DOI: 10.1007/s00477-009-0334-y.

[61] M. L. Stein, J. Chen and M. Anitescu, Difference filter preconditioning for large covariance matrices, SIAM J. Matrix Anal. Appl. 33 (2012), no. 1, 52–72. DOI: 10.1137/110834469.

[62] M. L. Stein, Z. Chi and L. J. Welty, Approximating likelihoods for large spatial data sets, J. R. Stat. Soc. Ser. B Stat. Methodol. 66 (2004), no. 2, 275–296. DOI: 10.1046/j.1369-7412.2003.05512.x.

[63] F. Stenger, Numerical Methods Based on Sinc and Analytic Functions, Springer Ser. Comput. Math. 20, Springer, New York, 1993. DOI: 10.1007/978-1-4612-2706-9.

[64] Y. Sun and M. L. Stein, Statistically and computationally efficient estimating equations for large spatial datasets, J. Comput. Graph. Statist. 25 (2016), no. 1, 187–208. DOI: 10.1080/10618600.2014.975230.

[65] L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika 31 (1966), 279–311. DOI: 10.1007/BF02289464.

[66] S. M. Wesson and G. G. S. Pegram, Radar rainfall image repair techniques, Hydrol. Earth Syst. Sci. 8 (2004), no. 2, 220–234. DOI: 10.5194/hess-8-220-2004.

Received: 2017-11-21
Revised: 2018-03-12
Accepted: 2018-05-02
Published Online: 2018-07-07
Published in Print: 2019-01-01

© 2018 Walter de Gruyter GmbH, Berlin/Boston
