Adaptive Image Compression via Optimal Mesh Refinement

The JPEG algorithm is a de facto standard for image compression. We investigate whether adaptive mesh refinement can be used to optimize the compression ratio and propose a new adaptive image compression algorithm. We prove that, with high probability, it produces a quasi-optimal subdivision grid for a given error norm. This subdivision can be stored with very little overhead and thus leads to an efficient compression algorithm. We demonstrate experimentally that the new algorithm can achieve better compression ratios than standard JPEG compression with no visible loss of quality on many images. The mathematical core of this work shows that Binev's optimal tree approximation algorithm is applicable to image compression with high probability, when we assume small additive Gaussian noise on the pixels of the image.


Introduction
The JPEG algorithm has been one of the standard image compression algorithms for decades. It decomposes any given image into a fixed mesh of 8 × 8 pixel blocks. On each of these blocks, the image is transformed via the discrete 2D cosine transform. The resulting sequence of coefficients is then quantized by use of a quantization matrix and a rounding step, which is designed in such a way that higher frequencies are stored with less precision (or completely omitted). This leads to a significant storage space reduction compared to a bitmap representation of the image. Particularly for photographs, which typically do not exhibit sharp color or brightness changes, this compression method is very effective. The goal of the present work is to generate the background mesh adaptively in an optimal way, instead of using a fixed block size of 8 × 8, in order to improve the compression rate. The mesh refinement algorithms are inspired by adaptive mesh refinement for the approximation of partial differential equations, particularly the adaptive finite element method.
Adaptive mesh refinement for partial differential equations was first mathematically analyzed for the finite element method in the 1980s and 1990s. I. M. Babuška initiated a wave of mathematical research by developing the first a posteriori error estimators [BM87] for the finite element method. This opened a new field of numerical analysis and led to tremendous research activity; we only mention the books [AO00, Ver13, Rep08]. Dörfler [Dör96] gave a first rigorous strategy for mesh refinement based on error estimators in 2D more than a decade later, and after another five years Morin, Nochetto, and Siebert [MNS00] presented the first unconditional convergence proof.
Optimality of the adaptive strategy, i.e., the fact that asymptotically the optimal meshes are generated, was first proven by Babuška and Vogelius in 1D [BV84] and in dimensions ≥ 2 by Binev, Dahmen, and DeVore [BDD04] in 2004 for adaptive FEM for the Poisson model problem. Building on that, Stevenson [Ste07] and Cascon, Kreuzer, Nochetto, and Siebert [CKNS08] simplified the algorithms and extended the problem class. We refer to [CFPP14, Fei22] for an overview of adaptive mesh refinement. The present work is inspired by adaptive tree approximation, an important building block of adaptive algorithms (particularly for time-dependent problems, when mesh coarsening is necessary) developed in [BD04] and extended to the hp-setting in [Bin18]. Adaptive tree approximation produces meshes on which a given target function can be approximated with a near-optimal accuracy vs. cost ratio. We use this algorithm for JPEG compression by approximating the target image (which can be seen as a function that is constant on each pixel) on a mesh of rectangles. On each of these rectangles, we perform the standard quantization of the image via the discrete 2D cosine transform, as is done for classical JPEG compression.
We propose two algorithms: A first one takes a given image I and produces a vector which encodes the compressed image. A second, decoding algorithm produces an approximation I_app to the original image by (almost) reversing the compression steps. The algorithms satisfy the following properties:
1. I_app is an approximation of I, i.e., the error ‖I − I_app‖ stays below a preset tolerance for a certain norm.
2. The necessary storage space for the encoded vector is (near-)minimal.
3. The run time of both the encoding and the decoding algorithm is (near-)optimal.
Several other adaptive versions of JPEG can be found in the literature: In [EAS14, RW96], the compression quality on each 8 × 8 subgrid is chosen adaptively. In [TTQ09], an adaptive grid is chosen via a greedy algorithm based on interpolation error, and this algorithm is used in [UJQT12] to compress images; however, no (near-)optimality of the algorithm is proved (or can be expected).
The remainder of this work is structured as follows: In Section 2, we establish the adaptive algorithm for general norms. In Section 3, we apply the algorithm for the L²-norm and derive a near-optimality result. Section 4 discusses effective storage of the compressed image, and a final Section 5 tests the algorithm on realistic images.

The JPEG compression algorithm
The JPEG compression algorithm was first proposed in the 1990s (standardized in [Sta94]) and has been a de facto standard for image compression ever since.
The compression algorithm transforms the image into the YCbCr color space, which separates the colors into blue-difference chroma, red-difference chroma, and a gamma-corrected luminance channel. Since human eyes are more susceptible to changes in luminance than in color, each of the channels is compressed with different quality parameters. However, the basic algorithm is the same on all of the channels: The image is split into 8 × 8 blocks of pixels on which the discrete cosine transform is applied, resulting in an 8 × 8 matrix containing the coefficients. Each entry of this matrix is divided by the corresponding entry of the so-called quantization matrix of the same size. This matrix is predefined in the JPEG standard and controls the compression ratio of the algorithm. The result of the division is then rounded to the nearest integer, which usually produces many zeros in the higher frequency coefficients. The resulting sequence of integers is further compressed via a lossless run-length encoding algorithm. Note that the rounding step is the only irreversible step of the compression algorithm and hence the reason for quality loss in JPEG (see, e.g., [HLN+18] for further details and references).
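A minimal sketch of the lossy core described above (DCT, entrywise division by the quantization matrix, rounding), assuming scipy's orthonormal 2D DCT; the table Q is the standard JPEG luminance quantization table, while the function names are ours.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Standard JPEG luminance quantization table (Annex K of the standard).
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=float)

def compress_block(block: np.ndarray) -> np.ndarray:
    """DCT, divide by Q entrywise, round: rounding is the only lossy step."""
    coeffs = dctn(block - 128.0, norm="ortho")   # center values around zero
    return np.round(coeffs / Q)

def decompress_block(quantized: np.ndarray) -> np.ndarray:
    """Reverse the reversible steps; the rounding loss remains."""
    return idctn(quantized * Q, norm="ortho") + 128.0
```

For a smooth block the quantized high-frequency coefficients are all zero, which is exactly what the subsequent run-length coding exploits.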

The adaptive compression algorithm
In the following, we will develop a mathematical formulation of adaptive JPEG compression.

Representation of images
We suppose the image with h (height) times w (width) pixels is given by a function I : {1, . . . , h} × {1, . . . , w} → ℝ³, where I(i, j) corresponds to the RGB values of the pixel (i, j), which might be normalized to [0, 1] (although the precise encoding does not matter for what follows). The coordinates of the pixels are chosen such that the pixel (1, 1) is the top-left pixel and (h, w) is the bottom-right pixel of the image; hence we can think of the image as a matrix of size h × w with values in ℝ³.
The principal idea for obtaining an approximate image I_app is to divide the image into smaller elements and approximate the image on those via the classical JPEG strategy.
The aspect ratio condition is not strictly necessary, but it makes it easier to store the mesh efficiently, as only one dimension and the position of each element have to be stored. Without loss of generality, we may assume that w and h are powers of two in order to ensure that admissible elements exist. A collection 𝒯 of elements whose disjoint union is {1, . . . , h} × {1, . . . , w} is called a mesh. Corresponding to a mesh 𝒯, we consider the approximate image I_app,𝒯 = I_app computed on that mesh, where we hide the dependency on 𝒯 when clear from the context. By 𝒯₀, we denote the initial mesh which consists of only one element T₀ covering the entire image, i.e., T₀ = {1, . . . , h} × {1, . . . , w} and 𝒯₀ = {T₀}.
Finally, by I|_T, we denote the restriction of the image to a given element T. In order to produce an optimally adapted mesh, we consider a simple refinement rule via bisection, illustrated in Figure 1. Given a mesh 𝒯, refinement of an element T ∈ 𝒯 produces a new mesh 𝒯′ := (𝒯 ∖ {T}) ∪ {T₁, T₂, T₃, T₄}, where T₁, . . . , T₄ are the four children of T; i.e., we do not consider mesh closure (since the approximation problems below will be entirely local). Note that, due to the assumptions on the aspect ratio of the elements, all newly created elements are geometrically similar to T₀.
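The element and refinement bookkeeping just described can be sketched as follows; the data structure and names are ours. An element is identified by its top-left pixel and its size, and refinement replaces it by four similar children, with no mesh closure required.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    row: int      # top-left pixel of the element, 1-based as in the text
    col: int
    height: int   # powers of two, fixed aspect ratio
    width: int

def refine(mesh: set, elem: Element) -> set:
    """Replace `elem` by its four children (two bisections, cf. Fig. 1)."""
    h2, w2 = elem.height // 2, elem.width // 2
    children = {
        Element(elem.row + dr, elem.col + dc, h2, w2)
        for dr in (0, h2) for dc in (0, w2)
    }
    return (mesh - {elem}) | children
```

Since refinement is purely local, a mesh is adequately represented by a flat set of elements rather than a tree.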
A mesh 𝒯 is called coarser than a mesh 𝒯′ if 𝒯′ can be obtained by applying refinement steps to 𝒯. In this case we also say 𝒯′ is finer than 𝒯 and write 𝒯 < 𝒯′ or 𝒯′ > 𝒯. With this notion we further define the set T of all meshes obtainable from 𝒯₀ by refinement, as well as T_N := {𝒯 ∈ T : #𝒯 ≤ N} for N ∈ ℕ. Note that while two meshes might not be comparable, the refinement rule of our choice still gives local nestedness. The following result is well-known and repeated for the convenience of the reader: For 𝒯 < 𝒯′ and all T ∈ 𝒯, T′ ∈ 𝒯′ with T ∩ T′ ≠ ∅, there holds T′ ⊆ T. (1)
Proof. If (1) is not true, there must exist some element T′ ∈ 𝒯′ with T ∩ T′ ≠ ∅ and T′ ⊄ T. Since 𝒯 and 𝒯′ are both generated starting from 𝒯₀, we conclude T ⊆ T′, which contradicts 𝒯 < 𝒯′.

The adaptive Algorithm
We follow the strategy of JPEG explained in Section 1.1 and convert the image to the YCbCr color basis, resulting in three image components I_Y, I_Cb, and I_Cr. We downsample the Cb and Cr components to half the image size by computing means of 2 × 2 blocks of pixels. The adaptive algorithm below (Algorithm 2) is then applied to each of the components.
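This preprocessing can be sketched in a few lines. The conversion coefficients below are the standard full-range BT.601 values used by JFIF; the paper does not specify the exact variant, so treat them as an assumption.

```python
import numpy as np

def rgb_to_ycbcr(img: np.ndarray) -> np.ndarray:
    """img: (h, w, 3) RGB array with values in [0, 255].
    Full-range BT.601 coefficients as in JFIF (assumed variant)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.stack([y, cb, cr], axis=-1)

def downsample_2x2(channel: np.ndarray) -> np.ndarray:
    """Replace each 2x2 block of pixels by its mean (h, w even)."""
    h, w = channel.shape
    return channel.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```

The chroma channels Cb and Cr are downsampled; the luminance channel Y is kept at full resolution, as in standard JPEG.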
The general strategy of the adaptive algorithm will be to start with 𝒯₀ and to iteratively refine the mesh until the compression error in the given norm is sufficiently small. To that end, we require local errors which steer the refinement.

Definition 1. Let ‖·‖ be a norm of our choice on the space of images of size h × w, I an image of size h × w, and 𝒯 a mesh. We define: (i) the local error E(T) for each element T ∈ 𝒯, measuring the approximation error on T, (ii) the global error E(𝒯), and (iii) the best approximation error for N elements, E_N, i.e., the smallest global error achievable with a mesh of at most N elements.

As the adaptive algorithm below is essentially a greedy algorithm, these straightforward definitions will not lead to optimal results for most error norms ‖·‖. It is necessary to consider a modified error, first proposed and analyzed in [BD04].
We denote the modified local error by Ẽ and set Ẽ(T₀) = E(T₀) for the initial element T₀. Let T ∈ T and assume that Ẽ(T) is already defined. We then define Ẽ(T_j) for the children T_j, j = 1, . . . , 4, of T; in particular, Ẽ is constant among siblings. The general idea behind this definition is that, through the relation to the parent element introduced here, the modified error implicitly depends on the level of the element, i.e., on how many refinements were necessary to produce the element. This forces a non-local behavior of the algorithm which would be impossible with a standard greedy algorithm and turns out to be key to showing optimality of the procedure [BD04].
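The resulting greedy loop can be sketched as follows. The sibling update used below (a harmonic-mean-type combination of the parent's modified error with the sum of the children's true errors, constant among siblings) is our reading of the definition in [BD04] and should be checked against the original; all names are ours.

```python
import heapq

def adaptive_refine(root, error, split, tol):
    """Greedy tree refinement driven by the modified error.
    root: initial element; error(T): local error of T;
    split(T): list of children of T (empty if T is not refinable)."""
    emod = {root: error(root)}
    leaves = {root}
    heap = [(-emod[root], 0, root)]  # max-heap via negated keys
    counter = 1                      # tie-breaker for the heap
    while heap and sum(error(T) ** 2 for T in leaves) ** 0.5 > tol:
        _, _, T = heapq.heappop(heap)
        children = split(T)
        if not children:
            continue                 # element cannot be refined further
        leaves.discard(T)
        leaves.update(children)
        s = sum(error(c) for c in children)
        # assumed [BD04]-type update, constant among siblings:
        val = s * emod[T] / (s + emod[T]) if s + emod[T] > 0 else 0.0
        for c in children:
            emod[c] = val
            heapq.heappush(heap, (-val, counter, c))
            counter += 1
    return leaves
```

Because the modified error of a child is always smaller than both the parent's modified error and the children's true errors, deep subtrees are penalized, which is what rules out the pathological behavior of the plain greedy strategy.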
We are now in the position to formulate the adaptive algorithm.
Remark 1. Note that the algorithm does not produce elements T which are smaller than 8 × 8, i.e., it does not produce a grid finer than the standard JPEG grid.
The main goal of the following section will be to prove near-optimality of Algorithm 2 by employing the framework developed in [BD04].
Definition 2. Let an algorithm compute an approximation of a given image I on some (adaptively refined) mesh 𝒯_N ∈ T_N. We call the algorithm near-optimal (w.r.t. the number of refinements) if there exist positive constants c₁, C₂ such that for all n, N ∈ ℕ with n ≤ c₁N there holds E(𝒯_N) ≤ C₂ E_n. The constants are universal, i.e., they do not depend on the image I.

Adaptive JPEG compression in the 𝐿 2 -norm
In this section, we employ Algorithm 2 with ‖·‖ denoting the L²-norm weighted with (hw)^(−1/2), i.e., the inverse square root of the total number of pixels. It is a natural question whether the L²-norm appropriately reflects human tolerance for quality loss in images (the answer is likely negative; see, e.g., [WB06]), and we refer to Section 3.3 and Section 5 for further discussion and experiments.

Image compression on mesh elements
The approximate image on a given mesh 𝒯 is produced similarly to the original JPEG algorithm, i.e., we use the discrete cosine transform (DCT) and a quantization matrix Q to quantize the image; we use the quantization matrix defined in the JPEG standard [Sta94]. In the following, we require some operators to formalize the adaptive compression: Let T be an element of size h_T × w_T. We define the restriction TL_T to the top-left 8 × 8 block, as well as its right-inverse, an embedding from ℝ^(8×8) into ℝ^(h_T×w_T) obtained by adding zeros. Furthermore, let DCT_T : ℝ^(h_T×w_T) → ℝ^(h_T×w_T) denote the 2D discrete cosine transform and IDCT_T its inverse, with the usual normalization factors α(k) = 1/√2 if k = 0 and α(k) = 1 otherwise. Note that these operators depend solely on the size of T and are oblivious to the position of T within the image. Since no elements T of size less than 8 × 8 occur in Algorithm 2, TL_T is always well-defined. In the following, let the symbols ⊙ and ./ denote entrywise multiplication and division for matrices of the same size. For a given image I and a mesh 𝒯, we compute an approximate image I_app,𝒯^final via the quantized DCT coefficients as in (5). However, for the adaptive procedure we ignore the quantization and only compute the approximate image this way for the final mesh. For a mesh 𝒯′ < 𝒯, which occurs during the adaptive procedure, we compute an approximate image I_app,𝒯′ by simply restricting the image to the first 8 × 8 DCT coefficients on each element T ∈ 𝒯′, i.e., as in (6). In the next sections we will only use (6) and return to (5) when the goal is to efficiently store and reconstruct the approximate image. The local and global errors defined in (2)-(3) are evaluated accordingly.

The refinement property
The key to optimality of Algorithm 2 is the so-called refinement property (in the literature also called subadditivity of the estimator): There exists a universal constant C₀ > 0 such that for every element T with children T₁, . . . , T₄ there holds (∑_{j=1}^{4} E(T_j)²)^(1/2) ≤ C₀ E(T). (7)
Essentially, the refinement property (7) states that refinement of the mesh does not significantly increase the approximation error. Unfortunately, it turns out that this is not true in general in our application, as demonstrated in Figure 2. However, we show that, under some assumptions on the distribution of images, there is a high probability that the refinement property (7) holds for a given image.
Proof. We use the well-known identity Φ_{k,λ}(x) = 1 − Q_{k/2}(√λ, √x), where Q denotes the Marcum Q-function. According to [Esp68, Equation (2)], the derivative can be represented via the Marcum Q-function as well. Together with [ATM09, Equation (6)], this leads to a representation in terms of the modified Bessel function of the first kind. This shows ∂_λ Φ_{k,λ}(x) ≤ 0 for all 0 ≤ x < ∞ and concludes the proof.

Fig. 2: The green lines mark the block of 8 × 8 coefficients which do not contribute to E(T) or E(T_j), respectively. For this example, the right-hand side of (7) is zero (no non-zero coefficient outside the green block in the left image), while the left-hand side is not.
Theorem 4. Let I ∈ ℝ^(h×w) denote the original image and let Ĩ ∈ ℝ^(h×w) denote a random distortion obtained by adding εX_{ij} to the (i, j)-th pixel, where the X_{ij} are i.i.d. standard normal random numbers. Then, for all ε ≥ 0, the refinement property (7) holds for Ĩ with a probability p_ref quantified in the proof below.

As we will see below, the operators appearing in the proof encode the difference between the error on the coarse element T and the fine elements T_j. As compositions of orthogonal projections, embeddings, and (inverse) discrete cosine transforms, these operators are bounded uniformly in T and T_j. Given T and T_j, we may compute their norm exactly by calculating the corresponding matrix norm. Obviously, the norm is at most 1, but we can track it numerically for various realistic image sizes (see Fig. 3 for the results). This suggests that 0.13 is a realistic upper bound and improves the constant in the statement compared to the naive bound 1. Note that all the remaining arguments of the proof remain valid with the naive bound.

We aim to argue that the refinement property (7) is satisfied for most images. By assumption, we have Ĩ(i, j) = I(i, j) + εX_{ij} for i.i.d. standard normal variables X_{ij}. Note that Ĩ is a random variable and the law of its L²-norm follows a noncentral chi-squared distribution. The local error E(T_j) can be identified with the difference of Ĩ|_{T_j} and the corresponding approximate image on T_j. Therefore, we estimate the errors on all children of T via the triangle inequality. Next, we show that the resulting bound holds with high probability; we denote this probability by 0 ≤ p_ref ≤ 1. Note that adding the noise to I is equivalent to adding the noise to DCT_T(Ĩ), since the discrete cosine transform is just a change of orthonormal bases. Thus, it suffices to estimate the probability that the DCT coefficients outside the retained 8 × 8 block, i.e., those with indices in M := T ∖ {1, . . . , 8}², are sufficiently small. Without loss of generality, we may assume that the image is normalized, which leads to a bound involving two independent random variables, one built from the retained coefficients with indices in {1, . . . , 8}² and one built from the discarded coefficients with indices in M, with respective density and cumulative distribution functions. The claim then follows from Lemma 3 together with the fact that M contains at least 32 · 16 − 64 = 448 terms. The latter follows from the fact that Algorithm 2 only refines elements with h_T, w_T ≥ 16: for h_T = w_T = 16, the left-hand side of (7) is zero by definition, and since w_T, h_T are powers of two by assumption, the next smallest refinable element T satisfies max{h_T, w_T} ≥ 32, such that #M ≥ 32 · 16 − 64 = 448. Combining the estimates, we obtain the claimed bound with probability p_ref. This concludes the proof.
The cdfs of noncentral chi-squared distributions can be found, e.g., in [JH99] and can be calculated numerically, e.g., in Matlab via the command ncx2cdf. This allows us to evaluate the probability bound from Theorem 4 for the realistic operator-norm bound 0.13 in place of the naive bound 1. This implies the refinement property (7) with C₀ = 4 with high probability. See Figure 3 for the behavior of the lower bound on p_ref from Theorem 4 with respect to the noise amplitude.
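For readers working in Python rather than Matlab, the same cdfs are available via scipy.stats.ncx2 (df: degrees of freedom, nc: noncentrality parameter); the parameter values below are purely illustrative and not taken from the bound.

```python
from scipy.stats import ncx2

# Illustrative parameters: 448 degrees of freedom (the minimal number
# of discarded coefficients derived above) and an arbitrary
# noncentrality; not the values of the actual probability bound.
df, nc = 448, 2.5
p_low = ncx2.cdf(300.0, df, nc)   # P(X <= 300) for X ~ ncx2(df, nc)
p_high = ncx2.cdf(600.0, df, nc)  # P(X <= 600)
```

The cdf is monotone in its argument and, by Lemma 3, monotonically decreasing in the noncentrality parameter, which is exactly what the proof of Theorem 4 exploits.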
Remark 2. We note that Theorem 4 implies that a noise amplitude of 0.0075 ensures a probability of 99.99% of satisfying the refinement property (7) with C₀ = 4. We also emphasize that such noise is not noticeable to the human eye, as demonstrated in Figure 4 even for an amplitude of 0.015. This suggests that images which do not satisfy (7) are the rare exception. The following result states the optimality of Algorithm 2.
Proof. Since the refinement property (7) is true, the statement follows from [BD04, Theorem 5.2], where the number of children is 4. We note that we use a slightly different definition of T_N: in [BD04], T_N denotes the set of meshes generated by exactly N refinements, while we count the number of elements. Since each refinement generates exactly three new elements, the number of refinements corresponds to (#𝒯 − #𝒯₀)/3. Thus, both definitions are equivalent and the statement above is proved.
Remark 4. In Algorithm 2, selecting the largest Ẽ(T) may require a sorting procedure with complexity O(N log N). To keep the number of operations of order O(N), we can use binary bins instead of sorting. The result and proof can be adapted accordingly, with only the absolute constant in the estimate of the theorem changing.
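The binning idea can be sketched as follows (class and method names are ours): each element is filed under the binary exponent of its modified error, so popping from the highest occupied bin returns an element within a factor of two of the maximum, which suffices for the quasi-optimality argument.

```python
import math
from collections import defaultdict

class ErrorBins:
    """Bucket elements by the binary exponent of their error."""
    def __init__(self):
        self.bins = defaultdict(list)

    def push(self, elem, err: float):
        if err > 0:
            # frexp(err) = (m, e) with err = m * 2**e, 0.5 <= m < 1
            self.bins[math.frexp(err)[1]].append((elem, err))

    def pop_near_max(self):
        """Return some element whose error is >= (max error) / 2."""
        top = max(self.bins)            # highest occupied exponent
        elem, err = self.bins[top].pop()
        if not self.bins[top]:
            del self.bins[top]
        return elem, err
```

Push and pop are O(1) amortized (up to the bookkeeping for the highest occupied bin), so the overall selection cost stays linear in the number of refinements.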

Adaptive compression in other norms
A norm which might be better suited to capture the quality loss in images is the BV-norm. A heuristic explanation for this is the fact that the human eye is more sensitive to sharp color/brightness changes, which are not captured well by the L²-norm. The BV-norm of a function f ∈ BV ⊂ L¹((0, h] × (0, w]) is the sum of its L¹-norm and its total variation; for weakly differentiable functions f ∈ W^{1,1}, the total variation coincides with the L¹-norm of the gradient. However, in the present setting, the images I and I_app represent functions which are constant on each pixel of the image. For such a function f, the total variation reduces to a sum over the interior interfaces between pixels: with the common notation [f]|_γ = f|_□ − f|_□′ for the jump of f over the interface γ between two neighboring pixels □ and □′, the total variation is the sum of the absolute jumps |[f]|_γ| over all interior interfaces. Thus the BV-norm of the approximation error I − I_app can be evaluated exactly. The new norm induces new local and global errors which can be used to drive Algorithm 2.
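For pixelwise-constant images the total variation is straightforward to evaluate, since every interior interface between neighboring pixels has length one; a minimal sketch:

```python
import numpy as np

def total_variation(img: np.ndarray) -> float:
    """Sum of absolute jumps across all interior pixel interfaces."""
    vert = np.abs(np.diff(img, axis=0)).sum()   # jumps between rows
    horiz = np.abs(np.diff(img, axis=1)).sum()  # jumps between columns
    return float(vert + horiz)
```

Applied to the difference image, this gives the TV part of the BV-norm of I − I_app; adding the (discrete) L¹-norm of the difference completes it.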
Remark 5. Since we consider norms on finite-dimensional spaces, we immediately obtain equivalence of the L²-norm and the BV-norm. Therefore, the refinement property (7) also holds for the BV-norm for most images, and hence so does the near-optimality result of Theorem 5. However, the constants in (7) for the BV-norm depend on the norm equivalence constants, which in turn depend on the image size. A direct computation of the constant seems very difficult, as it would involve mapping properties of the discrete cosine transform as an operator on BV.

Efficient storage of the adaptive compression
The adaptive grid produced by Algorithm 2 has to be stored in order to reconstruct the compressed image.This potentially produces overhead compared to standard JPEG compression, which uses a predefined grid.However, using some simple optimizations, we demonstrate that this overhead is not significant.

Storing 𝐼 app
Algorithm 2 is applied to each of the color components c ∈ {Y, Cb, Cr} and produces a corresponding mesh 𝒯_c. As we neglected the rounding step of the JPEG compression in the adaptive algorithm, we compress the c-component of I_app for each element T ∈ 𝒯_c by:
- computing the quantized coefficients round(TL_T(DCT_T(I_c|_T)) ./ Q),
- storing their entries in the order depicted in Figure 5 in a vector, up to the last non-zero entry.
Note that the discrete cosine transform returns a matrix of coefficients of basis functions, where the coefficients towards the bottom and the right correspond to basis functions of higher frequency. Typical images usually have small coefficients for basis functions of high frequency, which means that after rounding they become zero and are dropped.
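The serialization step can be sketched as follows, assuming that the order of Figure 5 is JPEG's usual zig-zag traversal (an assumption on our part, since the figure is not reproduced here).

```python
import numpy as np

def zigzag_indices(n: int = 8):
    """Anti-diagonal traversal with alternating direction (JPEG order)."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def serialize(quantized: np.ndarray) -> list:
    """Read an 8x8 coefficient block in zig-zag order and drop the
    trailing zeros, as described above."""
    seq = [int(quantized[i, j]) for i, j in zigzag_indices()]
    while seq and seq[-1] == 0:
        seq.pop()
    return seq
```

Since high-frequency coefficients sit at the end of the zig-zag order and are mostly rounded to zero, truncation after the last non-zero entry already removes most of the data before the run-length step.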
To reconstruct I_app from these vectors, we also need to store the width w of I_app, the ratio h/w of I_app, as well as the width of each T ∈ 𝒯_c. Furthermore, we need some separator bits, denoted by sep, in order to mark the start of the next element. Surprisingly, it is not necessary to store the location of the element T within the image. To that end, we introduce a total order on the elements T within a mesh 𝒯_c. Recall that an element T is given by the position of its top-left pixel and its width. Therefore, we can simply order the top-left pixels of the elements: we define the lexicographic order < on the pixels (i, j) of the image and sort the elements of 𝒯_c such that their top-left pixels are ordered with respect to < in an ascending fashion.
A further lossless run-length compression step [Hop21] is applied to the coefficient vectors before storing the data. This is similar to standard JPEG compression and removes the typically many zeros appearing in them. Regarding the search for the minimal pixel in Algorithm 6 (see Remark 6), we notice that the minimum is always located next to the corner pixels of previously reconstructed elements T (precisely, next to the top-right and bottom-left pixels, see Figure 6). Updating a list of those candidate pixels reduces the search effort significantly.

Practical performance of the adaptive compression algorithm
We perform a limited number of experiments with Algorithm 2 in order to test its performance on the realistic images depicted in Figure 7. As shown in Figures 9-10, we choose the tolerance in Algorithm 2 such that no quality difference between the original and the compression is immediately visible. Figure 8 compares the storage size of the original JPEG file and the adaptive recompression with Algorithm 2. We observe that the storage space for larger JPEG files can be reduced significantly with adaptive compression, while smaller files show little improvement. Since Algorithm 2 does not produce elements smaller than 8 × 8, the finest obtainable mesh is that of the standard JPEG algorithm. Hence, heuristically, the required storage space for any given image is at most about the same as for JPEG, although a rigorous statement on the worst case is non-trivial.

Conclusion
We demonstrated the feasibility of adaptive JPEG compression both in practical experiments and in mathematical optimality results. From an efficiency standpoint, it would make sense to start with a finer uniform grid 𝒯₀. Both Algorithm 2 and Algorithm 6 parallelize in a straightforward fashion over the elements T ∈ 𝒯₀ of the initial grid.
The method can be extended in several other ways: A first idea is to apply the algorithm to videos instead of images. This would require running Algorithm 2 on a mesh consisting of three-dimensional cuboid elements T (each cuboid can be split into eight children) and approximating I|_T by a discrete 3D cosine transform (or a variant which treats the time coordinate separately).
Another extension of the method applies Binev's hp-approximation algorithm [Bin18]. Instead of increasing the polynomial degree (as is done in the finite element method), p-refinement could modify the quantization matrix locally on each element T, or it could alter the number of frequencies that are stored for each T.
Finally, it would be interesting to replace the norm ‖ • ‖ in the local errors by some more sophisticated method which judges subjective quality loss of images.

Fig. 1 :
Fig. 1: Elements are refined into four children by two bisections of their sides.

Fig. 3 :
Fig. 3: Left: The operator norm for various element sizes. Right: Upper bound on 1 − p_ref for different noise amplitudes.

Funding:
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) -Project-ID 258734477 -SFB 1173 as well as the Austrian Science Fund (FWF) under the special research program Taming complexity in PDE systems (grant SFB F65).

Fig. 9 :
Fig. 9: Comparison of the original image (top) with two outputs of the adaptive algorithm (middle) for small and large tolerance. Closeup of the original (bottom-left) and the recompression (bottom-right) used in Figure 8. Image from unsplash.com by artist Kadyn Pierce.

Fig. 10 :
Fig. 10: Comparison of the original image (top) with two outputs of the adaptive algorithm (middle) for small and large tolerance. Closeup of the original (bottom-left) and the recompression (bottom-right) used in Figure 8. Image from unsplash.com by artist Willian Justen de Vasconcellos.

Fig. 11 :
Fig. 11: Comparison of the original image (top) with the output of the adaptive algorithm with respect to the L²-norm (middle-left) and the BV-norm (middle-right). On the bottom row, a detail of the L² output (left) and the BV output (right) is compared, which suggests that the BV-norm may be better suited. The respective threshold parameters are chosen such that both outputs take up almost the same storage space. Image from unsplash.com by artist Kevin Charit.
Remark 6. Searching for the minimal pixel in Step 2 of Algorithm 6 is quite costly if done in a naive way.