Approximating morphological operators with part-based representations learned by asymmetric auto-encoders

Abstract: This paper addresses the issue of building a part-based representation of a dataset of images. More precisely, we look for a non-negative, sparse decomposition of the images on a reduced set of atoms, in order to unveil a morphological and explainable structure of the data. Additionally, we want this decomposition to be computed online for any new sample that is not part of the initial dataset. Therefore, our solution relies on a sparse, non-negative auto-encoder, where the encoder is deep (for accuracy) and the decoder shallow (for explainability). This method compares favorably to the state-of-the-art online methods on two benchmark datasets (MNIST and Fashion MNIST) and on a hyperspectral image, according to classical evaluation measures and to a new one we introduce, based on the equivariance of the representation to morphological operators.


Introduction
Mathematical morphology is strongly related to the problem of data representation. Applying a morphological filter can be seen as a test on how well the analyzed element is represented by the set of invariants of the filter. For example, applying an opening by a structuring element B tells how well a shape can be represented by the supremum of translations of B. The morphological skeleton [18,24] is a typical example of description of shapes by a family of building blocks, classically homothetic spheres. It provides a disjunctive decomposition where components (for example, the spheres) can only contribute positively as they are combined by supremum. A natural question is the optimality of this additive decomposition according to a given criterion, for example its sparsity: the number of components needed to represent an object. Finding a sparse disjunctive (or part-based) representation has at least two important features: first, it allows saving resources such as memory and computation time in the processing of the represented object; secondly, it provides a better understanding of this object, as it reveals its most elementary components, hence operating a dimensionality reduction that can alleviate the issue of model over-fitting. Such representations are also believed to be the ones at stake in human object recognition [25].
Similarly, the question of finding a sparse disjunctive representation of a whole database is also of great interest and will be the main focus of the present paper. More precisely, we will approximate such a representation by a non-negative, sparse linear combination of non-negative components, and we will call this representation additive. Given a large set of images, our concern is then to find a smaller set of non-negative image components, called a dictionary, such that any image of the database can be expressed as an additive combination of the dictionary components. As we will review in the next section, this question lies at the crossroad of two broader topics known as sparse coding and dictionary learning [17].
Besides a better understanding of the data structure, our approach is also more specifically linked to mathematical morphology applications. Inspired by recent work [1,28], we look for image representations that can be used to efficiently calculate approximations to morphological operators. The main goal is to be able to apply morphological operators to massive sets of images by applying them only to the reduced set of dictionary images. This is especially relevant in the analysis of remote sensing hyperspectral images, where different kinds of morphological decomposition, such as morphological profiles [19], are widely used. For reasons that will be explained later, sparsity and non-negativity are sound requirements to achieve this goal. What is more, whereas the representation process can be learned offline on a training dataset, we need to compute the decomposition of any new sample online. Hence, we take advantage of the recent advances in deep, sparse and non-negative auto-encoders to design a new framework able to learn part-based representations of an image database, compatible with morphological processing. To that extent, this work is part of the resurgent research line investigating interactions between deep learning and mathematical morphology [9,22,23,27,32]. However, with respect to these studies, which focus mainly on introducing morphological operators in neural networks, the present paper addresses a different question.
The existing work on non-negative sparse representations of images is reviewed in Section 2, which stands as a baseline and motivation of the present study. Then we present in Section 3 new results about part-based approximations of morphological operators. The proposed model for part-based representation learning is described in Section 4, a preliminary version of which can be found in [20]. Results on two image datasets (MNIST [13] and Fashion MNIST [29]) are discussed in Section 5, and we show how the proposed model compares to other deep part-based representations. An example on hyperspectral images is illustrated as well. We finally draw conclusions and suggest several tracks for future work. The code for reproducing our experiments is available online¹.

Non-negative sparse mathematical morphology
The present work finds its original motivation in [28], where the authors set the problem of learning a representation of a large image dataset to quickly compute approximations of morphological operators on the images. They find a good representation in the sparse variant of Non-negative Matrix Factorization (sparse NMF) [11], which we present hereafter.
Consider a family of M images (binary or gray-scale) x^{(1)}, x^{(2)}, ..., x^{(M)} of N pixels each, aggregated into an M × N data matrix X = (x^{(1)}, x^{(2)}, ..., x^{(M)})^T (the i-th row of X is the transpose of x^{(i)} seen as a vector). Given a feature dimension k ∈ ℕ* and two numbers s_H and s_W in [0, 1], a sparse NMF of X with dimension k, as defined in [11], is any solution (H, W) of the problem

$$\min_{H, W} \|X - HW\|_F^2 \quad \text{s.t.} \quad H \ge 0,\ W \ge 0,\ \sigma(H_{:,j}) = s_H,\ \sigma(W_{j,:}) = s_W,\ 1 \le j \le k, \tag{1}$$

where the second constraint means that both H and W have non-negative coefficients, and the third constraint imposes the degree of sparsity of the columns of H and rows of W respectively, with σ the function defined, for a vector v with n components, by

$$\sigma(v) = \frac{\sqrt{n} - \|v\|_1 / \|v\|_2}{\sqrt{n} - 1}. \tag{2}$$

Note that σ takes values in [0, 1]. The value σ(v) = 1 characterizes vectors v having a unique non-zero coefficient, therefore the sparsest ones, and σ(v) = 0 the vectors whose coefficients all have the same absolute value. Hoyer [11] designed an algorithm to find at least a local minimizer for the problem (1), and it was shown that under fairly general conditions (and provided the L2 norms of H and W are fixed) the solution is unique [26].

Figure 1: A subset of 30 images, extracted from a larger synthetic dataset of 1000 images, built as non-negative linear combinations of the five atom images of Figure 2(a). Although some images may look identical they are not, as the gray levels slightly differ.
In representation learning, each row h^{(i)} of H is called the encoding or latent features of the input image x^{(i)}, and W holds in its rows a set of k images called the dictionary. In the following, we will refer to the images w_j = W_{j,:} of the dictionary as atom images or atoms. As stated by Equation (1), the atoms are combined to approximate each image x^{(i)} := X_{i,:} of the dataset by an estimate x̂^{(i)}, which writes as follows:

$$\hat{x}^{(i)} = \sum_{j=1}^{k} h_{i,j}\, w_j, \tag{3}$$

where h_{i,j} is the coefficient at row i and column j in matrix H (see Figures 3 and 4 for illustration). The assumption behind this decomposition is that the more similar the images of the set, the smaller the required dimension to accurately approximate this set. Note that only k(N + M) values need to be stored or handled when using the previous approximation to represent the data, against the NM values composing the original data.
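Hoyer's sparsity measure σ is straightforward to implement; below is a minimal numpy sketch (the function name `hoyer_sparsity` is ours):

```python
import numpy as np

def hoyer_sparsity(v):
    """Hoyer's sparsity measure: 1 for a one-hot vector, 0 for a constant one."""
    n = v.size
    l1 = np.abs(v).sum()
    l2 = np.sqrt((v ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

one_hot = np.array([0.0, 0.0, 3.0, 0.0])   # unique non-zero coefficient
uniform = np.array([1.0, 1.0, 1.0, 1.0])   # all coefficients equal
print(hoyer_sparsity(one_hot))  # → 1.0
print(hoyer_sparsity(uniform))  # → 0.0
```

The same function, applied row-wise to H or W, gives the quantities constrained to s_H and s_W in problem (1).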
For illustration purposes, we propose a toy example. We generated a dataset of 1000 images of size 32×32 pixels, as non-negative linear combinations of the five atom images shown in Figure 2(a). We call this dataset the Rectangles dataset and show 30 samples of it in Figure 1. Here the matrix X counts M = 1000 rows and N = 32 × 32 = 1024 columns. We apply the sparse NMF algorithm to recover five atoms (stored in a non-negative matrix W ∈ ℝ^{5×1024}_+) and 1000 encodings (stored in a non-negative matrix H ∈ ℝ^{1000×5}_+) such that X̂ = HW approximates X well. The five recovered atoms are shown in Figure 2(b), and Figure 3 shows two examples of approximate non-negative reconstructions. Note that the excellent results here are due to the 1000 images of the Rectangles dataset being created precisely as sparse, non-negative combinations of only five, pairwise disjoint, atoms. As such, it comes close to verifying the hypotheses under which the NMF yields a unique and accurate part-based representation of data [6]. In the remainder of the paper we will no longer work with this dataset and focus on more realistic data.

Figure 3: The leftmost images are the x^{(i)}, the second column images are their approximations x̂^{(i)}, the gray coefficients are the h_{i,j} and the other images are the five computed atoms w_j, 1 ≤ j ≤ 5, also shown in Figure 2(b).
By choosing the sparse NMF representation, the authors of [28] aim at approximating a morphological operator ψ on the data X by applying it to the atom images W only, before projecting back into the input image space. That is, they want Φ(x^{(i)}) ≈ ψ(x^{(i)}), with Φ(x^{(i)}) defined by

$$\Phi(x^{(i)}) = \sum_{j=1}^{k} h_{i,j}\, \psi(w_j), \tag{4}$$

where the h_{i,j} and w_j are the same as in Equation (3). The operator Φ in Equation (4) is called a part-based approximation to ψ. To understand why non-negativity and sparsity help this approximation to be a good one, we can point out a few key arguments. First, sparsity favors the supports of the weighted atom images having little pairwise overlap. Secondly, a sum of images with disjoint supports is equal to their (pixel-wise) supremum. Finally, dilations commute with the supremum and, under certain conditions that are favored by sparsity, this also holds for the erosions. This will be developed in more detail in Section 3. For now, Figure 4 illustrates the part-based approximation D_B of the dilation δ_B by a structuring element B, expressed as:

$$D_B(x^{(i)}) = \sum_{j=1}^{k} h_{i,j}\, \delta_B(w_j). \tag{5}$$
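This part-based approximation of the dilation can be sketched in a few lines of numpy/scipy (a toy illustration of ours, not the paper's setup; the cross-shaped structuring element stands for a disk of radius one):

```python
import numpy as np
from scipy.ndimage import grey_dilation

CROSS = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]], dtype=bool)  # flat structuring element B

def dilate(x):
    return grey_dilation(x, footprint=CROSS)

def part_based_dilation(h, atoms):
    """D_B(x) = sum_j h_j * delta_B(w_j): dilate the atoms, then recombine."""
    return sum(hj * dilate(w) for hj, w in zip(h, atoms))

# Two atoms with disjoint supports, far enough apart for the supports
# of the dilated atoms to stay disjoint
w1 = np.zeros((8, 8)); w1[1:3, 1:3] = 1.0
w2 = np.zeros((8, 8)); w2[5:7, 5:7] = 1.0
h = np.array([0.8, 0.5])
x = h[0] * w1 + h[1] * w2          # exactly reconstructed image
exact = dilate(x)
approx = part_based_dilation(h, [w1, w2])
print(np.allclose(exact, approx))  # → True
```

The equality holds here because the dilated weighted atoms have disjoint supports, so their sum coincides with their supremum, with which the dilation commutes.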

Deep auto-encoders approaches
The main drawback of the NMF algorithm is that it is an offline process, and the encoding of any new sample with regard to the previously learned basis W requires either solving a computationally extensive constrained optimization problem, or relaxing the non-negativity constraint by using the pseudo-inverse W⁺ of the basis. Some approaches proposed to overcome this shortcoming rely on deep learning, and especially on deep auto-encoders, which are widely used in the representation learning field, and offer an online representation process [8,10,15]. An auto-encoder, as represented in Figure 5, is a model composed of two stacked neural networks, an encoder and a decoder, whose parameters are trained by minimizing a loss function. A common example of loss function is the mean square error (MSE) between the input images x^{(i)} and their reconstructions x̂^{(i)} by the decoder:

$$L_{MSE} = \frac{1}{M} \sum_{i=1}^{M} \| x^{(i)} - \hat{x}^{(i)} \|_2^2. \tag{6}$$

Part-based approximation
In this framework, and when the decoder is composed of a single linear layer (possibly followed by a non-linear activation function), the model approximates the input images as:

$$\hat{x}^{(i)} = f(W h^{(i)} + b), \tag{7}$$

where h^{(i)} is the encoding of the input image by the encoder network, b and W respectively the bias and weights of the linear layer of the decoder, and f the (possibly non-linear) activation function, applied pixel-wise to the output of the linear layer. The output x̂^{(i)} is called the reconstruction of the input image x^{(i)} by the auto-encoder. It can be considered as a linear combination of atom images, up to the addition of an offset image b and to the application of the activation function f. The images of our learned dictionary are hence the columns of the weight matrix W of the decoder. We can extend the definition of part-based approximation, described in Section 2.1, to our deep learning architectures, by applying the morphological operator to these atoms w_1, ..., w_k, as pictured in Figure 5. Note that a central question lies in how to set the size k of the latent space. This question is beyond the scope of this study and the value of k will be arbitrarily fixed (we take k = 100) in the following.
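The shallow decoder's reconstruction step can be sketched as follows (a toy numpy version of ours; for convenience the atoms are stored as the rows of W here, i.e. the transpose of the convention in the text):

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    """Identity for positive entries, slope alpha for negative ones."""
    return np.where(z >= 0, z, alpha * z)

def decode(h, W, b, f=leaky_relu):
    """Shallow decoder: x_hat = f(h @ W + b); the rows of W are the atoms."""
    return f(h @ W + b)

rng = np.random.default_rng(0)
k, N = 5, 16                # latent dimension and number of pixels (toy sizes)
W = rng.random((k, N))      # non-negative atom images
b = np.zeros(N)             # offset image
h = rng.random(k)           # non-negative encoding of one image
x_hat = decode(h, W, b)
print(x_hat.shape)          # → (16,)
```

With non-negative h, W and b, the pre-activation is non-negative, so the leaky ReLU acts as the identity and the reconstruction is exactly a non-negative combination of the atoms plus the offset.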
The NNSAE architecture (for Non-Negative Sparse Auto-Encoder), from Lemme et al. [15], proposes a very simple and shallow architecture for online part-based representations, using a linear encoder and decoder with tied weights (the weight matrix of the decoder is the transpose of the weight matrix of the encoder). Both the NCAE architecture (Non-negativity-Constrained Auto-Encoder), from Hosseini-Asl et al. [10], and the work from Ayinde et al. [2], which aims at extending it, drop this transpose relationship between the weights of the encoder and of the decoder, increasing the capacity of the model. These three networks enforce the non-negativity of the elements of the representation, as well as the sparsity of the image encodings, using various techniques.

Enforcing sparsity of the encoding
The most prevalent idea to enforce sparsity of the encoding in a neural network can be traced back to the work of H. Lee et al. [14]. This variant penalizes, through the loss function, a deviation S of the expected activation of each hidden unit (i.e. the output units of the encoder) from a low fixed level p. Intuitively, this should ensure that each of the units of the encoding is activated only for a limited number of images. The resulting loss function of the sparse auto-encoder is then:

$$L_{sparse} = L_{MSE} + \beta \sum_{j=1}^{k} S(\hat{p}_j, p), \tag{8}$$

where \hat{p}_j is the mean activation of the j-th hidden unit over the training set, the parameter p sets the expected activation objective of each of the hidden neurons, and the parameter β controls the strength of the regularization. The function S can be of various forms, which were empirically surveyed in [31]. The approach adopted by the NCAE [10] and its extension [2] both rely on a penalty function based on the KL-divergence between two Bernoulli distributions, whose parameters are the expected activation \hat{p}_j and p respectively, as used in [10]:

$$S(\hat{p}_j, p) = p \log\frac{p}{\hat{p}_j} + (1 - p) \log\frac{1 - p}{1 - \hat{p}_j}. \tag{9}$$

The NNSAE architecture [15] introduces a slightly different way of enforcing the sparsity of the encoding, based on a parametric logistic activation function at the output of the encoder, whose parameters are trained along with the other parameters of the network.
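A minimal numpy sketch of this KL-based penalty (the function name is ours; `H` holds a batch of encodings, one row per image):

```python
import numpy as np

def kl_sparsity_penalty(H, p=0.05, eps=1e-8):
    """KL divergence between Bernoulli(p) and Bernoulli(p_hat_j), summed over
    hidden units; p_hat_j is the mean activation of unit j over the batch."""
    p_hat = np.clip(H.mean(axis=0), eps, 1 - eps)
    return np.sum(p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat)))

H = np.full((10, 4), 0.05)        # every unit activates at the target level p
print(kl_sparsity_penalty(H))     # → 0.0 (no penalty when p_hat equals p)
```

Scaled by β and added to the reconstruction loss, this term penalizes hidden units whose average activation drifts away from the low target p.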

Enforcing non-negativity of the decoder weights
For the NMF (Section 2.1) as for the decoder, non-negativity results in a part-based representation of the input images. In the case of neural networks, enforcing the non-negativity of the weights of a layer eliminates cancellations of input signals. In all the aforementioned works, the encoding is non-negative since the activation function at the output of the encoder is a sigmoid. In the literature, various approaches have been designed to enforce weight positivity. A popular approach is to use an asymmetric weight decay, added to the loss function of the network, to enact more decay on the negative weights than on the positive ones. However this approach, used in both the NNSAE [15] and NCAE [10] architectures, does not ensure that all weights will be non-negative. This issue motivated the variant of the NCAE architecture [2], which uses either the L1 rather than the L2 norm, or a smooth version of the decay using both the L1 and the L2 norms. The source code of this method being unavailable at the time the present work was done, we did not use this more recent version as a baseline for our study.
Another type of approach consists in initializing the decoder weights with non-negative values and ensuring they remain so after each update during the optimization process. The simplest strategy, as implemented in the projected gradient descent [5], is to project the weights onto the positive orthant by setting negative components to zero. More recently, the exponentiated gradient descent was proposed as an alternative to the projected gradient descent [8]. The idea is to update the weights by multiplying them by a positive coefficient, which is an exponentially decreasing function of the partial derivative of the loss with respect to the weights. Although promising, the latter proposition does not include any sparsity constraint and the authors provide no quantitative measure of image reconstruction errors.
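The projection step of the projected gradient descent can be sketched as follows (a minimal version of one update; the names are ours):

```python
import numpy as np

def projected_sgd_step(W, grad, lr=0.01):
    """One gradient step followed by projection onto the non-negative orthant."""
    W = W - lr * grad
    return np.maximum(W, 0.0)      # negative components are set to zero

W = np.array([[0.2, 0.05],
              [0.3, 0.1]])
grad = np.array([[10.0, -1.0],
                 [0.0, 20.0]])
W_new = projected_sgd_step(W, grad, lr=0.01)
print(W_new)   # all entries remain >= 0
```

Whatever the gradient, the projection guarantees an exactly non-negative weight matrix after every update, with no regularization hyper-parameter to tune.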
As far as non-negativity of weights is concerned, we may also mention [30], which uses an optimization process inspired by the NMF to satisfy the non-negative probability constraints of Random Neural Networks stacked in auto-encoders.
We will present in Section 4 our own auto-encoder solution for an online, non-negative and sparse representation of data, compatible with the approximation of morphological operators. In the next section we provide some mathematical insights on how non-negativity and sparsity are connected to such an approximation.

Equivariance of morphological operators to non-negative linear combinations
In this section we make precise the intuitions sketched in Section 2.1 about the part-based approximation of morphological operators. Let L be the complete lattice of images with N pixels and with values in [0, +∞], ordered by the Pareto ordering (x ≤ y iff for any q, 1 ≤ q ≤ N, x_q ≤ y_q). Consider a flat, extensive dilation δ_B on L and its adjoint anti-extensive erosion ε_B, B being a flat structuring element. Let x ∈ L be an image approximated by the non-negative combination x̂ = ∑_{j=1}^k h_j w_j of k atom images w_1, ..., w_k ∈ L. Following Equation (4), we define the part-based approximations of the four operators δ_B, ε_B, the opening γ_B = δ_B ε_B and the closing φ_B = ε_B δ_B as:

$$D_B(x) = \sum_{j=1}^{k} h_j \delta_B(w_j), \quad E_B(x) = \sum_{j=1}^{k} h_j \varepsilon_B(w_j), \quad G_B(x) = \sum_{j=1}^{k} h_j \gamma_B(w_j), \quad F_B(x) = \sum_{j=1}^{k} h_j \varphi_B(w_j).$$

We focus on establishing whether these expressions approximate well their exact counterparts δ_B(x), ε_B(x), γ_B(x) and φ_B(x), assuming x is well approximated by x̂ = ∑_{j=1}^k h_j w_j = Wh. It is likely to be so as soon as, for instance, δ_B(Wh) = ∑_{j=1}^k h_j δ_B(w_j), and similarly for the three other operators, which is to say as soon as the four operators commute with the non-negative linear application h ↦ Wh = ∑_{j=1}^k h_j w_j, where W = [w_1, ..., w_k]. As sketched earlier, sums can be identified to suprema if the involved images have disjoint supports, and this also favors the commutation of the erosion with the supremum. This is why we introduce the following hypothesis that characterizes the disjunction of the supports (i.e. the regions where the image is non-zero) of the h_j w_j.
Let H1 denote the hypothesis:

H1: "For any i ≠ j, δ_B(h_i w_i) ∧ δ_B(h_j w_j) = 0,"

where 0 denotes an image equal to zero everywhere (i.e. with empty support), and more generally, for an integer n,

Hn: "For any i ≠ j, δ_B^n(h_i w_i) ∧ δ_B^n(h_j w_j) = 0,"

where δ_B^n denotes n successive applications of δ_B, and δ_B^0 is the identity for n = 0. Note that, since δ_B is extensive, Hn implies any Hp with p ≤ n. In particular, any Hn implies H0, which simply states the disjunction of the supports of any two images h_i w_i and h_j w_j, i ≠ j. We can now state the following result.

Proposition 1. If H1 holds for the representation x̂ = ∑_{j=1}^k h_j w_j, then:

$$\delta_B(\hat{x}) = \sum_{j=1}^{k} h_j \delta_B(w_j) \quad \text{and} \quad \varepsilon_B(\hat{x}) = \sum_{j=1}^{k} h_j \varepsilon_B(w_j).$$

If additionally H2 holds, then we also have:

$$\gamma_B(\hat{x}) = \sum_{j=1}^{k} h_j \gamma_B(w_j) \quad \text{and} \quad \varphi_B(\hat{x}) = \sum_{j=1}^{k} h_j \varphi_B(w_j).$$

A proof of this result is detailed in Appendix A. Proposition 1 implies that under the Hn hypotheses the error ||ψ_B(x) − Ψ_B(x)||_2 between the actual transformed image and its part-based approximation (ψ standing for any of the four operators and Ψ for its approximation) only depends on the quality of the reconstruction, that is to say on the error ||x − x̂||_2. Obviously, the more constrained the representation, the smaller the class of images that can be accurately represented. The non-negativity and sparsity constraints are therefore likely to increase the representation error ||x − x̂||_2. Hence, unless the data can be perfectly represented by non-negative combinations of atoms complying with a hypothesis Hn, a trade-off needs to be found to achieve a good approximation of morphological operators. This is the target of our asymmetric auto-encoder presented in Section 4.
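These hypotheses are easy to probe numerically. The sketch below (a toy example of ours, using scipy's gray-scale dilation as δ_B with a cross-shaped B) checks that a sum of images with disjoint supports equals their supremum, and that H0 alone does not guarantee the additivity of the dilation, while H1 does:

```python
import numpy as np
from scipy.ndimage import grey_dilation

CROSS = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]], dtype=bool)
dil = lambda im: grey_dilation(im, footprint=CROSS)

# Disjoint supports: the sum equals the pixel-wise supremum
x = np.zeros((1, 8)); x[0, 1] = 1.0
y = np.zeros((1, 8)); y[0, 6] = 1.0
print(np.array_equal(x + y, np.maximum(x, y)))        # → True (H0 holds)

# H0 alone is not enough: adjacent supports break the additivity of dilation
y2 = np.zeros((1, 8)); y2[0, 2] = 1.0                  # disjoint from x, but adjacent
print(np.array_equal(dil(x + y2), dil(x) + dil(y2)))   # → False (only H0)

# With the dilated supports also disjoint, additivity holds
print(np.array_equal(dil(x + y), dil(x) + dil(y)))     # → True (H1 holds)
```

In the second case the dilated supports of x and y2 overlap, so the sum of the dilations double-counts the overlap, which is exactly the failure mode that H1 rules out.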
We shall now generalize Proposition 1 by applying it to the representation that we note x̂^{(n−1)} = ∑_{j=1}^k h_j δ_B^{n−1}(w_j). Notice that H1 holds for x̂^{(n−1)} if and only if Hn holds for x̂. This yields the following corollary.

Corollary 1. If Hn holds for the representation x̂ = ∑_{j=1}^k h_j w_j, then for any integer p ≤ n:

$$\delta_B^p(\hat{x}) = \sum_{j=1}^{k} h_j \delta_B^p(w_j) \quad \text{and} \quad \varepsilon_B^p(\hat{x}) = \sum_{j=1}^{k} h_j \varepsilon_B^p(w_j),$$

and for any integer p ≤ n − 1:

$$\gamma_B^p(\hat{x}) = \sum_{j=1}^{k} h_j \gamma_B^p(w_j) \quad \text{and} \quad \varphi_B^p(\hat{x}) = \sum_{j=1}^{k} h_j \varphi_B^p(w_j).$$

Remarks
Choice of the complete lattice L. At the beginning of this section we chose L as the complete lattice of images with N pixels and with values in [0, +∞], ordered by the Pareto ordering. However, in practice we deal more commonly with images whose values are in a bounded interval such as [0, 1]. The previous results still hold in the latter case, provided we add the hypothesis h_j ∈ [0, 1]. More generally, we only need to make sure that w ∈ L ⇒ hw ∈ L.

Interpretation of Hn. The hypothesis Hn, n ≥ 0, characterizes the degree of disjunction of the supports of the h_j w_j involved in the part-based approximation of an image x. The dilation δ_B being extensive, the degree of disjunction, intended as distance between supports of the initial images, "increases" with n. Note that no assumption is made on the disjunction of the whole set of atom images w_j, but only on those atoms that are used in the approximation of x, in other words the w_j weighted by a positive h_j. This helps realize that the number of atoms used to approximate an image matters. In the limit case where only one atom is used, Hn is verified for any n. By contrast, if as many as N atoms contribute to the approximation, then even H1 becomes impossible. In the context of the representation of a large dataset, the ideal case seems to be when every image is well approximated by few atoms, as disjoint as possible. This indicates that the Hn are not unrealistic hypotheses in practice, provided a sparse part-based representation approximates well the data, and nB is small enough compared to the supports of the atoms.
How necessary is Hn? The proof of Proposition 1 mainly stands on points 3 and 5 (see Appendix A). Therefore, we may ask whether the hypothesis δ_B(x) ∧ δ_B(y) = 0 is necessary to have δ_B(x + y) = δ_B(x) + δ_B(y) and ε_B(x + y) = ε_B(x) + ε_B(y), which comes down to questioning the necessity of H1 and H2 in Proposition 1, or Hn in the corollary. The answer is that they are not necessary in general. For example, for any increasing function g : ℝ₊ → ℝ₊ and y = (g(x_1), ..., g(x_N)) such that x + y ∈ L, we do have δ_B(x + y) = δ_B(x) + δ_B(y) and ε_B(x + y) = ε_B(x) + ε_B(y). However, if we consider rather "independent" components, it is easy to build fairly general configurations where a certain degree of disjunction is necessary. In particular, as shown in the examples of Figures 6 and 7, a simple disjunction (corresponding to H0) is not sufficient in general.

This section was meant to make mathematically precise the role played by sparsity and non-negativity in the part-based approximation of morphological operators. Motivated by the previous approaches described in Section 2.2, we present in the next section our proposed auto-encoder, designed to achieve the desired trade-off between explainability, accuracy of the data reconstruction and accuracy of the approximation of morphological operators.

Proposed model
We propose an online part-based representation learning model, using an asymmetric auto-encoder with sparsity and non-negativity constraints. As pictured in Figure 8, our architecture is composed of two networks: a deep encoder and a shallow decoder (hence the asymmetry and the name AsymAE we chose for our architecture). The encoder network is based on the discriminator of the infoGAN architecture introduced in [4], which was chosen for its average depth and its use of widely adopted deep learning components such as batch normalization [12], 2D convolutional layers [7] and the leaky-ReLU activation function [16]. It has been designed specifically to perform interpretable representation learning on datasets such as MNIST and Fashion-MNIST. The network can be adapted to fit larger images. The decoder network is similar to the one presented in Figure 5. A leaky-ReLU activation has been chosen after the linear layer. Its behavior is the same as the identity for positive entries, while it multiplies the negative ones by a fixed coefficient α_lReLU = 0.1. This activation function has shown the best performance in similar architectures [16]. The sparsity of the encoding is achieved using the same approach as in [2,10], which consists in adding to the previous loss function the regularization term described in Equations (8) and (9). We only enforced the non-negativity of the weights of the decoder, as they define the dictionary of images of our learned representation, and as enforcing the non-negativity of the encoder weights would bring nothing but more constraints to the network and lower its capacity. Similarly to [5], we enforced this non-negativity constraint explicitly by projecting the weights onto the nearest points of the positive orthant after each update of the optimization algorithm (such as stochastic gradient descent). The main asset of this method, which does not use any additional penalty function and is quite similar to the way the NMF enforces non-negativity, is that it ensures the positivity of all weights without the cumbersome search for good values of the parameters of the various regularization terms in the loss function.

Experiment 1 on MNIST and Fashion MNIST
To demonstrate the strengths and drawbacks of our method, we conducted experiments on two well-known datasets, MNIST [13] and Fashion MNIST [29]. These two datasets share common features, such as the size of the images (28 × 28), the number of classes represented (10), and the total number of images (70000), divided into a training set of 60000 images and a test set of 10000 images.

Setting the parameters
For our AsymAE algorithm, we studied the effect of the sparsity objective p and regularization weight β in the loss function in Equation (8). In Figure 9 we present the results of the proposed approach on the Fashion-MNIST dataset. The maximum of the sparsity measure was reached with the sparsity parameters p = 0.01 and β = 0.01; the corresponding atoms are shown in Figure 10e. It appears that these atoms are closer to full clothes shapes than to parts. A possible interpretation is that, as the sparsity constraint gets stronger, the model is pushed to the limit where an atom should be involved in the reconstruction of a proportion p of the training images, that is approximately p·M images. When the number k of atoms is much smaller than p·M (which is the case for k = 100, p = 0.01 and M = 60000), each atom needs to be shared by a whole subset of images as their unique (or almost unique) representative. The model is therefore performing some sort of k-means clustering, each atom being a barycenter of a subgroup of the training set.
In Figure 10 we show examples of atom images for other values of the sparsity parameters. The representations shown in Figures 10b, 10c and 10d are quite close to a part-based representation, even though the supports of the atom images are less disjoint than they would be in an ideal part-based representation, such as the sparse NMF, whose atom images are very neat. From this visual inspection as well as the plots of Figure 9, we found that a better trade-off seems to be reached for the values p = 0.05 and β = 0.0005 in the case of the Fashion-MNIST dataset. A similar study led us to choose p = 0.05 and β = 0.001 for the MNIST dataset.

Comparison to state of the art methods
We compared our method to three baselines: the sparse-NMF [11], the NNSAE [15], and the NCAE [10]. The three deep-learning models (the proposed AsymAE, NNSAE and NCAE) were trained until convergence on the training set, and evaluated on the test set. The sparse-NMF algorithm was run and evaluated on the test set. Note that all models but the NCAE may produce reconstructions that do not fully belong to the interval [0, 1]. In order to compare the reconstructions and the part-based approximations produced by the various algorithms, their outputs will be clipped between 0 and 1. There is no need to apply this operation to the output of the NCAE, as a sigmoid activation enforces the output of its decoder to belong to [0, 1]. We used three measures to conduct this comparison:
- the reconstruction error, that is the pixel-wise mean squared error between the input images x^{(i)} of the test dataset and their reconstruction/approximation x̂^{(i)}: (1/M) ∑_{i=1}^M (1/N) ∑_{j=1}^N (x_j^{(i)} − x̂_j^{(i)})²;
- the sparsity of the encoding, measured using the mean over all test images of the sparsity measure σ in Equation (2): (1/M) ∑_{i=1}^M σ(h^{(i)});
- the approximation error to the dilation by a disk of radius one, obtained by computing the pixel-wise mean squared error between the dilation δ_B of the original image and the part-based approximation D_B to the same dilation, using the learned representation: (1/M) ∑_{i=1}^M (1/N) ∑_{j=1}^N (δ_B(x^{(i)})_j − D_B(x^{(i)})_j)².

The parameter settings used for the NCAE and the NNSAE algorithms are the ones provided in [10,15]. For the sparse-NMF, a sparsity constraint of S_h = 0.6 was applied to the encodings and no sparsity constraint was applied on the atoms of the representation. The quantitative results (Table 1) and the reconstruction images (Figure 11) demonstrate the capacity of our model to reach a better trade-off than the other neural architectures between the accuracy of the reconstruction and the sparsity of the encoding (which usually comes at the expense of the former criterion). Indeed, in all conducted experiments, varying the parameters of the
NCAE and the NNSAE as an attempt to increase the sparsity of the encoding came with a dramatic increase of the reconstruction error of the model. We failed however to reach a trade-off as good as the sparse-NMF algorithm, which manages to match a high sparsity of the encoding with a low reconstruction error, especially on the Fashion-MNIST dataset. The major difference between the algorithms can be seen in Figure 12, which pictures 16 of the 100 atoms of each of the four learned representations. While the sparse-NMF manages, for both datasets, to build highly explainable and clean part-based representations, the two deep baselines build representations that picture either too local shapes, in the case of the NNSAE, or too global ones, in the case of the NCAE. Our method suffers from quite the same issues as the NCAE, as almost full shapes are recognizable in the atoms. We noticed through experiments that increasing the sparsity of the encoding leads to less and less local features in the atoms. It has to be noted that the L2 asymmetric weight decay regularization used by the NCAE and NNSAE models allows for a certain proportion of negative weights. As an example, up to 32.2% of the pixels of the atoms of the NCAE model trained on the Fashion-MNIST dataset are negative, although their amplitude is lower than the average amplitude of the positive weights. The amount of negative weights can be reduced by increasing the corresponding regularization, which comes at the price of an increased reconstruction error and less sparse encodings. Finally, Figure 13 pictures the part-based approximation to the dilation by a structuring element of size one, computed using the four different approaches on ten images from the test set. Although the quantitative results state otherwise, we can note that our approach yields an interesting part-based approximation, thanks to a good balance between a low overlapping of atoms (and dilated atoms) and a good reconstruction capability.
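The three evaluation measures can be sketched as follows (a hypothetical implementation with function names of our own; the disk of radius one is taken as the cross-shaped neighborhood):

```python
import numpy as np
from scipy.ndimage import grey_dilation

DISK1 = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]], dtype=bool)  # disk of radius one

def reconstruction_error(X, X_hat):
    """Pixel-wise mean squared error over the dataset."""
    return np.mean((X - X_hat) ** 2)

def mean_encoding_sparsity(H):
    """Mean of Hoyer's sparsity measure over the rows (encodings) of H."""
    n = H.shape[1]
    l1 = np.abs(H).sum(axis=1)
    l2 = np.sqrt((H ** 2).sum(axis=1))
    return np.mean((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1))

def dilation_approx_error(images, H, atoms):
    """MSE between the exact dilation and its part-based approximation."""
    dilated_atoms = np.stack([grey_dilation(w, footprint=DISK1) for w in atoms])
    errors = []
    for x, h in zip(images, H):
        exact = grey_dilation(x, footprint=DISK1)
        approx = np.tensordot(h, dilated_atoms, axes=1)
        errors.append(np.mean((exact - approx) ** 2))
    return float(np.mean(errors))
```

On a toy dataset whose images are exact combinations of disjoint atoms, the third measure is zero, consistently with Proposition 1.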

Experiment 2: the Pavia University hyperspectral image
In order to test our approach on more realistic and complex data, we carried out an experiment on the Pavia University hyperspectral image², of spatial size 610 × 340 pixels and containing M = 103 spectral bands (Figure 14). For memory reasons, and in order to take advantage of the previous experiment, we divided each channel image into 9 × 5 = 45 non-overlapping 64 × 64 patches, covering 576 × 320 pixels starting from the top left-hand corner. The database thus counted 45 × 103 = 4635 patches, which we split into a training set and a test set by dedicating a fixed proportion ρ ∈ [0, 1] of the spectral bands to the training. This means the patches of a given spectral band were all assigned to the training set, or all to the test set. What is more, the spectral bands assigned to the test set were sampled regularly (not randomly). We trained on these data the asymmetric auto-encoder presented earlier, with the same latent dimension (k = 100), the same parameter p = 0.05 but a larger β = 0.005. For comparison, as before, we also trained the sparse-NMF [11] and the NCAE [10] models, with the same parameters as before (those suggested by the authors). Despite all our attempts, we did not succeed in training the NNSAE [15] model to achieve sufficiently good performance to be meaningfully compared to the other models. This might be a limitation of the model, but could also be a misunderstanding on our part on how to set its parameters properly. We decided anyway not to report the obtained results, which were well below those presented hereafter. The two other deep-learning models were trained until they reached a reconstruction error of approximately 10⁻³ on the test set. Regarding the sparse NMF, we observed that both the reconstruction error and the sparsity of the encoding could be easily controlled, and high-quality results could be achieved that were out of reach for the online methods, at least during the tests we ran. Therefore, the sparse NMF shall be considered as a reference for the
online methods, and this is why here we decided to apply it a posteriori to the whole dataset (training set and test set) targeting the best performance of the online models: a reconstruction error of approximately 10 −3 and a sparsity of the encoding of approximately 0.7 (we set S h = 0.7).In this comparison, the training set represented ρ = 6/7 of the whole set of patches.
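The band-wise train/test split described above can be sketched as follows (a minimal sketch under our reading of the text: the function and variable names are ours, and the exact regular-sampling rule for the test bands is an assumption):

```python
import numpy as np

def split_band_patches(cube, patch=64, rho=6/7):
    """Cut each spectral band of an (H, W, M) cube into non-overlapping
    patch x patch tiles (starting from the top-left corner), then assign
    whole bands either to the training set or to the test set."""
    H, W, M = cube.shape
    ny, nx = H // patch, W // patch        # 9 x 5 = 45 patches for 610 x 340
    patches = np.stack([
        cube[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch, b]
        for b in range(M) for i in range(ny) for j in range(nx)
    ]).reshape(M, ny * nx, patch, patch)
    # Test bands are sampled regularly (one band out of `step`), not
    # randomly; assumes 0 < rho < 1.
    step = round(1 / (1 - rho))
    test_bands = np.arange(0, M, step)
    train_bands = np.setdiff1d(np.arange(M), test_bands)
    return (patches[train_bands].reshape(-1, patch, patch),
            patches[test_bands].reshape(-1, patch, patch))
```

With ρ = 6/7 and the 610 × 340 × 103 Pavia cube, this yields 45 × 103 = 4635 patches, of which 88 bands (3960 patches) go to training and 15 bands (675 patches) to testing.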
Since the present experiment applies to richer data, the methods are compared on the four basic morphological operators (dilation, erosion, opening and closing) with several sizes of structuring elements. In order to enhance the differences across methods, we present the quality of the morphological approximations through the Peak Signal to Noise Ratio (PSNR), defined here by

PSNR = −10 log₁₀(MSE),    (11)

where MSE is the pixel-wise mean squared error between the actual morphological operator and its part-based approximation. We recall that this comparison was made among models achieving a similar reconstruction error on the original images (≈ 10⁻³) and a similar level of sparsity of the encoding (0.72 for our AsymAE, 0.75 for the NCAE and the Sparse NMF). The plots of Figure 15 and Table 2 sum up the results, whereas Figures 16-20 provide visual examples for a structuring element of size three.
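Equation (11) is immediate to implement for images with values in [0, 1] (a minimal sketch; the function name is ours):

```python
import numpy as np

def psnr(target, approx):
    """PSNR = -10 * log10(MSE), Eq. (11), for images with values in [0, 1].
    MSE is the pixel-wise mean squared error; higher PSNR is better."""
    mse = np.mean((np.asarray(target, float) - np.asarray(approx, float)) ** 2)
    return -10.0 * np.log10(mse)
```

For instance, a uniform error of 0.1 on every pixel gives an MSE of 0.01 and hence a PSNR of 20 dB.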
In general, the Sparse NMF achieves the best part-based approximations and our model (AsymAE) is the best online method. This holds except for the erosions of size two and onward, and the openings of size three and onward, where the NCAE achieves better PSNRs than the AsymAE (and sometimes even than the Sparse NMF). This seems surprising, as the visual examples for a structuring element of size three (Figures 16-20) do not show a better accuracy for the NCAE. These exceptions might have the same cause as the U-shape of the erosion curve, which we observe for all methods: for darker images, such as the erosions and openings by large structuring elements, the PSNR tends to favor over-dark approximations. In the limit case, it seems that approximating such dark images by a constant zero-valued image yields a better PSNR than an approximation which would try to keep some structure.

As for the atom images, shown in Figures 21-23, they might not correspond to the intuition of a part-based representation, as their supports are quite extended. However, there seems to be approximately one scale represented per atom, as in a granulometry decomposition, which is also a possible approximation of a part-based representation. Furthermore, we note that the NCAE's atoms are the noisiest whereas the Sparse NMF's are the least noisy. Another important remark is that the Sparse NMF could achieve even better results, but it still has the drawback of being an offline method. By contrast, it is remarkable that both the NCAE and the AsymAE maintain almost exactly the same performance when we reduce the relative size of the training set down to ρ = 0.5. We do not report the results here, as the difference with the presented ones is negligible. This shows the great interest of having a good online model when the training set is statistically representative of the whole data.

Conclusions and future works
We have presented an online method to learn a part-based dictionary representation of an image dataset, designed for accurate and efficient approximations of morphological operators. This method relies on auto-encoder networks, with a deep encoder for a higher reconstruction capability and a shallow linear decoder for a better interpretation of the representation. Among the online part-based methods using auto-encoders, it achieves the state-of-the-art trade-off between the accuracy of reconstructions and the sparsity of image encodings. Moreover, it ensures a strict (that is, non-approximated) non-negativity of the learned representation. These results would need to be confirmed on color images, as the proposed model is scalable, but the illustration on the hyperspectral image already shows the potential use of the proposed approach in real applications. We especially evaluated the learned representation on an additional criterion, namely the commutation of the representation with morphological operators, and noted that all online methods perform worse than the offline sparse-NMF algorithm. A possible improvement would be to impose a stronger sparsity on the dictionary images with an appropriate regularization. Additionally, using a morphological layer [3,21,32] as a decoder may be more consistent with our definition of part-based approximation, since a representation in the (max, +) algebra would commute with the morphological dilation by essence.
Then ε_B(x ∨ y)_i = 0 = ε_B(x)_i = ε_B(y)_i = (ε_B(x) ∨ ε_B(y))_i.

Case 2: x_i > 0. Then for any j ∈ B_i, y_j = 0: otherwise there would be j0 ∈ B_i such that y_{j0} > 0, and therefore δ_B(y)_{j0} ≥ y_{j0} > 0; since i ∈ B_{j0} we would also have δ_B(x)_{j0} ≥ x_i > 0, yielding δ_B(x)_{j0} ∧ δ_B(y)_{j0} > 0, which contradicts the initial hypothesis. We just showed that x_i > 0 ⇒ ∀j ∈ B_i, y_j = 0, which also implies y_i = 0 and ε_B(y)_i = 0. As a consequence, ∀j ∈ B_i, x_j ≥ y_j, which leads to ε_B(x ∨ y)_i = ⋀_{j∈B_i} (x_j ∨ y_j) = ⋀_{j∈B_i} x_j = ε_B(x)_i = (ε_B(x) ∨ ε_B(y))_i.

With the five points listed here above, the conclusions of Proposition 1 are straightforward. Assuming H1 is true:
- D_B(x) = ∑_{j=1}^{k} h_j δ_B(w_j) = ∑_{j=1}^{k} δ_B(h_j w_j) = δ_B(∑_{j=1}^{k} h_j w_j) = δ_B(x), where point 1 was applied in the second equality and point 3 in the third equality. The first and last equalities are definitions.
- E_B(x) = ∑_{j=1}^{k} h_j ε_B(w_j) = ∑_{j=1}^{k} ε_B(h_j w_j) = ε_B(∑_{j=1}^{k} h_j w_j) = ε_B(x), where point 1 was applied in the second equality and point 5 in the third equality. The first and last equalities are definitions.
- ⋯ = γ_B(x), where point 1 was applied (twice) in the third equality; point 3 was applied to the ε_B(h_j w_j) in the fourth equality, since the ε_B(h_j w_j) verify H1 as the h_j w_j do, and ε_B(h_j w_j) ≤ h_j w_j; and the fifth equality is given by point 5. The other equalities are definitions.
- ⋯, where point 1 was applied (twice) in the third equality. If the δ_B(h_j w_j) comply with H1, or equivalently if H2 is true, then point 5 applies and we get F_B(x) = ε_B(∑_{j=1}^{k} δ_B(h_j w_j)) = φ_B(x).
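The dilation case of Proposition 1 is easy to check numerically: when the dilated supports of the weighted atoms are pairwise disjoint, the dilation of the sum equals the sum of the dilations. A minimal sketch of such a check (scipy's grey_dilation stands in for δ_B with a flat structuring element; the atom layout is ours):

```python
import numpy as np
from scipy.ndimage import grey_dilation

B = (3, 3)  # flat 3x3 structuring element

# Two atoms whose dilated supports remain disjoint, so the hypothesis
# of Proposition 1 holds for the weighted sum below.
w1 = np.zeros((20, 20)); w1[2:5, 2:5] = 1.0
w2 = np.zeros((20, 20)); w2[12:16, 12:16] = 1.0
h1, h2 = 0.5, 2.0

lhs = grey_dilation(h1 * w1 + h2 * w2, size=B)   # dilation of the sum
rhs = h1 * grey_dilation(w1, size=B) + h2 * grey_dilation(w2, size=B)
assert np.allclose(lhs, rhs)                     # equivariance holds
```

Moving the second atom close enough for the dilated supports to overlap breaks the assertion, which is exactly the counterexample of Figure 6.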

Figure 2: (a) The five atom images used to build a dataset of 1000 images such as those of Figure 1. (b) Computed atoms by the sparse NMF on the latter dataset. Up to a permutation in indexing, the computed atoms are very similar (but not strictly identical) to the original ones.

Figure 4: Process for computing the part-based approximation to dilation, based on Equations (3) and (5).

Figure 5: The auto-encoding process and the definition of part-based approximation to a morphological operator in this framework.

Figure 6: An example of non-equivariance of the dilation to non-negative linear combination. (a) The components h_1 W_1 and h_2 W_2 are piece-wise constant, equal to h_1 > 0 (in green) and h_2 > 0 (in red) respectively, where they are non-zero. (b) Dilation of the sum δ_B(h_1 W_1 + h_2 W_2), where B is the cross structuring element shown in blue. The color yellow represents the value h_1 ∨ h_2. (c) Sum of the dilations δ_B(h_1 W_1) + δ_B(h_2 W_2). The color purple represents the value h_1 + h_2, which is larger than h_1 ∨ h_2. Thus although the two components do not overlap (H_0 holds), (b) and (c) are not equal.
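The counterexample of Figure 6 can be reproduced numerically (a sketch; scipy's grey_dilation stands in for δ_B, the cross structuring element is from the caption, and the pixel positions are ours):

```python
import numpy as np
from scipy.ndimage import grey_dilation

cross = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]])                 # the cross structuring element B

h1, h2 = 1.0, 2.0
W1 = np.zeros((11, 11)); W1[5, 5] = 1.0       # two non-overlapping components
W2 = np.zeros((11, 11)); W2[5, 7] = 1.0       # whose dilations overlap at (5, 6)

dil_of_sum = grey_dilation(h1 * W1 + h2 * W2, footprint=cross)
sum_of_dil = (grey_dilation(h1 * W1, footprint=cross)
              + grey_dilation(h2 * W2, footprint=cross))

# Where the dilations overlap: h1 v h2 on one side, h1 + h2 on the other.
assert dil_of_sum[5, 6] == max(h1, h2)        # 2.0
assert sum_of_dil[5, 6] == h1 + h2            # 3.0
```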

(a) Reconstruction error as a function of the parameter β (sparsity penalty strength). (b) Sparsity measure (Hoyer 2004) as a function of the parameter β (sparsity penalty strength).

(c) Max-approximation error to dilation (of the original images) as a function of the parameter β (sparsity penalty strength).

(d) Max-approximation error to dilation (of the reconstructed images) as a function of the parameter β (sparsity penalty strength).

Figure 9: Some evaluation measures for sparse non-negative asymmetric auto-encoders for various parameters of the sparsity regularization, using a test set not used to train the network.

Figure 10: Some atoms (out of the 100 atoms) of various versions of the proposed asymmetric auto-encoder.

Figure 13: Part-based approximation of the dilation by a structuring element of size one (first row), computed using the sparse-NMF, the NNSAE, the NCAE and the AsymAE.

Figure 14: Four bands of the Pavia University hyperspectral image and two examples of patch per band.

Figure 15: Quality of the approximation of different morphological operators on the test set (full lines) and training set (dashed lines), depending on the size of the structuring element (always a discrete disk). The quality of the approximation is expressed by the peak signal to noise ratio (PSNR, as defined by Eq. (11)). Higher is better. The size 0 corresponds to the identity operator, showing therefore the reconstruction PSNR of the corresponding method. The results on training and test sets are almost identical. Here the proportion of the training set is ρ = 6/7. The figures are also shown in Table 2.

Figure 16: Examples of test patches (first row) and their reconstructions computed using the sparse-NMF, the NCAE and the AsymAE (from top to bottom).

Figure 17: Part-based approximation of the dilation by a structuring element of size three. First row: dilation by a disc B of radius three (same patches as in Figure 16); following rows: approximation using the sparse-NMF, the NCAE and the AsymAE (from top to bottom).

Figure 18: Part-based approximation of the erosion by a structuring element of size three. First row: erosion by a disc B of radius three; following rows: approximation using the sparse-NMF, the NCAE and the AsymAE (from top to bottom).

Figure 19: Part-based approximation of the opening by a structuring element of size three. First row: opening by a disc B of radius three; following rows: approximation using the sparse-NMF, the NCAE and the AsymAE (from top to bottom).

Figure 20: Part-based approximation of the closing by a structuring element of size three. First row: closing by a disc B of radius three; following rows: approximation using the sparse-NMF, the NCAE and the AsymAE (from top to bottom).

Figure 21: Examples of atoms for the sparse NMF in the experiment on the Pavia University image.

Figure 22: Examples of atoms for the NCAE auto-encoder in the experiment on the Pavia University image (proportion of the training set: ρ = 6/7).

Figure 23: Examples of atoms for the AsymAE auto-encoder in the experiment on the Pavia University image (proportion of the training set: ρ = 6/7).

Table 1: Comparison of the reconstruction error, sparsity of encoding and part-based approximation error to dilation produced by the sparse-NMF, the NNSAE, the NCAE and the AsymAE, for both the MNIST and Fashion-MNIST datasets.

Table 2: Peak signal to noise ratio (PSNR, as defined by Eq. (11)) for the approximation of morphological operators on the test set, for different models and different sizes of structuring elements (disks). Higher is better. The size zero corresponds to the identity operator, showing therefore the reconstruction PSNR of the model. The figures can also be visualized in the plots of Figure 15.