Dictionary learning for transcriptomics data reveals type-specific gene modules in a multi-class setting

Abstract Extracting information from large biological datasets is a challenging task, due to the large data size, high-dimensionality, noise, and errors in the data. Gene expression data contains information about which gene products have been formed by a cell, thus representing which genes have been read to activate a particular biological process. Understanding which of these gene products can be related to which processes can for example give insights about how diseases evolve and might give hints about how to fight them. The Next Generation RNA-sequencing method emerged over a decade ago and is nowadays state-of-the-art in the field of gene expression analyses. However, analyzing these large, complex datasets is still a challenging task. Many of the existing methods do not take into account the underlying structure of the data. In this paper, we present a new approach for RNA-sequencing data analysis based on dictionary learning. Dictionary learning is a sparsity enforcing method that has widely been used in many fields, such as image processing, pattern classification, signal denoising and more. We show how for RNA-sequencing data, the atoms in the dictionary matrix can be interpreted as modules of genes that either capture patterns specific to different types, or else represent modules that are reused across different scenarios. We evaluate our approach on four large datasets with samples from multiple types. A Gene Ontology term analysis, which is a standard tool indicated to help understanding the functions of genes, shows that the found gene-sets are in agreement with the biological context of the sample types. Further, we find that the sparse representations of samples using the dictionary can be used to identify type-specific differences.


Introduction
Next generation RNA-sequencing (RNA-seq) uses high throughput approaches to enable genome and transcriptome profiling. Emerged over a decade ago, it is increasingly replacing the use of microarrays in the field of gene expression analyses and thereby becoming the method of choice in this discipline. Reasons are technological advances, such as increased genome coverage, reduced sequencing costs and less background noise.
Application of RNA-seq data include for instance feature selection, identification of alternative splicing and detection of differentially expressed pathways. A challenge in the analysis of RNA-seq data is its high dimensionality. This comes along with the curse of dimensionality, as usually a good performance on training data can be achieved well. Reproducibility, however, is often a problem as small changes in the data can lead to different results. Therefore the biological relevance of such results needs to be questioned. To solve this issue, methods enforcing sparsity were introduced. In sparse methods coefficients of variables related to noise or uninformative variables are set to zero. Well known representatives are Lasso [21], elastic net [25], and sparse random forest [9].
Even though gene expression data is highly structured [1,2,18,20], many applications act independent of this underlying structure. In this paper, we carry out an analysis of RNA-seq data with a variant of dictionary learning (DiL).
DiL is a constrained matrix factorisation approach that compresses a dataset, preserving most variation in the lower dimensional space. It is especially interesting in the analysis of big data as it allows to analyse the data in a compressed space. The runtime of data analysis once it is compressed is significantly smaller than an analysis of the data in the original space. At the same time, unlike other data compression methods, DiL allows to interpret the meaning of the new dimension. Further, the variant we present in this paper requires significantly less runtime than the original DiL setting (precise runtime benefits depends on the size of dataset and DiL parameters).
DiL has been widely used in many fields, such as image processing, pattern classification, and signal denoising [4,23,6]. The basic idea in DiL is, that a dataset can be well constructed from a linear combination of few modules (atoms) of some "basis" (dictionary) given that the data possesses some structure that can be represented sparsely. To our knowledge, its applications to Omics data, however, is rare.
When applying DiL to RNA-seq data from multiple sample types, the atoms can be described as the basic elements of the data (e. g. specific genes sets that are mutually activated/deactivated), that capture patterns of the analysed data. Each sample can then be reconstructed (with small errors) by a linear combination of these atoms. In interpretation this means, that the dictionary atoms entail the main building blocks, or modules of the analysed data and the sparse coding can be interpreted as a data compression.
Dictionary learning (DiL) and compressed sensing (CS) are data compression methods for which the problem of sparsity per sample is solved, meaning that the number of selected modules per sample is always ≤ s (where s is a defined integer, describing the level of sparsity). They have found wide application in image compression. However, for the application to biological data we found only few examples. In 2002 Prat et al. [18] applied CS to microarray data to recover a support set of genes for each gene in a genome. In 2014 Emad et al. [7] applied CS to Omics data to discover directed interactions in gene networks. In 2016 Khormuji and Khormuji [11] used a novel sparse coding algorithm for classification of tumors based on gene expression data. In 2017 Clearly et al. [3] presented a method based on compressed sensing, to generate a high dimensional transcriptomic profile from sequencing a small, random selection of genes. In their review on low rank representation and its application in bioinformatics You et al. [24] concluded that researchers need to make full use of the structure of problems. In her thesis from 2019 [12], Koletou applied nested dictionary learning on different types of Omics data from prostate cancer patients. She developed nested dictionary learning, describing an iterative application of dictionary learning: In the first step, the dictionary is learned from the data and further dictionaries are learned from the dictionary in the pre-descending step. Reason for this nested approach is the aim of a drastic dimensional decrease as the first dictionary from the data still has 10 % of the initial dimension (in the number of genes).
In this study we evaluate whether a variant of DiL can be used to separate gene expression data of different types and whether the atoms of the dictionary are biologically reasonable. Further, we analyse whether dictionary learning can be used to classify RNA-seq data with respect to type, e. g. tissue type, in the lower dimension. In particular we evaluate the use of dictionary learning for analysis of RNA-seq data from multiple different types and the influence of parameters in this analysis. Thereto, we focus on the following topics: 1. How do the parameters for dictionary learning influence results? (e. g. number of atoms and sparsity) 2. How well can the found gene-sets separate the types? 3. Does the biological evaluation of the dictionary matrix (using GO-terms) agree with the types?
We start by analysing simulated datasets which are designed to contain (expression) patterns similar to biological datasets. Based on results from the simulation study we can decrease the parameter search space for real world datasets. When analysing real world datasets we find that sample clustering of the sparse codings results in high overlap of the true type-groups. Further, a GO-term analysis of the identified modules is in agreement with the biological context of the sample types.
The analysed dataset in focus is the GTEx RNAseq with more than 50k reads and 26 tissues with 96 samples per tissue. Further, results for two tissue RNAseq datasets from GEO (GSE120795) and Expression Atlas (E-Mtab-2836), one single cell dataset from GEO (GSE112004), and one simulated dataset are evaluated.

Identification of gene modules
from RNA-seq data using a variant of dictionary learning

Dictionary learning
The key idea in dictionary learning (DiL) is that a variety of data can be well constructed from a linear combination of few modules (atoms) of some "basis" (dictionary) given that the data possesses a sparse structure.
Given an input dataset X = [x 1 , ..., x p ], x i ∈ ℝ d , and a positive integer m ∈ ℤ >0 , the number of columns the dictionary will have, and s ∈ ℤ >0 , the sparsity level per sample, in the dictionary learning problem we wish to find a matrix D ∈ ℝ d×m and s-sparse vectors r 1 , ..., r p ∈ ℝ m , so called sparse codings, such that Dr i ≈ x i ∀i. D is called the dictionary and its columns are referred to as atoms. The vectors r i , with at most s non-zero entries, contain the coefficients for the columns of D to linearly represent x i . To make the choices of D and r i unique, the constraint ‖d i ‖ 2 = 1 is added. The sparsity level, s, often satisfies s ≪ m. This can be formulated as: where D ∈ ℝ d×m : D = [d 1 , ..., d m ] and R = [r 1 , ..., r p ], r i ∈ ℝ m . In the standard setting the dictionary is an overcomplete system, meaning m > d.
The optimization problem in equation (1) is nonconvex, which means that only sub-optimal solutions can be found in polynomial time. To improve the computational feasibility and efficiency, (1) is usually reformulated using the L1-norm:

DiL for RNA-seq data
When applying DiL to RNA-seq data, according to the standard DiL setting the orientation of the data X would be X ∈ ℝ p×g , where g is the number of genes and p is the number of samples. As typically g >> p, the dictionary is overcomplete only for this data orientation (compare Figure 1). If X ∈ ℝ p×g , then D ∈ ℝ p×m . This, however, would mean that the dictionary does not entail information about the genes. In fact, this information would be in the sparse vectors, while at the same time each samples would be represented by many (up to m) sparse gene vectors which would only slightly fulfil the aim of dimension reduction.
We suggest to apply DiL to X, for X ∈ ℝ g×p . If X ∈ ℝ g×p , then D ∈ ℝ g×m . Only in this setting the dictionary entails information about the genes and each sample is represented by few (up to s) atoms. This does, however, mean that the dictionary is not overcomplete as D = X ∈ ℝ g×p would be an optimal solution to (2) already. A further increase of m would not lead to better solutions. Therefore, Figure 1: Visualisation of the two optional orientations of the data X for DiL and the resulting orientations of the matrices D, the dictionary, and R, the sparse codings. At the top X ∈ ℝ p×g and D is overcomplete. At the bottom X ∈ ℝ p×g (p, number of samples; g, number of genes) and D is not overcomplete. The overcomplete dictionary requires a long computational time, whereas runtime for our method is a lot shorter.
we neglect the overcompleteness while still using the formulation in (2). This means that we are already enforcing sparsity by the selection of m, as this will describe the number of dimensions we are compressing the data to and m < p in our variant of DiL.

Implementation and complexity analysis
We use DictionaryLearnig [15] for small datasets (up to a size of 100 MB) and MiniBatchDictionaryLearnig [16] for large datasets. The sparse codings (r i ) are computed using orthogonal matching pursuit (OMP) [17] for which parallel and GPU versions are available [8]. A detailed complexity analysis of OMP can be found in [19]. To summarize, the overall algorithmic complexity is: Recall that p is the number of samples (varying between different datasets), and g is the number of genes (varying between different species).
However, for the case of using an overcomplete dictionary the runtime would be: O g * (2pm + s 2 m + 7sm + s 3 + 4sp) + 5pm 2 As in RNA-seq experiments typically g >> p, this would require a highly increased runtime.

Results
Recall, that the dictionary atoms can be described as the basic elements of the data. When applied to RNA-seq data we interpret the atoms as specific genes sets that are mutually activated/deactivated in a large number of samples. The sparse codings entail information of how to reconstruct each sample (with small errors) by a linear combination of a subset of the atoms.
In this section we evaluate applications of our DiL based approach to different datasets to assess its relevance for RNA-seq data analysis. In general, if our approach is suitable for RNA-seq data, for one thing, main differences in the data should be preserved in the compressed representation, hence the sparse codings. To analyse if this is the case, we apply our approach to datasets with samples from different types. This allows to evaluate whether the respective types are represented differently in the sparse codings. Types can be anything referring to different classes of samples such as phenotype, stimulation in an experiment or donor. At the same time, the compression should be done in a biologically reasonable way, which we assess by an evaluation of the dictionary atoms from the real world datasets.
We apply our approach to simulated and real world datasets. The simulated datasets are constructed to have five different sample types. Five different simulation strategies are performed, which vary in construction of groups and noise. The four real world datasets each contain data from multiple types and with multiple samples per type. One of the datasets stems from a single cell experiment, the others are bulk experiments. Table 1 shows an overview of the four real world datasets. Details on simulated and real world data are given in section 5.
For each dataset we compute dictionaries of different sizes and reconstructions for different sparsity levels. We evaluate how the parameters for dictionary learning (e. g. number of atoms and sparsity) influence the results, how well the found gene-sets separate the types, and whether the biological evaluation of the genes corresponding to the atoms agrees with the types.
Results from the dictionary learning are (1) the dictionary matrix and (2) a sparse coding vector for each sample. To evaluate the sparse coding, we use clustering and analyse whether the clusters have a high agreement with the sample types. For the RNA-seq data, the dictionary is evaluated by a Gene Ontology (GO) term analysis [5] (of the genes corresponding to the high values in the dictionary matrix). Gene Ontology assigns genes to a) biochemical activities, b) biological goals and c) cellular structures thereby helping to understand functions of the genes and their products. We use the GO term analysis to evaluate whether these genes can be associated with the sample type.

Evaluation of results
We apply DiL to datasets from multiple types and with multiple samples per type. The respective sparse codings should reveal type specific characteristics. To evaluate whether this is the case, the sparse codings are clustered and the resulting clusters are compared with the true typegroups. The clustering is performed with k-means [14] (implementation from Python's sklearn) with k = n (where n is the number of types of the dataset). To evaluate the clustering we compute the adjusted rand index (ARI) and the adjusted mutual information (AMI) of the k-means clusters and the true type-groups.
The ARI [10] is based on the Rand index, which is defined as the number of pairs of elements that are either in the same group or in different groups in two partitions divided by the total number of pairs of elements. A problem with the Rand index, which is corrected in the ARI, is that the expected value of the Rand index between two random partitions is not constant. The maximum ARI is 1 and its expected value is 0 for random clusters.
The mutual information measures how much knowing one clustering reduces uncertainty about the other clustering. Similarly as for the Rand Index and the ARI, the AMI [22] is an adjustment of the mutual information to account for chance.
Additionally, the reconstruction error is measured as the Euclidean distance of the data matrix and the matrix resulting from D * R (compare (1), (2)).

How do the parameters for dictionary learning influence results?
The dictionary learning requires two parameters: The number of atoms of the dictionary, m, and the sparsity, s. In this section multiple parameter combinations are tested to evaluate how they influence the results.

Simulated data
For the simulated datasets the algorithm is run for dictionaries with m = [1, 2, 3, 4, 5, 10, 20]. Recall, that the simulated data is constructed to have n = 5 types and five different simulation strategies are performed, which vary in construction of groups and noise. As our method is new a new method these parameters are selected based on a first grid search over a wide parameter range. In the selected range (1 ≤ m ≤ 20) we see the most relevant changes in ARI, AMI and reconstruction error. Results are visualised in Figure 2. Construction of the simulated data is explained in the Data section. For all simulated datasets the clustering is in entire agreement with the types for a dictionary with 5 atoms no matter how sparsely the data is represented. Further, all clustering scores are 1 for settings without noise (A, B, C) whenever the dictionary has 4 or more atoms. For simulation setting C, the median clustering scores are 1 for a all dictionaries (1, 2, 5, and 10 atoms). Setting E is the only one for which the clustering scores are not 1 for a dictionary with 4 atoms. In the settings with noise (D, E), the clustering scores decrease when the number of atoms exceeds 5, whereas in the settings without noise clustering scores remain 1 for any number of atoms ≥ 4. An explanation could be that the additional dictionary atoms in settings D and E are trying to explain the noise which is independent from the data types and thus leads to a worse clustering result.
The median reconstruction error is minimal among all evaluated dictionaries for m ≥ 5. In all settings without noise the minimal median reconstruction error is zero, whereas in the ones with noise there appears to be information that cannot be explained by the dictionary for m ≤ 20 and the reconstruction error remains > 0.
Note that the ordering of "high genes", which are sorted for each group to appear in blocks in our visualisations (Figure 8), does not influence results.
Summarising the simulation results, it shows that the selection of the number of dictionary atoms has a big influence on the results, especially for noisy data. In this small example it seems to be a good choice to set the number of dictionary atoms equal to the number of groups in the dataset.

Real world data
Experiments for the real world datasets, are run for dictionaries with m ∈ [1, 2, ..., 2n], where n is the number of types in each dataset and s ∈ [1, 2, ..., m] for each dictionary. The choice of values of m is based on the results from the simulated data (where m = n showed to be a good choice and we would like to analyse the surrounding parameter space as well).
As behaviour of ARI and AMI are similar in our experiments (see Figure 3 for an example for dataset D1), we focus on the ARI in the detailed parameter evaluation and conclusions can be transferred to AMI.
For all RNA-seq datasets, we find that in general the ARI increases for increasing number of atoms up to a certain value where m is close to n (see Figure 4). Considering the sparsity, for many dictionaries the ARI first increases when s is increased and stays close to some ARI for larger values of s. ARIs for D2 are worse than for the other three datasets. Dataset D2, however, showed to have many samples with a strikingly high number of zero values. Discarding samples with zero-entries per sample of more than 75 %, yields significantly better results with a maximum ARI of 0.80. For dataset D1, we perform evaluations for both k = 26 (n = 26 with tissue labels as general types) and k = 48 (n = 48 with tissue labels as detailed types) as true labels.

How well can the found gene-sets separate the types?
To find the best parameter combination for the biological evaluation, we suggest to perform a grid search with m ∈ [0.5n, 1.5n] and for each dictionary s = [1, 2, ..., m] and select the parameters corresponding to the best ARI. In all experiments we conducted the best ARI of a wide range of parameters lay in this interval. We evaluate to what extent the resulting clusters overlap with the type groups and which types are grouped, respectively split. Figure 5 shows a visualisation of the sparse codings of all samples in each dataset. This visual representa-tion already reveals that the different types have different sparse coding values. This is especially the case for dataset D2, where each type is represented by one atom. For the other datasets these differences are not as clear, but still visible. This is in agreement with the ARI, which is 1 for dataset D2 and between 0.56 and 0.77 for the other datasets.

D1: GTEx
For dataset D1 the maximal ARI for the general types of 0.77 is reached for m = 42, s = 39 (see Figure 4). Further clustering scores for m = 42, s = 39 are: ARI detailed = 0.66, AMI general = 0.87, AMI detailed = 0.81. In the resulting clusters (compare Figure 6), -13 of 26 clusters have samples of one tissue type only and -in 18/24/25 clusters more than 90 % of samples are from one/two/three tissue type(s).
When the detailed labels are chosen as the tissue types with subtypes, -10 of 26 clusters have samples of one tissue type only and -in 13/21/24 clusters more than 90 % of samples are from one/two/three tissue type(s).
Considering the detailed tissue types, some are separated as for example tissues belonging to either cerebrum or cerebellum (in clusters 15, respectively 21). This, however, is a reasonable separation and we are happy to find that our algorithm detects differences in these groups. Note, that this decreases the ARI when the general type labels are used -as all brain tissues have the same type labeleven though this is actually a correct finding and it shows that the methods is capable of detecting subgroups. For the detailed types, the maximum ARI of 0.73 is reached for 39/32, 42/26, 47/34 (m/s).

D2: GEO GSE120795
For dataset D2 the maximal ARI of 0.56 is reached for m = 7, s = 2 (AMI = 0.67). Five clusters of the sparse codings are comprised of one tissue only, whereas clusters 2,4, and 7 are a mix of multiple tissues (compare Figure 10). We do not have detailed information for the brain tissue explaining which samples are grey/white matter, but it is conceivable that the separation of the brain samples into two clusters could be explained by that.
Pie charts with an additional preprocessing step regarding the number of zero-entries can be found in the Appendix.

D3: Expression Atlas E-MTAB-2836
For dataset D3 the maximal ARI is 1.00 (reached for m = 7, s = 1), which means that the clusters of the sparse codings are in entire agreement with the types (compare Figure 11).
The three treatment types are separated, wrong assignments appear only for times within one treatment class (compare Figure 12).

Summary of parameter evaluation for the real world datasets
It shows that there is a wide range of parameters for which the variation of ARI is small. E. g. setting m = s = n results in an ARI that is larger than 0.8 * ARI best for all datasets. A small grid search around m = s = n, e. g. m ∈ all integer values ∈ [0.5n, 1.5n], s ∈ all integer values ∈ [0.5m, 1.5m], improves results further.

Restriction to positive dictionary entries
In experiments with an additional constraint allowing for positive dictionary entries only, similar maximum ARIs (±0.01) are reached, however, the number of atoms is often higher: 47 atoms for D1 and 8 atoms for D2, D3, and D4.

Does the biological evaluation of the dictionary matrix agree with the types?
Let v pos represent the sorted absolute values of D and v neg the sorted absolute negative values of D. To select a set of genes for each group, only the largest absolute values of v pos and v neg are evaluated (compare [13]): All values in the dictionary matrix that are in the interval ]1st − percentile, 99th − percentile[ are set to zero. The resulting dictionary will be referred to as 2%Dictionary. Note that this most likely leads to atoms that have different amount of non-zero values.
To assign a subset of atoms to each type, we compute a value measuring the atom-preference among all samples for each type: -From all samples' sparse codings, we compute the mean value for each atom. For the biological analysis, for each type, we evaluate genes corresponding to those atoms with top three absolute atom-preferences. A GO-term analysis of these genes is performed. Only GO-terms with a p-value ≤ 10 −5 are considered. Each atom is analysed separately and the genes with positive, respectively negative values are analysed separately as well.
In the 2%Dictionary, the number of non-zero entries per atom varies between 127 and 1841 (out of 55,091) for . The x-axis represents the samples ordered by type and the atoms are shown on the y-axis. The black lines indicate the border between each two types. This means that each coloured horizontal stripe reflects the values of the sparse coding for one atom for one sample. This way, multiple dimensions can be visualised -and therefore compared among different types. What has been confirmed by the ARIs -that for each type (most of the) sample types have a similar and unique representation -is confirmed by this visualisation. dataset D1; Between 560 and 1601 (out of 53,679) for D2; Between 131 and 1331 (out of 49,914) for D3; Between 152 and 329 (out of 11,781) for D4. For many atoms the GO-terms entail terms that can be associated with the corresponding tissue (see Figure 7).

Use case runtime evaluation
The experiments were carried out on a machine with Intel(R) Core(TM) i5-8500 processors and 16 GB RAM. The runtime scales with a) dataset size and b) m, the number of atoms. Depending on the parameter m the runtime for the smaller datasets varies between 0.4 and 1.4 seconds for m = 1, respectively m = 20; For the larger dataset D1 the runtime varies between 3.15 and 46.45 seconds for the same parameters (see Table 2).

Data
Evaluations of our algorithm are performed on different datasets: (1) Simulated data and (2) four real world dataset from varying sources. Details on the respective datasets are given in the following paragraphs.

Simulated data
The construction of simulated data is inspired by biological data. We simulate a matrix of 50 samples and 100 reads. The data is simulated to have 5 groups of 10 samples each.
In the simulation, for each group we differentiate between a subset of background genes and up to 20 high genes. The subset-size and the meaning of "high"/"low" is differently defined in each simulation setting. The values for background genes and high genes are simulated separately. In two of the five settings a subset of the background values are high in all samples to simulate housekeeping genes. For a detailed explanation of the simulation settings see Table 3 and Figure 8. For each setting but the first (because the first does not require a random drawing) we simulate 100 datasets.

Real world data
We analyse four RNA-seq datasets (see Table 4), with samples from multiple types (see Table 1). The number of samples per type varies within each dataset. To obtain datasets with the same number of samples per type, for each type, samples are randomly selected such that the number of samples per tissue type is the same for all tissues and maximal given the data.
Each sample is labelled with a detailed tissue description (e. g. Adipose -subcutaneous, Adipose -Visceral, Brain -Amygdala,...), as well as a more general one (e. g. Adipose -Subcutaneous and Adipose -Visceral as type Figure 7: Wordclouds of the GO-terms with p-value ≤ 10−5 for a selection of 3 atoms for each tissue dataset. Dataset and tissue name with highest atom-preference for the respective atom is given in the subtitles. Only tissues for which the respective atom is added are shown (referring to positive dictionary entries and sparse coding values, or both negative). For many atoms the GO-terms entail terms that can be associated with the corresponding tissue.   Adipose). We evaluate the dictionary learning for the general grouping and use the detailed description in the result analysis only. Data is present for 53 detailed tissue types with 5-564 samples per type, respectively 31 grouped tissue types with 7-1671 samples per type.
The minimum number of samples per tissue is set to 50.
Outlier removal and selection of the same number of samples per type results in a 2,392 × 55,091 matrix (2,392 = 26types * 92samples_each).
The minimum number of samples per tissue is set to 8.
Outlier removal and selection of the same number of samples per type results in a 64 × 53,679 (64 = 8types * 8samples_each) matrix.
Note, that some samples have zero-values for most of the reads and only some high values. These samples could easily be detected by a limitation to the number of zero-values per sample (e. g. no more than 75 % zeroentries).
The minimum number of samples per tissue is set to 8.
Outlier removal and selection of the same number of samples per type results in a 64 × 49,914 (64 = 8types * 8samples_each) matrix.

D4: GEO GSE112004
The GEO dataset GSE112004 is a single cell dataset and contains read counts for 3,648 samples and 11,841 genes (already mapped). Samples stem from mice CD19+ B cells: untreated, treated to trans-differentiate to macrophages, and treated to trans-differentiate to induced pluripotent stem cells (3 types). For each type, measurements at three time points are available (considered as 9 types in total).
The minimum number of samples per tissue is set to 32.
Outlier removal and selection of the same number of samples per type results in a 288 × 11,781 (288 = 9types * 32samples_each) matrix.

Normalisation
In [3] Clearly et al. suggest to normalise the data by a removal of genes for which the sum of counts is > 99.5th − percentil to "avoid performance statistics that are skewed by few genes with extremely high expression". We adapt this normalisation and perform the same normalisation for the samples. Additionally, to avoid for a bias of different experiments, each sample is normalised by division through the sum of all counts for this sample. Subsequently, to provide numerical stability and allow greater interpretability of the dictionary entries, variables are centred to zero and scaled to have a standard deviation of one.

Discussion and conclusion
In this study, we present our results for a new version of dictionary learning. We apply the method to simulated and real world data. We evaluate both, the relation of sample labels and clusters of the sparse codings, as well as the biological relevance of the gene-sets represented by the dictionary atoms. Clustering the sparse codings of the real world datasets showed that the dictionary keeps relevant properties of the data. Further, the biological relevance of the gene-sets detected by the DiL approach was confirmed by a GO-term analysis, which revealed the large potential of dictionary learning in RNA-seq analysis to extract type specific gene-sets. We performed a large study on four datasets, showing that results are not unique to one dataset.
In contrast to [3] the focus of our analysis is not to find a dictionary such that a high-dimensional transcriptomic profile can be generated from sequencing a small, random selection of genes. Rather, we aim at 1.) detecting relevant genes-sets and 2.) thereby confirming that the dictionary keeps relevant properties of the data -this in mind, the sparse codings can be used for further analysis.
In contrast to [12] we perform only one step of dictionary learning which is a lot faster than the standard dictionary learning and again several times faster than the nested dictionary learning presented in [12].
We have only evaluated clusterings with the number of clusters, k, equal to the number of types given in the data, n. In some cases, we found that one type was split whereas others were found in one cluster, which results in an ARI < 1. We found that separation of some types was in agreement with subtypes in some cases. Assuming that differences between these subtypes is bigger than among other general types, this leads to those general types being grouped in one cluster. We have to keep in mind, that by setting k = n, we rely on the correctness of the data annotation. Evaluating higher values of k might resolve this issue, such that the types that appear in one clustered for k = n will then be in separate clusters.
The DiL method requires two parameters, the number of atoms of the dictionary, m, and the level of sparsity, s. Our experiments revealed that for every dataset evaluated there was a large range of parameters for which the variation of ARI, respectively AMI is small. E. g. setting m = s = n, hence only evaluating one parameter setting, results in an ARI that is larger than 0.8 * ARI best . A small grid search around m = s = n improves results further.
This paper is focused on the method and its analysis on multiple RNA-seq datasets as well as a detailed analysis of the results. We have only performed very limited normalisation of the data. A more extensive normalisation might improve results even further. This becomes visible in the analysis of dataset D2, where one simple additional normalisation step improved the ARI from 0.56 to 0.8.
By thresholding we filtered the dictionary matrix to receive the 2%Dictionary and used this to select the genesets. This could be improved by adding a constraint in the dictionary learning process, such that the dictionary itself will be sparse and thresholding would no longer be required.
Other related methods, such as NMF and SVD do not enforce sparsity per sample. DiL, however, fulfils this criterion. With the results of this paper in mind, showing that the dictionary keeps relevant properties of the data, one could use the dictionary for the representation of states (e. g. of a single cell), and study its e. g. time-dependent behaviour in the dictionary space. Application of dictionary learning on dataset D4 shows that results for single cell data are similarly good as for pooled cells.
In conclusion, our study demonstrates that our dictionary learning based approach is able to detect biologically relevant modules of genes specific to various types, as well as to represent RNA-seq data in lower dimension.
Funding: This study was supported by the Federal Ministry of Education and Research of Germany, grant 3FO18501 (Forschungscampus MODAL) and grant 05M16KEA (IBOSS) to TC and MR. The funders did not have any additional role in the study design, data collec-tion and analysis, decision to publish, or preparation of the manuscript.

Appendix
The following figures show results of the analysis of datasets D1-D4. Figure 9: Shown are results for dataset D1 for the maximal ARI measured (dictionary with 37 atoms and a 32-sparse coding). (a) Shown are the detailed tissue types in the clusters. Many clusters gather subtypes of one tissue (compare Figure 6), which is what we aim for since the number of clusters is set equal to the number of general tissue types. Interestingly, some general tissue types are separated into few cluster, such as brain in cluster 11 and 19. These cluster separate the cerebellum from the cerebrum. (c, d) Shown are the clusters for each tissue.