Abstract
Extracting information from large biological datasets is a challenging task, due to the large data size, high dimensionality, noise, and errors in the data. Gene expression data contains information about which gene products have been formed by a cell, thus representing which genes have been read to activate a particular biological process. Understanding which of these gene products relate to which processes can, for example, give insights into how diseases evolve and hints about how to fight them.
The Next Generation RNA-sequencing method emerged over a decade ago and is nowadays state-of-the-art in the field of gene expression analyses. However, analyzing these large, complex datasets is still a challenging task. Many of the existing methods do not take into account the underlying structure of the data.
In this paper, we present a new approach for RNA-sequencing data analysis based on dictionary learning. Dictionary learning is a sparsity enforcing method that has been widely used in many fields, such as image processing, pattern classification, signal denoising and more. We show how, for RNA-sequencing data, the atoms in the dictionary matrix can be interpreted as modules of genes that either capture patterns specific to different types or represent modules that are reused across different scenarios. We evaluate our approach on four large datasets with samples from multiple types. A Gene Ontology term analysis, a standard tool for characterising the functions of genes, shows that the identified gene-sets are in agreement with the biological context of the sample types. Further, we find that the sparse representations of samples using the dictionary can be used to identify type-specific differences.
1 Introduction
Next generation RNA-sequencing (RNA-seq) uses high throughput approaches to enable genome and transcriptome profiling. Having emerged over a decade ago, it is increasingly replacing the use of microarrays in the field of gene expression analyses and is thereby becoming the method of choice in this discipline. Reasons are technological advances, such as increased genome coverage, reduced sequencing costs and less background noise.
Applications of RNA-seq data include, for instance, feature selection, identification of alternative splicing, and detection of differentially expressed pathways. A challenge in the analysis of RNA-seq data is its high dimensionality. This brings along the curse of dimensionality: good performance on training data can usually be achieved, but reproducibility is often a problem, as small changes in the data can lead to different results. Therefore, the biological relevance of such results needs to be questioned. To address this issue, methods enforcing sparsity were introduced. In sparse methods, coefficients of variables related to noise or uninformative variables are set to zero. Well-known representatives are the Lasso [21], the elastic net [25], and sparse random forests [9].
Even though gene expression data is highly structured [1], [2], [18], [20], many methods operate independently of this underlying structure. In this paper, we carry out an analysis of RNA-seq data with a variant of dictionary learning (DiL).
DiL is a constrained matrix factorisation approach that compresses a dataset while preserving most of its variation in the lower-dimensional space. It is especially interesting for the analysis of big data, as it allows the data to be analysed in a compressed space. Once the data is compressed, the runtime of an analysis is significantly smaller than for an analysis in the original space. At the same time, unlike other data compression methods, DiL allows the new dimensions to be interpreted. Further, the variant we present in this paper requires significantly less runtime than the original DiL setting (the precise runtime benefit depends on the size of the dataset and the DiL parameters).
DiL has been widely used in many fields, such as image processing, pattern classification, and signal denoising [4], [23], [6]. The basic idea of DiL is that a dataset can be well reconstructed from a linear combination of few modules (atoms) of some “basis” (dictionary), given that the data possesses some structure that can be represented sparsely. To our knowledge, however, its application to omics data is rare.
When applying DiL to RNA-seq data from multiple sample types, the atoms can be described as the basic elements of the data (e. g. specific gene sets that are mutually activated/deactivated) that capture patterns of the analysed data. Each sample can then be reconstructed (with small errors) by a linear combination of these atoms. In other words, the dictionary atoms entail the main building blocks, or modules, of the analysed data, and the sparse coding can be interpreted as a data compression.
Dictionary learning (DiL) and compressed sensing (CS) are data compression methods for which the problem of sparsity per sample is solved, meaning that the number of selected modules per sample is always bounded by the chosen sparsity level.
In this study we evaluate whether a variant of DiL can be used to separate gene expression data of different types and whether the atoms of the dictionary are biologically reasonable. Further, we analyse whether dictionary learning can be used to classify RNA-seq data with respect to type, e. g. tissue type, in the lower dimension. In particular we evaluate the use of dictionary learning for the analysis of RNA-seq data from multiple different types and the influence of parameters in this analysis. To this end, we focus on the following topics:
How do the parameters for dictionary learning influence results? (e. g. number of atoms and sparsity)
How well can the found gene-sets separate the types?
Does the biological evaluation of the dictionary matrix (using GO-terms) agree with the types?
When analysing real world datasets we find that sample clustering of the sparse codings results in a high overlap with the true type-groups. Further, a GO-term analysis of the identified modules is in agreement with the biological context of the sample types.
The analysed dataset in focus is the GTEx RNA-seq dataset with more than 2,000 samples from 26 tissue types (see section 5).
2 Identification of gene modules from RNA-seq data using a variant of dictionary learning
2.1 Dictionary learning
The key idea in dictionary learning (DiL) is that a variety of data can be well constructed from a linear combination of few modules (atoms) of some “basis” (dictionary) given that the data possesses a sparse structure.
Given an input dataset X, DiL aims to find a dictionary matrix D, whose columns are called atoms, and a sparse coding matrix R such that X ≈ DR. This is commonly formulated as

$$\min_{D,R} \; \|X - DR\|_F^2 \quad \text{subject to} \quad \|r_i\|_0 \le s \;\; \text{for all } i, \qquad (1)$$

where m denotes the number of atoms (the columns of D), r_i the i-th column of R, ‖·‖_0 the number of non-zero entries, and s the sparsity level, i.e. the maximal number of atoms used to represent a single data point.
The optimization problem in equation (1) is non-convex, which means that only sub-optimal solutions can be found in polynomial time. To improve the computational feasibility and efficiency, (1) is usually reformulated using the L1-norm:

$$\min_{D,R} \; \|X - DR\|_F^2 + \lambda \sum_i \|r_i\|_1, \qquad (2)$$

where λ > 0 controls the trade-off between reconstruction quality and sparsity.
2.2 DiL for RNA-seq data
When applying DiL to RNA-seq data, according to the standard DiL setting the orientation of the data X would be

Figure 1
Visualisation of the two optional orientations of the data X for DiL and the resulting orientations of the matrices D, the dictionary, and R, the sparse codings. At the top
If
We suggest to apply DiL to X, for
2.3 Implementation and complexity analysis
We use DictionaryLearning [15] for small datasets (up to a size of 100 MB) and MiniBatchDictionaryLearning [16] for large datasets. The sparse codings (R) are computed with orthogonal matching pursuit [17], [19].
Recall that p is the number of samples (varying between different datasets), and g is the number of genes (varying between different species).
However, for the case of using an overcomplete dictionary the runtime would be:
As in RNA-seq experiments the number of genes g typically greatly exceeds the number of samples p, an overcomplete dictionary would be computationally prohibitive, whereas the variant with a small number of atoms (m ≪ g) used here keeps the runtime low.
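To make the setting above concrete, the following minimal sketch shows how a dictionary with m atoms and s-sparse codings could be computed with scikit-learn; the matrix sizes, the parameter values, and the use of OMP for the transform are illustrative assumptions, not necessarily the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# X: expression matrix with p samples (rows) and g genes (columns).
# Placeholder random data here; replace with a real, preprocessed RNA-seq matrix.
rng = np.random.RandomState(0)
X = rng.rand(64, 2000)

m = 8   # number of dictionary atoms (e.g. close to the number of types)
s = 4   # sparsity: number of non-zero coefficients per sample

# For large datasets (e.g. GTEx), sklearn.decomposition.MiniBatchDictionaryLearning
# can be used with essentially the same interface.
dil = DictionaryLearning(
    n_components=m,
    transform_algorithm="omp",        # sparse codings via orthogonal matching pursuit
    transform_n_nonzero_coefs=s,
    random_state=0,
)
R = dil.fit_transform(X)   # sparse codings, shape (p, m): one row per sample
D = dil.components_        # dictionary, shape (m, g): each row is one atom / gene module

# Reconstruction and its error (Frobenius norm)
print(R.shape, D.shape, np.linalg.norm(X - R @ D))
```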
3 Results
Recall that the dictionary atoms can be described as the basic elements of the data. When applied to RNA-seq data, we interpret the atoms as specific gene sets that are mutually activated/deactivated in a large number of samples. The sparse codings entail the information of how to reconstruct each sample (with small errors) by a linear combination of a subset of the atoms.
In this section we evaluate applications of our DiL-based approach to different datasets to assess its relevance for RNA-seq data analysis. If our approach is suitable for RNA-seq data, the main differences in the data should be preserved in the compressed representation, i.e. in the sparse codings. To analyse whether this is the case, we apply our approach to datasets with samples from different types. This allows us to evaluate whether the respective types are represented differently in the sparse codings. Types can be anything referring to different classes of samples, such as phenotype, stimulation in an experiment, or donor. At the same time, the compression should be done in a biologically reasonable way, which we assess by an evaluation of the dictionary atoms from the real world datasets.
We apply our approach to simulated and real world datasets. The simulated datasets are constructed to have five different sample types. Five different simulation strategies are performed, which vary in construction of groups and noise. The four real world datasets each contain data from multiple types and with multiple samples per type. One of the datasets stems from a single cell experiment, the others are bulk experiments. Table 1 shows an overview of the four real world datasets. Details on simulated and real world data are given in section 5.
Table 1
Overview of the four real world datasets analysed in this study. Dataset D4 is a single cell dataset. Datasets D1 and D4 contain a larger number of samples than D2 and D3.
ID | Database | Types | Samples per type | Reads/Genes | Type class |
D1 | GTEX | 26 | 92 | 55,091 | Tissues |
D2 | GEO | 8 | 8 | 53,679 | Tissues |
D3 | Expression Atlas | 8 | 8 | 49,914 | Tissues |
D4 | GEO | 9 | 32 | 11,781 | B-cell type & experiment time |
For each dataset we compute dictionaries of different sizes and reconstructions for different sparsity levels. We evaluate how the parameters for dictionary learning (e. g. number of atoms and sparsity) influence the results, how well the found gene-sets separate the types, and whether the biological evaluation of the genes corresponding to the atoms agrees with the types.
Results from the dictionary learning are (1) the dictionary matrix and (2) a sparse coding vector for each sample. To evaluate the sparse coding, we use clustering and analyse whether the clusters have a high agreement with the sample types. For the RNA-seq data, the dictionary is evaluated by a Gene Ontology (GO) term analysis [5] of the genes corresponding to the high values in the dictionary matrix. Gene Ontology assigns genes to a) biochemical activities, b) biological goals, and c) cellular structures, thereby helping to understand the functions of genes and their products. We use the GO term analysis to evaluate whether these genes can be associated with the sample type.
3.1 Evaluation of results
We apply DiL to datasets from multiple types and with multiple samples per type. The respective sparse codings should reveal type-specific characteristics. To evaluate whether this is the case, the sparse codings are clustered and the resulting clusters are compared with the true type-groups. The clustering is performed with k-means [14] (implementation from Python’s sklearn) with the number of clusters, k, set equal to the number of types in the data, n. The agreement between clusters and type-groups is measured with the adjusted Rand index (ARI) and the adjusted mutual information (AMI).
The ARI [10] is based on the Rand index, which is defined as the number of pairs of elements that are either in the same group or in different groups in two partitions divided by the total number of pairs of elements. A problem with the Rand index, which is corrected in the ARI, is that the expected value of the Rand index between two random partitions is not constant. The maximum ARI is 1 and its expected value is 0 for random clusters.
The mutual information measures how much knowing one clustering reduces uncertainty about the other clustering. Similarly to the Rand index and the ARI, the AMI [22] is an adjustment of the mutual information to account for chance.
Additionally, the reconstruction error is measured as the Euclidean distance between the data matrix and the matrix resulting from multiplying the sparse codings with the dictionary.
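As an illustration of this evaluation, the sketch below clusters the sparse codings with k-means and computes ARI, AMI, and the reconstruction error; the function name and arguments are our own (hypothetical), and R and D are assumed to come from a fit as in the sketch of section 2.3.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

def evaluate_sparse_codings(R, D, X, true_types, n_types, seed=0):
    """Cluster the sparse codings and compare the clusters with the known types."""
    # k-means with k equal to the number of types in the data
    labels = KMeans(n_clusters=n_types, random_state=seed).fit_predict(R)

    ari = adjusted_rand_score(true_types, labels)         # chance-corrected Rand index
    ami = adjusted_mutual_info_score(true_types, labels)  # chance-corrected mutual information

    # Reconstruction error: Euclidean (Frobenius) distance between X and its reconstruction
    rec_err = np.linalg.norm(X - R @ D)
    return ari, ami, rec_err
```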
3.2 How do the parameters for dictionary learning influence results?
The dictionary learning requires two parameters: The number of atoms of the dictionary, m, and the sparsity, s. In this section multiple parameter combinations are tested to evaluate how they influence the results.
3.2.1 Simulated data
For the simulated datasets the algorithm is run for dictionaries with varying numbers of atoms and for 1-, 2-, 5-, and 10-sparse codings (see Figure 2).

Figure 2
The plots show evaluations for dictionary learning of the simulated data with several parameters. The data is simulated to consist of five groups (precise simulation setting is given by capital letters, compare section 5.1). Shown are the reconstruction error (blue), and ARI (green), respectively AMI (purple) for the k-means clusters of the sparse codings. The x-axis displays the number of atoms of the dictionaries. The lines and points display the median value in 100 simulations, the shadows display the 25th and 75th percentile. In each column results for 1, 2, 5, respectively 10-sparse coding are visualised (see bottom for respective sparsity). In all settings without noise (A, B, C) the minimal median reconstruction error is zero. Further, in all settings the clustering scores are 1 for a dictionary with 5 atoms no matter how sparsely the data is represented. In the settings with noise (D, E), the clustering scores decrease when the number of atoms exceeds 5, whereas in the settings without noise the clustering scores stay 1 for any number of atoms ≥4.
For all simulated datasets the clustering is in entire agreement with the types for a dictionary with 5 atoms, no matter how sparsely the data is represented. Further, all clustering scores are 1 for the settings without noise (A, B, C) whenever the dictionary has 4 or more atoms. For simulation setting C, the median clustering scores are 1 for all dictionaries (1, 2, 5, and 10 atoms). Setting E is the only one for which the clustering scores are not 1 for a dictionary with 4 atoms. In the settings with noise (D, E), the clustering scores decrease when the number of atoms exceeds 5, whereas in the settings without noise the clustering scores remain 1 for any number of atoms ≥4. An explanation could be that the additional dictionary atoms in settings D and E try to explain the noise, which is independent of the data types and thus leads to a worse clustering result.
The median reconstruction error is minimal among all evaluated dictionaries for
Note that the ordering of “high genes”, which are sorted for each group to appear in blocks in our visualisations (Figure 8), does not influence results.
Summarising the simulation results, the selection of the number of dictionary atoms has a strong influence on the results, especially for noisy data. In this small example, setting the number of dictionary atoms equal to the number of groups in the dataset appears to be a good choice.
3.2.2 Real world data
Experiments for the real world datasets are run for dictionaries with m ∈ [1,2,...,2n] atoms, where n is the number of types in each dataset, and with sparsities s ∈ [1,2,...,m] for each dictionary (compare Figure 4).
Figure 3
Visualisation of ARI, AMI, and reconstruction error for dataset D1 for the dictionary with 26 atoms (as many atoms as types) and sparsities s ∈ [1,2,...,26]. The AMI is always higher than the ARI (by about 0.2, with slight variation). For smaller values of s, up to around s = 10, all measurements improve for increasing s (the reconstruction error decreases, ARI and AMI increase). For higher values of s, ARI and AMI stay close to 0.6 and the reconstruction error decreases in smaller steps.
Figure 4
Visualisation of the ARIs for the four real world datasets for all parameters evaluated (number of atoms m ∈ [1,2,...,2n], where n is the number of types in each dataset, and sparsity s ∈ [1,2,...,m] for each dictionary). Subject to certain exceptions, 1.) the ARI increases for an increasing number of atoms until it reaches a maximum and then decreases for a further increase of m (compare values from bottom to top); 2.) for an increase of s, the ARI first increases and then stays relatively constant (compare values from left to right). ARIs for D2 are significantly worse than for the other three datasets. The parameter sets of the highest ARIs for D1 vary depending on the types the clusters are compared with (detailed/general). As expected, the optimal range for the detailed types (larger n) lies in the range of higher numbers of atoms, respectively sparsities.
As the behaviour of ARI and AMI is similar in our experiments (see Figure 3 for an example for dataset D1), we focus on the ARI in the detailed parameter evaluation; conclusions can be transferred to the AMI.
For all RNA-seq datasets, we find that in general the ARI increases for an increasing number of atoms up to a certain value where m is close to n (see Figure 4). Considering the sparsity, for many dictionaries the ARI first increases when s is increased and then stays relatively constant for larger values of s. ARIs for D2 are worse than for the other three datasets. Dataset D2, however, turned out to have many samples with a strikingly high number of zero values. Discarding samples with more than 75 % zero-entries yields significantly better results with a maximum ARI of 0.80. For dataset D1, we perform evaluations for both the general and the detailed tissue types.
The maximum ARI is: 0.77/0.73 for D1 (for general types/detailed types), 0.56 for D2, 1.00 for D3, 0.7 for D4. The maximum AMI is: 0.88/0.85 (for general types/detailed types) for D1, 0.68 for D2, 1.00 for D3, 0.8 for D4.
3.3 How well can the found gene-sets separate the types?
To find the best parameter combination for the biological evaluation, we suggest performing a grid search over the number of atoms and the sparsity (as described above) and selecting the parameter combination with the highest ARI; a minimal sketch of such a grid search is given below.
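The sketch assumes the parameter ranges from Figure 4 (m ∈ [1,...,2n], s ∈ [1,...,m]) and reuses the hypothetical evaluate_sparse_codings helper from section 3.1, so it is an illustration rather than our exact pipeline.

```python
from sklearn.decomposition import DictionaryLearning

def grid_search_dil(X, true_types, n_types):
    """Scan (number of atoms, sparsity) and keep the combination with the highest ARI."""
    best = {"ari": -1.0}
    for m in range(1, 2 * n_types + 1):          # number of atoms, m in [1, ..., 2n]
        for s in range(1, m + 1):                # sparsity, s in [1, ..., m]
            dil = DictionaryLearning(n_components=m, transform_algorithm="omp",
                                     transform_n_nonzero_coefs=s, random_state=0)
            R = dil.fit_transform(X)
            D = dil.components_
            ari, ami, err = evaluate_sparse_codings(R, D, X, true_types, n_types)
            if ari > best["ari"]:
                best = {"ari": ari, "ami": ami, "err": err, "m": m, "s": s}
    return best
```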
We evaluate to what extent the resulting clusters overlap with the type groups and which types are grouped, respectively split.
Figure 5 shows a visualisation of the sparse codings of all samples in each dataset. This visual representation already reveals that the different types have different sparse coding values. This is especially the case for dataset D3, where each type is represented by one atom. For the other datasets these differences are not as clear, but still visible. This is in agreement with the ARI, which is 1 for dataset D3 and between 0.56 and 0.77 for the other datasets.

Figure 5
Shown are the values of the sparse codings (values colour coded, see legend). The x-axis represents the samples ordered by type and the atoms are shown on the y-axis. The black lines indicate the border between every two types. Each coloured cell reflects the value of the sparse coding for one atom in one sample. This way, multiple dimensions can be visualised and therefore compared among different types. The visualisation confirms what the ARIs already indicated: for each type, (most of) the samples have a similar and unique representation.
3.3.1 D1: GTEx

Figure 6
Shown are the clustering results of the sparse codings for dataset D1 for the maximal ARI measured (dictionary with 42 atoms and a 39-sparse coding). (a) shows the tissues in the clusters. Most clusters consist largely of one tissue type. Interestingly, some general tissue types are separated into a few clusters, such as Brain in clusters 15 and 21. These clusters separate the cerebellum from the cerebrum. (b) shows the clusters for each tissue for the same sparse coding: in (a) we can see, for example, that cluster 6 consists of one main type and a few Salivary Gland samples; in (b) we can see that all Muscle tissue samples appear in only one cluster (cluster 6).
For dataset D1 the maximal ARI for the general types of 0.77 is reached for a dictionary with 42 atoms and a 39-sparse coding (compare Figure 6).
13 of 26 clusters have samples of one tissue type only and
in 18/24/25 clusters more than 90 % of samples are from one/two/three tissue type(s).
10 of 26 clusters have samples of one tissue type only and
in 13/21/24 clusters more than 90 % of samples are from one/two/three tissue type(s).
For the detailed types, the maximum ARI of 0.73 is reached for 39/32, 42/26, and 47/34 (number of atoms/sparsity).
3.3.2 D2: GEO GSE120795
For dataset D2 the maximal ARI of 0.56 is reached for a dictionary with 7 atoms and a 1-sparse coding (compare Figure 10).
Pie charts with an additional preprocessing step regarding the number of zero-entries can be found in the Appendix.
3.3.3 D3: Expression Atlas E-MTAB-2836
For dataset D3 the maximal ARI is 1.00 (reached, e. g., for a dictionary with 7 atoms and a 1-sparse coding, compare Figure 11).
3.3.4 D4: GEO GSE112004
For dataset D4 the maximal ARI of 0.7 is reached for a dictionary with 8 atoms and a 4-sparse coding (compare Figure 12).
The three treatment types are separated, wrong assignments appear only for times within one treatment class (compare Figure 12).
3.3.5 Summary of parameter evaluation for the real world datasets
The parameter evaluation shows that there is a wide range of parameters for which the variation of the ARI is small. E. g. setting
Restriction to positive dictionary entries
In experiments with an additional constraint allowing for positive dictionary entries only, similar maximum ARIs (±0.01) are reached; however, the number of atoms is often higher: 47 atoms for D1 and 8 atoms for D2, D3, and D4.
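In scikit-learn, such a restriction can be expressed with the positivity options of DictionaryLearning; the snippet below sketches this variant (whether these options reproduce our constrained experiments exactly is an assumption).

```python
from sklearn.decomposition import DictionaryLearning

# Variant with non-negative dictionary entries only (compare the paragraph above).
dil_pos = DictionaryLearning(
    n_components=8,               # e.g. 8 atoms, as reached for D2-D4 in this variant
    transform_algorithm="omp",
    transform_n_nonzero_coefs=4,  # illustrative sparsity level
    positive_dict=True,           # restrict dictionary entries to non-negative values
    random_state=0,
)
```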
3.4 Does the biological evaluation of the dictionary matrix agree with the types?
Let
To assign a subset of atoms to each type, we compute a value measuring the atom-preference among all samples for each type:
From the sparse codings of all samples of one type, we compute the mean value for each atom.
The resulting values are divided by the absolute sum of all atom selection values.
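A small numpy sketch of this per-type atom-preference computation; it assumes the sparse codings R (samples × atoms) and one type label per sample, and our reading of the normalisation (dividing by the sum of absolute values over all atoms) is an assumption.

```python
import numpy as np

def atom_preference(R, types):
    """For each type: mean sparse-coding value per atom over the type's samples,
    divided by the absolute sum of these values over all atoms."""
    types = np.asarray(types)
    prefs = {}
    for t in np.unique(types):
        mean_per_atom = R[types == t].mean(axis=0)               # mean value for each atom
        prefs[t] = mean_per_atom / np.abs(mean_per_atom).sum()   # normalise by absolute sum
    return prefs
```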
In the 2%Dictionary, the number of non-zero entries per atom varies between 127 and 1841 (out of 55,091) for dataset D1; Between 560 and 1601 (out of 53,679) for D2; Between 131 and 1331 (out of 49,914) for D3; Between 152 and 329 (out of 11,781) for D4. For many atoms the GO-terms entail terms that can be associated with the corresponding tissue (see Figure 7).
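The gene sets behind this analysis are obtained from the high entries per atom. The sketch below illustrates one possible thresholding rule (keeping entries above a fraction of the atom's maximal absolute value); the exact rule defining the 2%Dictionary and the threshold value are assumptions.

```python
import numpy as np

def gene_sets_from_dictionary(D, gene_names, rel_threshold=0.02):
    """For each atom (row of D), keep the genes whose absolute dictionary value
    exceeds rel_threshold times the atom's maximal absolute value."""
    gene_names = np.asarray(gene_names)
    gene_sets = []
    for atom in D:
        cutoff = rel_threshold * np.abs(atom).max()
        gene_sets.append(gene_names[np.abs(atom) > cutoff])
    return gene_sets

# The resulting gene sets can then be passed to a GO-term enrichment tool.
```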

Figure 7
Wordclouds of the GO-terms with
4 Use case runtime evaluation
The experiments were carried out on a machine with an Intel(R) Core(TM) i5-8500 processor and 16 GB RAM. The runtime scales with a) the dataset size and b) m, the number of atoms. Depending on the parameter m, the runtime for the smaller datasets (D2, D3, and D4) varies between 0.4 and 1.4 seconds (see Table 2).
Table 2
Wall-clock time of our approach for the four datasets for two selected parameter values for the number of atoms, m.
Dataset | Wall-clock time [s] for | Wall-clock time [s] for |
D1 | 3.2* | 46.5* |
D2 | 0.4 | 1.4 |
D3 | 0.4 | 1.2 |
D4 | 0.4 | 0.7 |
5 Data
Evaluations of our algorithm are performed on different datasets: (1) Simulated data and (2) four real world dataset from varying sources. Details on the respective datasets are given in the following paragraphs.
5.1 Simulated data
The construction of the simulated data is inspired by biological data. We simulate a matrix of 50 samples and 100 reads. The data is simulated to have 5 groups of 10 samples each. In the simulation, for each group we differentiate between a subset of background genes and up to 20 high genes. The subset size and the meaning of “high”/“low” are defined differently in each simulation setting. The values for background genes and high genes are simulated separately. In two of the five settings, a subset of the background values is high in all samples to simulate housekeeping genes. For a detailed explanation of the simulation settings see Table 3 and Figure 8.
Table 3
Overview of the simulation settings in the 5 simulated datasets. Three gene classes are simulated: Background entries, Housekeeping genes, and Group entries. Among the different simulation settings, for each class the percentage of these gene classes among all genes as well as the particular values are adjusted.
Setting | Background entries | Housekeeping genes | Group entries | |||
Amount | Value | Amount | Value | Amount | Value | |
A | All | 0 | None | – | All | 1 |
B | All | 0 | None | – | 50 % | 1 |
C | 60 % | 0 | 40 % | 1 | 50 % | 1 |
D | All | None | – | 50 % | ||
E | 60 % | 40 % | 50 % |

Figure 8
The simulated data is constructed to consist of 5 types. For each type a subset of all group-specific entries is high. (a) In the simplest setting, on which the others are based, all group-specific entries are 1 and the other values are 0. (b–e) In the other simulations group-specific entries are drawn randomly; (d–e) noise is added; (c), (e) a subset of genes is high in all groups.
For each setting but the first (because the first does not require a random drawing) we simulate 100 datasets.
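For illustration, the sketch below constructs one dataset of the simplest setting (A); the exact placement of the group-specific blocks follows our reading of Figure 8 and is an assumption in its details.

```python
import numpy as np

def simulate_setting_a(n_groups=5, samples_per_group=10, n_genes=100, genes_per_group=20):
    """Setting A: all background entries are 0 and, for each group, a disjoint
    block of group-specific genes is set to 1 in all samples of that group."""
    X = np.zeros((n_groups * samples_per_group, n_genes))
    for g in range(n_groups):
        rows = slice(g * samples_per_group, (g + 1) * samples_per_group)
        cols = slice(g * genes_per_group, (g + 1) * genes_per_group)
        X[rows, cols] = 1.0
    return X

X_sim = simulate_setting_a()               # 50 samples x 100 reads
true_types = np.repeat(np.arange(5), 10)   # 5 groups of 10 samples each
```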
5.2 Real world data
We analyse four RNA-seq datasets (see Table 4) with samples from multiple types (see Table 1). The number of samples per type varies within each dataset. To obtain datasets with the same number of samples per type, samples are randomly selected for each type such that the number of samples per type is the same for all types and maximal given the data; a minimal sketch of this selection step is given below, after Table 4.
Table 4
Meta data for the four real world datasets analysed in this study. Datasets D1 and D4 contain a larger number of samples than D2 and D3. D1, D2, and D3 contain a larger number of genes than D4.
ID | Database | Database ID | Total samples | Reads/Genes | Size [MB] raw (preprocessed) |
D1 | GTEX | GTEX | 2,392 | 55,091 | 2700 (572) |
D2 | GEO | GSE120795 | 64 | 53,679 | 24 (34) |
D3 | Expression Atlas | E-MTAB-2836 | 64 | 49,914 | 37 (5) |
D4 | GEO | GSE112004 | 288 | 11,781 | 28 (260) |
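A minimal sketch of the selection step described before Table 4, assuming the sample metadata is available as a pandas DataFrame with a 'type' column (column name and data layout are assumptions).

```python
import pandas as pd

def balance_types(meta: pd.DataFrame, min_samples: int, seed: int = 0) -> pd.DataFrame:
    """Keep only types with at least `min_samples` samples and randomly select
    the same number of samples (the minimum over the kept types) from each type."""
    counts = meta["type"].value_counts()
    kept = counts[counts >= min_samples].index
    meta = meta[meta["type"].isin(kept)]
    n = meta["type"].value_counts().min()
    return (meta.groupby("type", group_keys=False)
                .apply(lambda df: df.sample(n=n, random_state=seed)))
```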
5.2.1 D1: GTEX
The GTEX data contains TPM-normalised counts for 11,688 samples and 56,202 reads.
Each sample is labelled with a detailed tissue description (e. g. Adipose – subcutaneous, Adipose – Visceral, Brain – Amygdala,...), as well as a more general one (e. g. Adipose – Subcutaneous and Adipose – Visceral as type Adipose). We evaluate the dictionary learning for the general grouping and use the detailed description in the result analysis only. Data is present for 53 detailed tissue types with 5–564 samples per type, respectively 31 grouped tissue types with 7–1671 samples per type.
The minimum number of samples per tissue is set to 50.
Outlier removal and selection of the same number of samples per type results in a 2,392×55,091 matrix (samples × genes).
5.2.2 D2: GEO GSE120795
The GEO dataset GSE120795 contains FPKM-normalised counts for 166 samples and 58,233 reads. Samples stem from 25 tissues with 1–15 samples per tissue.
The minimum number of samples per tissue is set to 8.
Outlier removal and selection of the same number of samples per type results in a 64×53,679 matrix (samples × genes).
Note, that some samples have zero-values for most of the reads and only some high values. These samples could easily be detected by a limitation to the number of zero-values per sample (e. g. no more than 75 % zero-entries).
5.2.3 D3: Expression Atlas E-MTAB-2836
The Expression Atlas dataset E-MTAB-2836 contains counts for 200 samples and 65,217 reads. Samples stem from 32 tissues with 3–13 samples per tissue.
The minimum number of samples per tissue is set to 8.
Outlier removal and selection of the same number of samples per type results in a 64×49,914 matrix (samples × genes).
5.2.4 D4: GEO GSE112004
The GEO dataset GSE112004 is a single cell dataset and contains read counts for 3,648 samples and 11,841 genes (already mapped). Samples stem from murine CD19+ B cells: untreated, treated to trans-differentiate to macrophages, and treated to trans-differentiate to induced pluripotent stem cells (3 types). For each type, measurements at three time points are available (considered as 9 types in total).
The minimum number of samples per type is set to 32.
Outlier removal and selection of the same number of samples per type results in a 288×11,781 matrix (samples × genes).
5.2.5 Normalisation
In [3] Cleary et al. suggest normalising the data by a removal of genes for which the sum of counts is
6 Discussion and conclusion
In this study, we present our results for a new variant of dictionary learning. We apply the method to simulated and real world data. We evaluate both the relation between sample labels and clusters of the sparse codings and the biological relevance of the gene-sets represented by the dictionary atoms. Clustering the sparse codings of the real world datasets showed that the dictionary keeps relevant properties of the data. Further, the biological relevance of the gene-sets detected by the DiL approach was confirmed by a GO-term analysis, which revealed the large potential of dictionary learning in RNA-seq analysis to extract type-specific gene-sets. We performed a large study on four datasets, showing that the results are not unique to one dataset.
In contrast to [3] the focus of our analysis is not to find a dictionary such that a high-dimensional transcriptomic profile can be generated from sequencing a small, random selection of genes. Rather, we aim at 1.) detecting relevant gene-sets and 2.) thereby confirming that the dictionary keeps relevant properties of the data – with this in mind, the sparse codings can be used for further analysis.
In contrast to [12] we perform only one step of dictionary learning which is a lot faster than the standard dictionary learning and again several times faster than the nested dictionary learning presented in [12].
We have only evaluated clusterings with the number of clusters, k, equal to the number of types given in the data, n. In some cases, we found that one type was split whereas others were grouped into one cluster, which results in an ARI below 1.
The DiL method requires two parameters, the number of atoms of the dictionary, m, and the level of sparsity, s. Our experiments revealed that for every dataset evaluated there was a large range of parameters for which the variation of ARI, respectively AMI is small. E. g. setting

Figure 9
Shown are results for dataset D1 for the maximal ARI measured (dictionary with 37 atoms and a 32-sparse coding). (a) Shown are the detailed tissue types in the clusters. Many clusters gather subtypes of one tissue (compare Figure 6), which is what we aim for since the number of clusters is set equal to the number of general tissue types. Interestingly, some general tissue types are separated into a few clusters, such as brain in clusters 11 and 19. These clusters separate the cerebellum from the cerebrum. (c, d) Shown are the clusters for each tissue.
This paper is focused on the method and its analysis on multiple RNA-seq datasets as well as a detailed analysis of the results. We have only performed very limited normalisation of the data. A more extensive normalisation might improve results even further. This becomes visible in the analysis of dataset D2, where one simple additional normalisation step improved the ARI from 0.56 to 0.8.
By thresholding we filtered the dictionary matrix to receive the genes corresponding to the highest values per atom (the 2%Dictionary).
Other related methods, such as NMF and SVD, do not enforce sparsity per sample; DiL, however, fulfils this criterion. With the results of this paper in mind, showing that the dictionary keeps relevant properties of the data, one could use the dictionary for the representation of states (e. g. of a single cell) and study, e. g., its time-dependent behaviour in the dictionary space. Application of dictionary learning on dataset D4 shows that results for single cell data are similarly good as for pooled cells.
In conclusion, our study demonstrates that our dictionary learning based approach is able to detect biologically relevant modules of genes specific to various types, as well as to represent RNA-seq data in lower dimension.
Funding source: Bundesministerium für Bildung und Forschung
Award Identifier / Grant number: 3FO18501
Award Identifier / Grant number: 05M16KEA
Funding statement: This study was supported by the Federal Ministry of Education and Research of Germany, grant 3FO18501 (Forschungscampus MODAL) and grant 05M16KEA (IBOSS) to TC and MR. The funders did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Appendix
The following figures show results of the analysis of datasets D1–D4.

Figure 10
Shown are results for dataset D2 for the dictionary with 7 atoms and a 1-sparse coding. (a) Shown are the tissues in the clusters. Five clusters consist of one tissue type only. (b) Shown are the clusters for each tissue. Interestingly brain tissues are separated into two clusters which might represent a separation of grey and white matter (no data available).

Figure 11
Shown are results for dataset D3 for the dictionary with 7 atoms and a 1-sparse coding. (a) Shown are the tissues in the clusters. (b) Shown are the clusters for each tissue. As the ARI for this data is 1, all pies are of one colour only.

Figure 12
Shown are results for dataset D4 for the dictionary with 8 atoms and a 4-sparse coding. The type labels describe the treatment type and time (hours). (a) Shown are the types in the clusters. Five clusters consist of one type only. (b) Shown are the clusters for each type. The three treatment types are separated; wrong assignments appear only for times within one treatment class.
References
1. Orly Alter, Patrick O. Brown, and David Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, 97(18):10101–10106, 2000. doi:10.1073/pnas.97.18.10101.
2. Sven Bergmann, Jan Ihmels, and Naama Barkai. Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67(3):031902, 2003. doi:10.1103/PhysRevE.67.031902.
3. Brian Cleary, Le Cong, Anthea Cheung, Eric S. Lander, and Aviv Regev. Efficient generation of transcriptomic profiles by random composite measurements. Cell, 171(6):1424–1436, 2017. doi:10.1016/j.cell.2017.10.023.
4. Ronald R. Coifman and David L. Donoho. Translation-invariant de-noising. In Wavelets and Statistics, pages 125–150. Springer, 1995. doi:10.1007/978-1-4612-2544-7_9.
5. Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Research, 43:D1049–D1056, 2015. doi:10.1093/nar/gku1179.
6. Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006. doi:10.1109/TIP.2006.881969.
7. Amin Emad and Olgica Milenkovic. Caspian: A causal compressive sensing algorithm for discovering directed interactions in gene networks. PLoS ONE, 9(3):e90781, 2014. doi:10.1371/journal.pone.0090781.
8. Y. Fang, L. Chen, J. Wu, and B. Huang. GPU implementation of orthogonal matching pursuit for compressive sensing. In 2011 IEEE 17th International Conference on Parallel and Distributed Systems, pages 1044–1047, Dec. 2011. doi:10.1109/ICPADS.2011.158.
9. Lei Huang, Yan Jin, Yaozong Gao, Kim-Han Thung, Dinggang Shen, et al., Alzheimer’s Disease Neuroimaging Initiative. Longitudinal clinical score prediction in Alzheimer’s disease with soft-split sparse regression based random forest. Neurobiology of Aging, 46:180–191, 2016. doi:10.1016/j.neurobiolaging.2016.07.005.
10. Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985. doi:10.1007/BF01908075.
11. Morteza Kolali Khormuji and Mehrnoosh Bazrafkan. A novel sparse coding algorithm for classification of tumors based on gene expression data. Medical & Biological Engineering & Computing, 54(6):869–876, 2016. doi:10.1007/s11517-015-1382-8.
12. Elina Koletou. Prostate cancer patient stratification with MINING: Molecular Signatures via Nested Dictionary Learning. PhD thesis, ETH Zurich, 2019.
13. Jin-Xing Liu, Yong Xu, Chun-Hou Zheng, Heng Kong, and Zhi-Hui Lai. RPCA-based tumor classification using gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(4):964–970, 2015. doi:10.1109/TCBB.2014.2383375.
14. Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. doi:10.1109/TIT.1982.1056489.
15. Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689–696. ACM, 2009. doi:10.1145/1553374.1553463.
16. Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689–696. ACM, 2009. doi:10.1145/1553374.1553463.
17. Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Conference Record of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 40–44. IEEE, 1993.
18. Yosef Prat, Menachem Fromer, Nathan Linial, and Michal Linial. Recovering key biological constituents through sparse representation of gene expression. Bioinformatics, 27(5):655–661, 2011. doi:10.1093/bioinformatics/btr002.
19. Ron Rubinstein, Michael Zibulevsky, and Michael Elad. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical report, Computer Science Department, Technion, 2008.
20. Eran Segal, Michael Shapira, Aviv Regev, Dana Pe’er, David Botstein, Daphne Koller, and Nir Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics, 34(2):166, 2003. doi:10.1038/ng1165.
21. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. doi:10.1111/j.2517-6161.1996.tb02080.x.
22. Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854, 2010.
23. Meng Yang, Lei Zhang, Xiangchu Feng, and David Zhang. Fisher discrimination dictionary learning for sparse representation. In 2011 International Conference on Computer Vision, pages 543–550. IEEE, 2011. doi:10.1109/ICCV.2011.6126286.
24. Yuan You, Hongmin Cai, and Jiazhou Chen. Low rank representation and its application in bioinformatics. Current Bioinformatics, 13(5):508–517, 2018. doi:10.2174/1574893612666171121155347.
25. Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. doi:10.1111/j.1467-9868.2005.00503.x.
© 2020 Rams and Conrad, published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 Public License.