Extracting information from large biological datasets is a challenging task, due to the large data size, high-dimensionality, noise, and errors in the data. Gene expression data contains information about which gene products have been formed by a cell, thus representing which genes have been read to activate a particular biological process. Understanding which of these gene products can be related to which processes can for example give insights about how diseases evolve and might give hints about how to fight them.
The Next Generation RNA-sequencing method emerged over a decade ago and is nowadays state-of-the-art in the field of gene expression analyses. However, analyzing these large, complex datasets is still a challenging task. Many of the existing methods do not take into account the underlying structure of the data.
In this paper, we present a new approach for RNA-sequencing data analysis based on dictionary learning. Dictionary learning is a sparsity enforcing method that has widely been used in many fields, such as image processing, pattern classification, signal denoising and more. We show how for RNA-sequencing data, the atoms in the dictionary matrix can be interpreted as modules of genes that either capture patterns specific to different types, or else represent modules that are reused across different scenarios. We evaluate our approach on four large datasets with samples from multiple types. A Gene Ontology term analysis, which is a standard tool indicated to help understanding the functions of genes, shows that the found gene-sets are in agreement with the biological context of the sample types. Further, we find that the sparse representations of samples using the dictionary can be used to identify type-specific differences.
There is an increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generating new information and creating services of higher quality. A central issue in working with heterogeneous sources is data migration: In order to share and process data in different engines, resource intensive and complex movements and transformations between computing engines, services, and stores are necessary.
Muses is a distributed, high-performance data migration engine that is able to interconnect distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among distributed engines’ instances in a resource-, cost-, and performance-adaptive manner. As such, it performs seamless information sharing across all participating resources in a standard, modular manner. We show an overall improvement of 30 % for pipelining jobs across multiple engines, even when we count the overhead of Muses in the execution time. This performance gain implies that Muses can be used to optimise large pipelines that leverage multiple engines.
The Internet of Things (IoT) sparks a revolution in time series forecasting. Traditional techniques forecast time series individually, which becomes unfeasible when the focus changes to thousands of time series exhibiting anomalies like noise and missing values. This work presents CSAR, a technique forecasting a set of time series with only one model, and a feature-aware partitioning applying CSAR on subsets of similar time series. These techniques provide accurate forecasts a hundred times faster than traditional techniques, preparing forecasting for the arising challenges of the IoT era.
The selection of input data is a crucial step in virtually every empirical study. Experimental campaigns in algorithm engineering, experimental algorithmics, network analysis, and many other fields often require suited network data. In this context, synthetic graphs play an important role, as data sets of observed networks are typically scarce, biased, not sufficiently understood, and may pose logistic and legal challenges. Just like processing huge graphs becomes challenging in the big data setting, new algorithmic approaches are necessary to generate such massive instances efficiently. Here, we update our previous survey  on results for large-scale graph generation obtained within the DFG priority programme SPP 1736 (Algorithms for Big Data); to this end, we broaden the scope and include recently published results.
Mathematical optimization is at the algorithmic core of machine learning. Almost any known algorithm for solving mathematical optimization problems has been applied in machine learning and the machine learning community itself is actively designing and implementing new algorithms for specific problems. These implementations have to be made available to machine learning practitioners which is mostly accomplished by distributing them as standalone software. Successful well-engineered implementations are collected in machine learning toolboxes that provide a more uniform access to the different solvers. A disadvantage of the toolbox approach is a lack of flexibility as toolboxes only provide access to a fixed set of machine learning models that cannot be modified. This can be a problem for the typical machine learning workflow that iterates the process of modeling, solving and validating. If a model does not perform well on validation data, it needs to be modified. In most cases these modifications require a new solver for the entailed optimization problems. Optimization frameworks that combine a modeling language for specifying optimization problems with a solver are better suited to the iterative workflow since they allow to address large problem classes. Here, we provide examples of the use of optimization frameworks in machine learning. We also illustrate the use of one such framework in a case study that follows the typical machine learning workflow.
Network science methodology is increasingly applied to a large variety of real-world phenomena, often leading to big network data sets. Thus, networks (or graphs) with millions or billions of edges are more and more common. To process and analyze these data, we need appropriate graph processing systems and fast algorithms. Yet, many analysis algorithms were pioneered on small networks when speed was not the highest concern. Developing an analysis toolkit for large-scale networks thus often requires faster variants, both from an algorithmic and an implementation perspective. In this paper we focus on computational aspects of vertex centrality measures. Such measures indicate the (relative) importance of a vertex based on the position of the vertex in the network. We describe several common (and some recent and thus less established) measures, optimization problems in their context as well as algorithms for an efficient solution of the raised problems. Our focus is on (not necessarily exact) performance-oriented algorithmic techniques that enable significantly faster processing than the previous state of the art – often allowing to process massive data sets quickly and without resorting to distributed graph processing systems.
Big Data applications are characterized by processing an amount of data too huge to be stored. Cryptographic protocols are by construction supposed to define huge data spaces that cannot be handled by any attacker. Nevertheless, the task of protocol cryptanalysis is to properly select cryptographic parameter lengths that guarantee both efficiency and security. This requires to break cryptographic protocols and their underlying hardness assumptions for mid-sized parameters. But even for mid-sized parameters cryptographic search spaces are way too huge to be stored. This asks for technical solutions that traverse the search space without storing elements. As an appealingly simple example, we address the subset sum problem which lies at the heart of many modern cryptographic protocols designed to offer security even against quantum computers. In the subset sum problem, one obtains integers and an integer target t, and has to find a subset of the ’s that exactly sums to t. A trivial memory-less algorithm tests for all subsets, whether their sum equals t. It may come as a surprise that there exist memory-less algorithms significantly faster than . We give a survey on recent memory-less techniques, that apply but are not limited to the subset sum problem. We start by describing a general collision finding technique that was introduced in 1994 in the seminal work of van Oorschot and Wiener. Applied to subset sum the van Oorschot-Wiener technique leads to a -algorithm. This was improved in 2011 by Becker, Coron and Joux to using the representation technique. Recently, Esser and May presented a memory-less algorithm achieving using two-layered collision finding. These running times have to be compared to the optimal lower bound for collision finding algorithms.