Abstract

Extracting information from large biological datasets is a challenging task due to their size, high dimensionality, noise, and errors. Gene expression data contains information about which gene products have been formed by a cell and thus represents which genes have been read to activate a particular biological process. Understanding which of these gene products relate to which processes can, for example, give insight into how diseases evolve and hint at how they might be fought.

Next-generation RNA sequencing emerged over a decade ago and is now the state of the art in gene expression analysis. However, analyzing these large, complex datasets remains a challenging task, and many existing methods do not take the underlying structure of the data into account.

In this paper, we present a new approach for RNA-sequencing data analysis based on dictionary learning. Dictionary learning is a sparsity-enforcing method that has been widely used in many fields, such as image processing, pattern classification, and signal denoising. We show that, for RNA-sequencing data, the atoms in the dictionary matrix can be interpreted as modules of genes that either capture patterns specific to particular sample types or represent modules that are reused across different scenarios. We evaluate our approach on four large datasets with samples from multiple types. A Gene Ontology term analysis, a standard tool for characterizing the functions of genes, shows that the identified gene sets are in agreement with the biological context of the sample types. Further, we find that the sparse representations of samples over the dictionary can be used to identify type-specific differences.
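To make the dictionary-learning idea concrete, the following is a minimal sketch (not the authors' implementation) using scikit-learn's DictionaryLearning on a placeholder gene expression matrix; the atom count, sparsity level, and preprocessing are illustrative assumptions only.

```python
# Minimal sketch, assuming a samples-x-genes expression matrix; the data here is
# synthetic and the hyperparameters are illustrative, not the paper's settings.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.poisson(5, size=(100, 500)).astype(float)   # placeholder for RNA-seq counts
X = np.log1p(X)                                      # simple variance-stabilizing transform

dl = DictionaryLearning(
    n_components=10,                 # number of atoms, i.e. candidate gene modules
    fit_algorithm="lars",
    transform_algorithm="lasso_lars",
    transform_alpha=1.0,             # sparsity of the per-sample codes
    random_state=0,
)
codes = dl.fit_transform(X)          # sparse representation of each sample (samples x atoms)
atoms = dl.components_               # dictionary matrix (atoms x genes)

# Genes with the largest weights in an atom form a candidate module that could be
# passed to a Gene Ontology enrichment tool.
top_genes_atom0 = np.argsort(np.abs(atoms[0]))[::-1][:25]
print(top_genes_atom0)
```

In this reading, the rows of `atoms` correspond to the gene modules and `codes` to the sparse per-sample representations referred to in the abstract.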

Abstract

There is an increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generating new information and creating services of higher quality. A central issue in working with heterogeneous sources is data migration: in order to share and process data in different engines, resource-intensive and complex movements and transformations between computing engines, services, and stores are necessary.

Muses is a distributed, high-performance data migration engine that interconnects distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among distributed engines’ instances in a resource-, cost-, and performance-adaptive manner. As such, it enables seamless information sharing across all participating resources in a standard, modular way. We show an overall improvement of 30% for pipelining jobs across multiple engines, even when the overhead of Muses is included in the execution time. This performance gain implies that Muses can be used to optimise large pipelines that leverage multiple engines.
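The abstract does not expose Muses’ API, so the following is only a conceptual sketch of the data-movement primitives it names (transforming, repartitioning, broadcasting), modeled here on simple in-memory batches; all names and signatures are hypothetical.

```python
# Conceptual sketch only -- not the Muses API. Batches between engine instances
# are modeled as plain lists of records.
from typing import Callable, Iterable, List

Record = dict

def transform(batch: Iterable[Record], fn: Callable[[Record], Record]) -> List[Record]:
    """Apply a per-record transformation while data is in flight."""
    return [fn(r) for r in batch]

def repartition(batch: Iterable[Record], n_targets: int,
                key: Callable[[Record], int]) -> List[List[Record]]:
    """Split one batch into n_targets partitions, e.g. by hashing a join key."""
    parts: List[List[Record]] = [[] for _ in range(n_targets)]
    for r in batch:
        parts[key(r) % n_targets].append(r)
    return parts

def broadcast(batch: Iterable[Record], n_targets: int) -> List[List[Record]]:
    """Send the full batch to every target instance."""
    batch = list(batch)
    return [list(batch) for _ in range(n_targets)]

# Example: rescale values in flight, then spread records across 3 downstream instances.
records = [{"id": i, "value": i * 10} for i in range(6)]
scaled = transform(records, lambda r: {**r, "value": r["value"] / 100})
print(repartition(scaled, 3, key=lambda r: r["id"]))
```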

Abstract

The Internet of Things (IoT) sparks a revolution in time series forecasting. Traditional techniques forecast each time series individually, which becomes infeasible when the focus shifts to thousands of time series exhibiting anomalies such as noise and missing values. This work presents CSAR, a technique that forecasts a set of time series with only one model, and a feature-aware partitioning that applies CSAR to subsets of similar time series. These techniques provide accurate forecasts a hundred times faster than traditional techniques, preparing forecasting for the emerging challenges of the IoT era.
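CSAR itself is not detailed in the abstract; the sketch below only illustrates the feature-aware partitioning idea under simple assumptions: describe each series by a few cheap features, cluster them, and fit one shared autoregressive model per partition instead of one model per series. All parameters and model choices are illustrative.

```python
# Illustrative sketch, not the CSAR algorithm: shared models over feature-based
# partitions of many synthetic series.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
series = [np.cumsum(rng.normal(size=100)) + rng.normal(scale=5) for _ in range(1000)]

# Feature-aware partitioning: describe each series by a few cheap features.
feats = np.array([[s.mean(), s.std(), s[-1] - s[0]] for s in series])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(feats)

def lagged(s, p=3):
    """Build (X, y) training pairs from the last p values of a series."""
    X = np.array([s[i:i + p] for i in range(len(s) - p)])
    return X, s[p:]

# One shared model per partition, trained on all member series at once.
models = {}
for c in range(5):
    members = [lagged(s) for s, l in zip(series, labels) if l == c]
    if not members:
        continue
    Xs, ys = zip(*members)
    models[c] = LinearRegression().fit(np.vstack(Xs), np.concatenate(ys))

# Forecast the next value of series 0 with its partition's shared model.
s0, c0 = series[0], labels[0]
print(models[c0].predict(s0[-3:].reshape(1, -1)))
```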

Abstract

The selection of input data is a crucial step in virtually every empirical study. Experimental campaigns in algorithm engineering, experimental algorithmics, network analysis, and many other fields often require suitable network data. In this context, synthetic graphs play an important role, as datasets of observed networks are typically scarce, biased, not sufficiently understood, and may pose logistical and legal challenges. Just as processing huge graphs becomes challenging in the big data setting, new algorithmic approaches are necessary to generate such massive instances efficiently. Here, we update our previous survey [35] on results for large-scale graph generation obtained within the DFG priority programme SPP 1736 (Algorithms for Big Data); to this end, we broaden the scope and include recently published results.
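As a small illustration of what synthetic graph generation means in practice (not one of the large-scale generators surveyed in the paper), the following uses networkx to build a preferential-attachment graph; the SPP 1736 generators target far larger instances and a wider range of models.

```python
# Small illustrative example: a preferential-attachment (Barabási–Albert) graph.
# Generator choice and parameters are assumptions for demonstration only.
import networkx as nx

n, m = 100_000, 4                      # nodes, edges attached per new node
G = nx.barabasi_albert_graph(n, m, seed=42)

print(G.number_of_nodes(), G.number_of_edges())
print("max degree:", max(d for _, d in G.degree()))
```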

Abstract

Mathematical optimization is at the algorithmic core of machine learning. Almost every known algorithm for solving mathematical optimization problems has been applied in machine learning, and the machine learning community itself actively designs and implements new algorithms for specific problems. These implementations have to be made available to machine learning practitioners, which is mostly accomplished by distributing them as standalone software. Successful, well-engineered implementations are collected in machine learning toolboxes that provide more uniform access to the different solvers. A disadvantage of the toolbox approach is a lack of flexibility, as toolboxes only provide access to a fixed set of machine learning models that cannot be modified. This can be a problem for the typical machine learning workflow, which iterates the process of modeling, solving, and validating. If a model does not perform well on validation data, it needs to be modified. In most cases these modifications require a new solver for the entailed optimization problems. Optimization frameworks that combine a modeling language for specifying optimization problems with a solver are better suited to this iterative workflow, since they allow large problem classes to be addressed. Here, we provide examples of the use of optimization frameworks in machine learning. We also illustrate the use of one such framework in a case study that follows the typical machine learning workflow.
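As a hedged illustration of the framework approach (modeling language plus solver), the sketch below uses CVXPY, which may differ from the framework used in the paper's case study. It shows why the iterative workflow benefits: swapping the regularizer changes only the model specification, not the solver code.

```python
# Sketch of the modeling-language workflow with CVXPY; data and penalties are
# illustrative assumptions, not the paper's case study.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X @ rng.normal(size=10) + 0.1 * rng.normal(size=200))

w = cp.Variable(10)
loss = cp.sum(cp.logistic(cp.multiply(-y, X @ w)))   # logistic regression loss

# Model 1: L2-regularized logistic regression.
cp.Problem(cp.Minimize(loss + 1.0 * cp.sum_squares(w))).solve()

# Modification after validation: switch to an L1 regularizer for sparsity.
# Only the model changes; the framework reuses its generic solver.
cp.Problem(cp.Minimize(loss + 1.0 * cp.norm(w, 1))).solve()
print(np.round(w.value, 3))
```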