Providing fault tolerance through invasive computing

Vahid Lari 1, Andreas Weichslgartner 1, Alexandru Tanase 1, Michael Witterauf 1, Faramarz Khosravi 1, Jürgen Teich 1, Jan Heißwolf 2, Stephanie Friederich 3, and Jürgen Becker 3
  • 1 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Hardware/Software Co-Design, Cauerstr. 11, 91058 Erlangen, Germany
  • 2 Robert Bosch GmbH, Tuebingerstr. 123, 72762 Reutlingen, Germany
  • 3 Karlsruhe Institute of Technology (KIT), Institute for Information Processing Technologies (ITIV), Engesserstr. 5, 76131 Karlsruhe, Germany

Abstract

As a consequence of technology scaling, today's complex multi-processor systems have become increasingly susceptible to errors. To satisfy reliability requirements, such systems need methods to detect and tolerate errors. This entails two major challenges: (a) providing a comprehensive approach that ensures fault-tolerant execution of parallel applications across different types of resources, and (b) optimizing resource usage in the face of dynamic fault probabilities or varying fault tolerance needs of different applications. In this paper, we present a holistic and adaptive approach, based on invasive computing, that provides fault tolerance on a Multi-Processor System-on-a-Chip (MPSoC) on demand, driven by application or environmental needs. We show how invasive computing may provide adaptive fault tolerance on a heterogeneous MPSoC, including hardware accelerators and communication infrastructure such as a Network-on-Chip (NoC). In addition, we present (a) compile-time transformations that automatically apply well-known redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) for fault-tolerant loop execution on a class of massively parallel processor arrays called Tightly Coupled Processor Arrays (TCPAs). Based on timing characteristics derived from our compilation flow, we further develop (b) a reliability analysis that guides the selection of a suitable degree of fault tolerance. Finally, we present (c) a methodology to detect and adaptively mitigate faults in invasive NoCs.
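
To make the redundancy schemes named in the abstract concrete, the following is a minimal Python sketch of the generic idea behind DMR (detect an error by comparing two replicas) and TMR (mask a single error by majority vote over three replicas). It is not the authors' compile-time transformation and does not model a TCPA; the function names `dmr_compare`, `tmr_vote`, and `tmr_reliability` are hypothetical, and the reliability expression is the textbook formula for TMR with independent replicas and a perfect voter, not the analysis developed in the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def dmr_compare(replicas: Sequence[Hashable]) -> Hashable:
    """DMR: two replicas allow error detection, but not correction."""
    first, second = replicas  # exactly two results expected
    if first != second:
        raise RuntimeError("DMR mismatch: error detected")
    return first


def tmr_vote(replicas: Sequence[Hashable]) -> Hashable:
    """TMR: majority vote over three replicas masks one faulty result."""
    if len(replicas) != 3:
        raise ValueError("TMR expects exactly three replica results")
    value, votes = Counter(replicas).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("TMR vote failed: no two replicas agree")
    return value


def tmr_reliability(r: float) -> float:
    """Textbook TMR reliability, assuming independent replicas with
    reliability r each and a perfect voter: at least 2 of 3 are correct."""
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    # One replica of a loop iteration's result is corrupted (8 instead of 7).
    print(tmr_vote([7, 7, 8]))
    print(tmr_reliability(0.9))
```

Under these assumptions, TMR improves on a single replica only when the per-replica reliability exceeds 0.5, which hints at why choosing the degree of redundancy benefits from a reliability analysis.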

it - Information Technology is a strictly peer-reviewed scientific journal. It is the oldest German journal in the field of information technology. Today, the major aim of it - Information Technology is to highlight issues in current, newsworthy areas of information technology and informatics and their applications. It aims to present these topics from a holistic view.