Publications

Tolerating SEU faults in the RAW architecture

Abstract

This paper describes software fault tolerance techniques that mitigate single-event upset (SEU) faults in the Raw architecture, a single-chip, parallel, tiled computing architecture. The techniques are efficient checkpointing and rollback of processor state, break-pointing, selective replication of code, and selective duplication of tiles. They can be implemented entirely in software, without any changes to the architecture; they are transparent to the user and are designed to meet the run-time performance and throughput requirements of the system. We illustrate the techniques by applying them to a matrix multiply kernel mapped onto Raw. The proposed techniques are also applicable to other tiled architectures and to parallel systems in general.
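
To give a rough feel for how checkpointing, rollback, and replication of code can fit together on a matrix multiply kernel, the following is a minimal Python sketch. It is not the paper's implementation and does not model Raw tiles: the function names (`matmul_row`, `protected_matmul`, `make_faulty`), the row-level checkpoint granularity, and the bit-flip fault injector are all assumptions made for illustration.

```python
import copy


def matmul_row(a, b, i):
    """Compute row i of the product a * b using plain nested lists."""
    return [sum(a[i][k] * b[k][j] for k in range(len(b)))
            for j in range(len(b[0]))]


def make_faulty(fault_on_call):
    """Hypothetical fault injector: corrupts the result of one chosen call."""
    calls = {"n": 0}

    def faulty_row(a, b, i):
        calls["n"] += 1
        row = matmul_row(a, b, i)
        if calls["n"] == fault_on_call:
            row[0] ^= 1  # flip the low bit of one integer to mimic an SEU
        return row

    return faulty_row


def protected_matmul(a, b, compute_row=matmul_row, max_retries=3):
    """Multiply row by row, running each row twice and comparing results.

    A matching pair is committed and checkpointed; a mismatch restores the
    last checkpoint and repeats the row.
    """
    result = []
    checkpoint = copy.deepcopy(result)  # last known-good state
    rollbacks = 0
    for i in range(len(a)):
        for _ in range(max_retries):
            first = compute_row(a, b, i)
            second = compute_row(a, b, i)  # replicated computation
            if first == second:
                result.append(first)
                checkpoint = copy.deepcopy(result)  # commit a new checkpoint
                break
            result = copy.deepcopy(checkpoint)  # roll back and retry the row
            rollbacks += 1
        else:
            raise RuntimeError(f"row {i} still mismatched after retries")
    return result, rollbacks


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    product, rollbacks = protected_matmul(a, b, compute_row=make_faulty(2))
    print(product, rollbacks)
```

In this sketch every row is replicated, which doubles the compute cost. Selective replication as named in the abstract would presumably protect only chosen code regions, and how those regions are chosen is not stated here.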

Date
January 1, 1970
Authors
Karandeep Singh, Adnan Agbaria, Dong-In Kang, Matthew French
Journal
3rd International Workshop on Dependable Embedded Systems