Publication:
Exploring Memory Error Vulnerability for Parallel Programming Models

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG

Abstract

Shrinking transistor sizes and more aggressive power modes in HPC platforms make chip components more error-prone. In this context, HPC applications can show varying levels of tolerance to memory errors, which may alter execution in different ways. Because tolerance to memory errors depends on write frequency and access patterns, different programming models may exhibit different failure rates and may alleviate the performance loss caused by the overhead of fault-tolerance mechanisms. In this paper, we explore how tolerant two main parallel programming models, message passing and shared memory, are to memory errors: we perform a memory vulnerability analysis and conduct error propagation experiments to observe how memory errors propagate through program flow. Our results show the need for soft-error resiliency methods based on the memory behavior of programs, and for evaluating the tradeoffs between performance and reliability.
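
As a rough illustration of what an error propagation experiment can look like, the sketch below flips a single bit in one input value and compares the result against a fault-free run. It is not the methodology of the paper: the kernel (a sum of squares), the helper names `flip_bit` and `run`, and the chosen bit positions are all assumptions made for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the IEEE 754 double encoding of value."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def run(data, fault=None):
    """Hypothetical kernel: sum of squares.

    fault=(index, bit) corrupts one element before it is read,
    emulating a single-bit memory error.
    """
    values = list(data)
    if fault is not None:
        index, bit = fault
        values[index] = flip_bit(values[index], bit)
    total = 0.0
    for x in values:
        total += x * x
    return total


inputs = [1.0] * 8
golden = run(inputs)                       # fault-free reference result
masked = run(inputs, fault=(3, 63))        # sign bit: 1.0 becomes -1.0
corrupted = run(inputs, fault=(3, 52))     # lowest exponent bit: 1.0 becomes 0.5

print(golden, masked, corrupted)
```

In this toy kernel the sign-bit flip is masked because the value is squared, while the exponent-bit flip silently changes the result. The same single-bit error can therefore be benign or harmful depending on how the affected data is used.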
