A High-Performance Fault-Tolerant Software Framework for Memory on Commodity GPUs

Naoya Maruyama; Akira Nukada; Satoshi Matsuoka

doi:10.1109/IPDPS.2010.5470473

Publication Information

Title

Japanese:
English:	A High-Performance Fault-Tolerant Software Framework for Memory on Commodity GPUs

Authors

Japanese:	丸山直也, 額田彰, 松岡聡.
English:	Naoya Maruyama, Akira Nukada, Satoshi Matsuoka.

Language

English

Journal/Book title

Japanese:
English:	24th IEEE International Parallel and Distributed Processing Symposium (IPDPS'10)

Volume, Issue, Pages

Publication date

April 2010

Publisher

Japanese:
English:

Conference name

Japanese:
English:	24th IEEE International Parallel and Distributed Processing Symposium (IPDPS'10)

Conference location

Japanese:
English:	Atlanta, USA

Official link

http://www.ipdps.org/

DOI

https://doi.org/10.1109/IPDPS.2010.5470473

Abstract

As GPUs are used to accelerate HPC applications by allowing more flexibility and programmability, their fault tolerance is becoming much more important than before when they were used only for graphics. The current generation of GPUs, however, does not have standard error detection and correction capabilities, such as SEC-DED ECC for DRAM, which is almost always exercised in HPC servers. We present a high-performance software framework to enhance commodity off-the-shelf GPUs with DRAM fault tolerance. It combines data coding for detecting bit-flip errors and checkpointing for recovering computations when such errors are detected. We analyze performance of data coding in GPUs and present optimizations geared toward memory-intensive GPU applications. We present performance studies of the prototype implementation of the framework and show that the proposed framework can be realized with negligible overheads in compute-intensive applications such as N-body problem and matrix multiplication, and as low as 35% in a highly-efficient memory-intensive 3-D FFT kernel.
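
The combination of data coding and checkpointing described in the abstract can be illustrated with a small host-side sketch. The abstract does not say which code the framework uses, so the example below assumes a simple XOR parity word per block of data, and it runs in plain Python, not on a GPU. The names `encode`, `check`, `run_with_recovery`, and `BLOCK` are hypothetical and chosen only for this illustration.

```python
from functools import reduce
from operator import xor

BLOCK = 4  # words per parity block (illustrative choice, not from the paper)


def encode(words):
    """Return one XOR parity word for each block of BLOCK words."""
    return [reduce(xor, words[i:i + BLOCK], 0)
            for i in range(0, len(words), BLOCK)]


def check(words, parity):
    """Return True if every block still matches its stored parity word."""
    return encode(words) == parity


def run_with_recovery(data, step, n_steps, inject_fault_at=None):
    """Apply `step` n_steps times; roll back to the last checkpoint
    whenever the parity check detects corrupted data."""
    data = list(data)              # work on a private copy
    checkpoint = (0, list(data))   # (step index, saved state)
    parity = encode(data)
    i = 0
    rollbacks = 0
    while i < n_steps:
        if inject_fault_at == i:
            data[0] ^= 1 << 7      # simulate a single bit flip in memory
            inject_fault_at = None
        if not check(data, parity):
            # Error detected: restore the last known-good state and retry.
            i, saved = checkpoint
            data = list(saved)
            parity = encode(data)
            rollbacks += 1
            continue
        data = step(data)
        parity = encode(data)      # re-encode after every update
        i += 1
        checkpoint = (i, list(data))
    return data, rollbacks
```

A run with `inject_fault_at` set should return the same final data as a fault-free run, with one rollback recorded. A single XOR parity word per block detects any single-bit flip in that block but cannot locate or correct it, which is why recovery here relies on the checkpoint.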
