John A. Gunnels Department of Computer Sciences The University of Texas at Austin Taylor Hall 2.124 Austin, TX 78712 gunnels@cs.utexas.edu |
Daniel S. Katz Jet Propulsion Laboratory California Institute of Technology Pasadena, CA 91109-8099 Daniel.S.Katz@jpl.nasa.gov |
Enrique S. Quintana-Ortí Dept. de Informática Universidad Jaume I 12080 Castellón Spain quintana@inf.uji.es |
Robert A. van de Geijn Department of Computer Sciences The University of Texas at Austin Taylor Hall 2.124 Austin, TX 78712 rvdg@cs.utexas.edu |
In this paper, we extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C = A B , in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B . Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.