The goal of any reliability project has traditionally been to prevent errors of any kind. In computer architecture, the MTTF (mean time to failure) metric counts any error that gets saved in the state as a failure. As the field has matured, researchers have discovered that not all errors cause a failure. Errors can be masked in the circuits, because not all inputs affect the final result. Thus, protecting against every error, rather than only those that change the result, wastes time and power.
This project proposes to explore this avenue: allowing errors that do not change the final result. In many applications, such as facial recognition or voice recognition, many data errors will not be noticed by the software, depending on the particular data. For example, if one bit gets flipped in an incoming audio signal for voice recognition, it may not affect the result at all. The proper word may be recognized despite the error in one sample. A key observation, however, is that even these applications are not very resistant to control flow errors. For example, if the voice recognition software stops before it completes its analysis of the audio signal, that would most certainly result in the wrong word being recognized. So, we observe that an error in the data is preferable to an error that changes what work the computer does.
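The contrast between a data error and a control flow error can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in chosen for illustration: `classify` is a toy threshold test rather than real voice recognition, and `flip_bit`, `loop_bound`, the sample values, and the threshold of 1000 are assumptions, not part of the project.

```python
def flip_bit(value: int, bit: int) -> int:
    """Invert one bit of an integer, modelling a single-bit upset."""
    return value ^ (1 << bit)


def classify(samples, loop_bound=None):
    """Toy stand-in for a recognizer (hypothetical, not real speech recognition).

    Labels the signal "loud" if its mean magnitude exceeds an arbitrary
    threshold. loop_bound is the control value that decides how many
    samples the loop processes.
    """
    if loop_bound is None:
        loop_bound = len(samples)
    total = 0
    for i in range(loop_bound):
        total += abs(samples[i])
    return "loud" if total / len(samples) > 1000 else "quiet"


samples = [1200, -1300, 1250, -1100, 1400, -1350, 1280, -1220]
clean = classify(samples)

# Data error: flip bit 3 of one sample; the mean magnitude barely moves.
corrupted = list(samples)
corrupted[2] = flip_bit(corrupted[2], 3)
data_error = classify(corrupted)

# Control error: flip the same bit in the loop bound (8 becomes 0),
# so the analysis stops before it looks at any sample.
control_error = classify(samples, loop_bound=flip_bit(len(samples), 3))

print(clean, data_error, control_error)
```

In this toy setup the same single-bit flip leaves the label unchanged when it lands in a sample, but changes the label when it lands in the loop bound.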
This project explores how to take advantage of this partial tolerance to unreliability. More efficient reliability mechanisms can be designed that are targeted only towards the important instructions, not all instructions. In even more tolerant applications, errors can be introduced into the system in order to speed up the system (allowing it to proceed without waiting for slow operations).
Our evidence shows that some instructions are more important than others. We have a compiler that identifies instructions to protect. If errors are randomly inserted anywhere in the code, catastrophic failures (failures causing early termination) occur very often. If we protect a subset of the instructions, even with the same rate of error insertion, the number of catastrophic failures is reduced dramatically.
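The shape of this experiment can be sketched as a small Monte Carlo simulation in Python. This is not the project's compiler or its measurement setup. It is a toy model built on stated assumptions: a program is a list of instructions tagged "control" or "data", only an error in an unprotected control instruction counts as catastrophic, and the names `run_trial` and `catastrophic_rate`, the 1% error rate, and the instruction mix are all hypothetical.

```python
import random


def run_trial(program, protected_kinds, error_rate, rng):
    """Return True if one simulated run ends in a catastrophic failure.

    Toy assumption: an injected error is catastrophic only when it hits
    a control instruction whose kind is not protected.
    """
    for kind in program:
        hit = rng.random() < error_rate
        if hit and kind == "control" and kind not in protected_kinds:
            return True
    return False


def catastrophic_rate(program, protected_kinds, error_rate, trials, seed=0):
    """Fraction of trials that end catastrophically, using a seeded generator."""
    rng = random.Random(seed)
    failures = sum(
        run_trial(program, protected_kinds, error_rate, rng)
        for _ in range(trials)
    )
    return failures / trials


# Hypothetical program: 200 instructions, every fifth one is control.
program = ["control" if i % 5 == 0 else "data" for i in range(200)]

# Same error insertion rate in both runs; only the protection differs.
unprotected = catastrophic_rate(program, set(), 0.01, 1000)
protected = catastrophic_rate(program, {"control"}, 0.01, 1000)

print(f"no protection:     {unprotected:.3f}")
print(f"control protected: {protected:.3f}")
```

Because the model makes protected instructions fully immune, the protected run reports no catastrophic failures at all; a real mechanism would be expected to reduce them rather than eliminate them.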
Darshan Thaker, Diana Franklin, John Oliver, Susmit Biswas, Derek Lockhart, Tzvetan Metodi, and Frederic T. Chong. "Characterization of Error-Tolerant Applications when Protecting Control Data," 2006 IEEE International Symposium on Workload Characterization, October 2006.
D. Thaker, D. Franklin, V. Akella, and F. Chong. "Reliability Requirements of Control, Address, and Data Operations in Error-Tolerant Applications," Workshop on Architectural Reliability, in conjunction with MICRO-2005, December 2005.