The document discusses hardware errors in computing systems and approaches to handling them in software. It covers several key points:
- Hardware errors can occur due to phenomena like radiation, thermal stress, and aging. Most errors are "masked" and do not affect software behavior.
- Past studies found that around 65% of permanent hardware errors corrupt operating system state before the system crashes. Some components are more susceptible to errors than others.
- Challenges include detecting and correcting errors in software as well as handling errors in binary applications. Opportunities exist to avoid tracking harmless errors and leverage hardware-level concurrency.
- Existing fault tolerance approaches range from specialized hardware to software replication in research microkernels and commercial systems.
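
As a rough illustration of the software replication idea mentioned in the last point, the sketch below runs a computation several times and takes a majority vote over the results. It is a minimal example under stated assumptions, not the mechanism used by any of the systems the document surveys: the function name `replicated_call` and the `replicas` parameter are hypothetical, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")


def replicated_call(compute: Callable[[], T], replicas: int = 3) -> T:
    """Run `compute` `replicas` times and return the majority result.

    Hypothetical helper for illustration; assumes results are hashable.
    """
    # Execute the computation once per replica and collect every result.
    results = [compute() for _ in range(replicas)]

    # Find the most common result and how many replicas produced it.
    value, votes = Counter(results).most_common(1)[0]

    # Require a strict majority; otherwise the error is detected
    # but cannot be corrected by voting.
    if votes <= replicas // 2:
        raise RuntimeError("no majority among replicas")
    return value


# Usage: one of three runs returns a corrupted value and is outvoted.
outputs = iter([42, 41, 42])
print(replicated_call(lambda: next(outputs)))
```

Replicas that run sequentially on the same hardware, as here, can at best mask transient errors; a permanent fault may affect every replica in the same way, which is one reason real systems tend to place replicas on separate cores or machines.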