Tolerant architecture
Introduction
Fault tolerance is the property that allows a system to continue functioning correctly in the event of failure of one or more of its components. If its performance decreases, the decrease is proportional to the severity of the failure, compared to a system naively designed so that even a small failure can cause a total system collapse. Fault tolerance is particularly sought after in high availability systems.
A fault-tolerant design is a system that is able to continue operating when any component of the system fails.,[1] possibly at a lower level, which is better than the system failing completely. The term is commonly used to describe computer-based systems designed to continue to a greater or lesser extent the operations it performs with, at best, a reduction in performance or an increase in response times for failing components. This means that the system does not stop due to a software or hardware failure. An example in another branch is that of a car designed to continue operating if one of its tires receives a puncture.
Fault tolerance is only a property of each machine, it can also characterize the rules according to which they interact. For example, the TCP protocol is designed to enable reliable two-way communication on a packet-switched network, even in the presence of communications links that are imperfect or overloaded. This is because at the communication ends packet loss, duplication, reordering and corruption can be expected, so these conditions do not damage the integrity of the data, and only reduce capacity by a proportional amount.
Error recovery in fault-tolerant systems can be characterized as forward or backward. When the system detects that an error has been made, "go forward" recovery takes the state of the system at that time and corrects it, so it can move forward. "Rollback" recovery recovers the system state to some of the earlier and correct version, for example using recovery points, and moves forward. Rollback recovery requires that operations between the checkpoint and detected errors can be unalterable. Some systems make use of both types of error recovery for different parts of the same error.
At the level of an individual system, fault tolerance can be achieved by anticipating exceptional conditions and creating the system to cope with the situation, and generally in order to self-stabilize so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, the best solution may be to use some form of mirroring. In any case, if the consequence of a system failure is so catastrophic, the system must be able to use rollback to return to a safe mode. This is similar to rollback, but can be a human action if humans are present in the cycle.