Tolerant Architecture

Introduction

Fault tolerance is the property that allows a system to continue functioning correctly when one or more of its components fail. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, in contrast to a naively designed system, in which even a small failure can cause total collapse. Fault tolerance is particularly sought after in high-availability systems.

A fault-tolerant design enables a system to continue operating, possibly at a reduced level, rather than failing completely when a component of the system fails.[1] The term is most commonly used to describe computer-based systems designed to continue their operations more or less fully, with at most a reduction in throughput or an increase in response time, in the event of a partial failure. That is, the system as a whole is not stopped by a hardware or software fault. An example from another field is a car designed to remain drivable if one of its tires is punctured.

Fault tolerance is not only a property of individual machines; it can also characterize the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication over a packet-switched network, even when communication links are imperfect or overloaded. It does so by requiring the communication endpoints to expect packet loss, duplication, reordering, and corruption, so that these conditions do not damage the integrity of the data and only reduce throughput by a proportional amount.
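The idea of an endpoint that expects duplication, reordering, and corruption can be illustrated with a toy receiver. This is a minimal sketch and not TCP's actual algorithm: the packet layout (sequence number, payload, CRC32 checksum) and the names `make_packet` and `reassemble` are assumptions made for illustration.

```python
import zlib

def make_packet(seq, payload):
    """Build a toy packet: (sequence number, payload bytes, CRC32 of payload)."""
    return (seq, payload, zlib.crc32(payload))

def reassemble(packets):
    """Rebuild the byte stream from packets that may arrive duplicated,
    reordered, or corrupted. Returns (data, missing_sequence_numbers)."""
    good = {}
    for seq, payload, checksum in packets:
        if zlib.crc32(payload) != checksum:
            continue  # corrupted: drop it and rely on a retransmission
        good.setdefault(seq, payload)  # a duplicate of a stored packet is ignored
    if not good:
        return b"", []
    highest = max(good)
    missing = [s for s in range(highest + 1) if s not in good]
    # Deliver only the contiguous prefix, as an in-order byte stream would.
    data = b""
    for s in range(highest + 1):
        if s not in good:
            break
        data += good[s]
    return data, missing

# Reordered, duplicated, and one corrupted packet that is later retransmitted.
received = [
    make_packet(1, b"lo "),
    make_packet(0, b"hel"),
    make_packet(1, b"lo "),
    (2, b"wXrld", zlib.crc32(b"world")),  # payload damaged in transit
    make_packet(2, b"world"),
]
data, missing = reassemble(received)
```

In this sketch the faults cost extra packets (lower effective throughput) but do not change the delivered data. A gap in the sequence is reported in `missing` so that the sender could be asked to retransmit.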

Error recovery in fault-tolerant systems can be characterized as either roll-forward or rollback. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it so that the system can move forward. Rollback recovery reverts the system state to an earlier, correct version, for example by using checkpoints, and moves forward from there. Rollback recovery requires that the operations between the checkpoint and the detected error can be made idempotent, so that repeating them is harmless. Some systems use both types of recovery, for different errors or for different parts of the same error.
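Rollback recovery with a recovery point can be sketched as follows. This is a minimal illustration under stated assumptions: the class `CheckpointedState`, the function `run_with_rollback`, and the simulated transient fault in `flaky_debit` are all hypothetical names invented for the example.

```python
import copy

class CheckpointedState:
    """Toy in-memory state with a single recovery point."""

    def __init__(self):
        self.state = {}
        self._saved = {}

    def checkpoint(self):
        # Save a copy of a state that is known to be correct.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: discard the current state, restore the checkpoint.
        self.state = copy.deepcopy(self._saved)

def run_with_rollback(store, operations, validate, max_attempts=2):
    """Apply operations; if validation detects an error, roll back and replay.
    Returns the number of attempts that were needed."""
    store.checkpoint()
    for attempt in range(1, max_attempts + 1):
        for op in operations:
            op(store.state)
        if validate(store.state):
            return attempt
        store.rollback()
    raise RuntimeError("error persists after rollback and replay")

calls = {"debit": 0}

def set_balance(state):
    state["balance"] = 100

def flaky_debit(state):
    calls["debit"] += 1
    if calls["debit"] == 1:
        state["balance"] = -1   # simulated transient fault on the first call
    else:
        state["balance"] -= 30

store = CheckpointedState()
attempts = run_with_rollback(
    store, [set_balance, flaky_debit], lambda s: s["balance"] >= 0
)
```

Replaying is safe here because the whole state is restored before the operations run again. Operations whose effects escape the restored state (for example, a message already sent to another system) would additionally have to be safe to repeat.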

At the level of an individual system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, by aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making the system sufficiently reliable is very high, a better solution may be to use some form of duplication, such as mirroring. In any case, if the consequences of a system failure are catastrophic, the system must be able to revert to a safe mode. This is similar to rollback recovery, but it can be a human action if humans are present in the loop.

Replication

Fault tolerance is fundamentally dealt with in the following three ways:

• Replication: provide multiple identical instances of the same system or subsystem, direct tasks or requests to all of them in parallel, and choose the correct result on the basis of a quorum;

• Redundancy: provide multiple identical instances of the same system and switch to one of the remaining instances in case of failure (failover);

• Diversity: provide multiple different implementations of the same specification, and use them like replicated systems to cope with errors in a specific implementation.

All RAID implementations except RAID 0 are examples of fault-tolerant data storage that uses data redundancy.
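The redundancy approach (switching to a remaining instance on failure) can be sketched as below. The function `call_with_failover` and the two example instances are hypothetical; this is an illustration of the idea rather than a production failover mechanism.

```python
def call_with_failover(instances, request):
    """Try identical instances in order and switch to the next one on failure.
    Returns (result, indices_of_failed_instances)."""
    failed = []
    for index, instance in enumerate(instances):
        try:
            return instance(request), failed
        except Exception:
            failed.append(index)  # remember the failure so it can be reported
    raise RuntimeError(f"all {len(instances)} instances failed")

def primary(request):
    raise ConnectionError("primary is down")  # simulated failure

def standby(request):
    return request.upper()

result, failed = call_with_failover([primary, standby], "ping")
```

Returning the list of failed instances alongside the result is a deliberate choice in this sketch: the caller gets an answer, but the failure of the primary remains visible instead of being silently masked.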

A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replicas of each element should be in the same state. The same inputs are supplied to each replica, and the same outputs are expected. The outputs of the replicas are compared using a voting circuit. A machine with two replicas of each element is termed dual modular redundant (DMR). Its voting circuit can only detect a mismatch, and recovery relies on other methods. A machine with three replicas of each element is termed triple modular redundant (TMR). Its voting circuit can determine which replica is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the faulty replica is assumed to differ from that of the other two, and the voting circuit can switch to DMR mode. This model can be applied to any larger number of replicas.
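The voting step can be sketched in software as a majority vote over the replica outputs. The function name `vote` and its return convention are assumptions made for illustration; a hardware voting circuit would compare signals on each clock cycle rather than Python values. The sketch assumes a non-empty list of hashable outputs.

```python
from collections import Counter

def vote(outputs):
    """Majority voter over replica outputs.
    Returns (value, faulty_indices). When no strict majority exists, value is
    None and every replica is listed, because the voter cannot tell which
    replica is wrong."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))  # mismatch detected, not resolved
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty

dual_ok = vote([5, 5])          # two replicas that agree
dual_mismatch = vote([5, 6])    # two replicas: detection only
triple = vote([7, 7, 9])        # three replicas: a two-to-one vote
```

With two replicas a disagreement can only be detected. With three, a two-to-one vote both yields the result and identifies the replica to exclude, after which the remaining two can continue as a pair.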

Lockstep fault-tolerant machines are most easily made fully synchronous, with each gate of each replica making the same state transition on the same clock edge, and with the replicas' clocks exactly in phase. However, it is possible to build lockstep systems without this requirement.

Bringing the replicas into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.
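Both ways of synchronizing replicas can be shown with a deterministic toy state machine. The class `Replica` and its methods are hypothetical and stand in for whatever state a real replica holds.

```python
import copy

class Replica:
    """Deterministic state machine: same start state + same inputs = same state."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.total = 0  # fixed initial (reset) state

    def step(self, x):
        self.total += x
        return self.total

    def copy_state_from(self, other):
        # Resynchronize by copying the stored state of a healthy replica.
        self.__dict__ = copy.deepcopy(other.__dict__)

a, b = Replica(), Replica()      # both start from the reset state
for x in (1, 2):
    a.step(x)
    b.step(x)

b.total = 99                     # simulated fault: b's state diverges
b.copy_state_from(a)             # bring b back into synchrony
outputs = (a.step(3), b.step(3))
```

After the copy, the two replicas produce identical outputs again for identical inputs, which is the precondition for comparing them.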

A variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating an error. Another pair operates in exactly the same way. A final circuit selects the output of the pair that does not report an error. Pair-and-spare requires four replicas rather than the three of TMR, but it has been used commercially.
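The selection logic of a pair and a spare pair can be sketched as follows. The names `run_pair` and `pair_and_spare` are assumptions for illustration, and the sketch evaluates the pairs one after the other, whereas in hardware both pairs would run at the same time.

```python
def run_pair(replica_a, replica_b, x):
    """One pair: returns (output, error_flag). A mismatch raises the flag."""
    out_a, out_b = replica_a(x), replica_b(x)
    return out_a, out_a != out_b

def pair_and_spare(pair1, pair2, x):
    """Select the output of the first pair that does not report an error."""
    for pair in (pair1, pair2):
        output, error = run_pair(pair[0], pair[1], x)
        if not error:
            return output
    raise RuntimeError("both pairs report an error")

def good(x):
    return x * x

def bad(x):
    return x * x + 1  # simulated faulty replica

selected = pair_and_spare((good, bad), (good, good), 4)
```

Note that a pair whose two replicas fail in exactly the same way would not raise its error flag; the scheme relies on faults showing up as a mismatch.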

Disadvantages

The advantages of fault-tolerant designs are obvious, while many of their disadvantages are not:

• Interference with fault detection in the same component. To continue the earlier car example, with a fault-tolerant tire it may not be obvious to the driver that a tire has been punctured. This is usually handled with a separate automated fault-detection system. In the case of the tire, an air pressure monitor detects the loss of pressure and notifies the driver. The alternative is a manual fault-detection system, such as inspecting all tires by hand at each stop.

• Interference with fault detection in another component. A variant of this problem arises when fault tolerance in one component prevents fault detection in a different component. For example, if component B performs some operation based on the output of component A, fault tolerance in B can hide a problem in A. If component B is later changed (to a less fault-tolerant design), the system may suddenly fail, giving the impression that the new component B is the problem. Only after the system has been carefully studied will it become clear that the root problem is actually in component A.

• Reduced priority of fault correction. Even if the operator is aware of the fault, having a fault-tolerant system is likely to reduce the urgency of repairing it. If faults are not corrected, the system will eventually fail, either when the fault-tolerant component fails completely or when all redundant components have also failed.

• Difficulty of testing. For certain critical fault-tolerant systems, such as a nuclear reactor, there is no easy way to verify that the backup components are functional. The most famous example of this is the Chernobyl disaster, where operators tested the emergency backup cooling by disabling the primary and secondary cooling. The backup failed, resulting in a core meltdown and a massive release of radiation.

• Cost. Both fault-tolerant components and redundant components tend to increase cost. This may be a purely economic cost or may include other measures, such as weight. Manned spacecraft, for example, have so many redundant and fault-tolerant components that they weigh dramatically more than unmanned systems, which do not require the same level of safety.

• Substandard components. A fault-tolerant design may allow the use of inferior components that would otherwise make the system inoperable. While this practice can offset the increased cost, the use of multiple inferior components may lower the reliability of the system to a level equal to, or even worse than, that of a comparable non-fault-tolerant system.
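The last point can be made concrete with a small calculation for a system of three replicas and a majority voter. It assumes independent replica failures and a perfect voter, which are simplifications not stated in the text above; under them, the system works when at least two of three replicas work, giving r^3 + 3r^2(1 - r) for a per-replica reliability r. The function name `tmr_reliability` is chosen for illustration.

```python
def tmr_reliability(r):
    """Probability that at least 2 of 3 independent replicas work,
    given per-replica reliability r and a perfect voter (assumptions)."""
    return r**3 + 3 * r**2 * (1 - r)

# Compare a single component against the three-replica arrangement.
comparison = {r: round(tmr_reliability(r), 3) for r in (0.9, 0.5, 0.4)}
```

Under these assumptions, triplication helps only when each replica is better than a coin flip: at r = 0.5 the two arrangements are equal, and below that the triplicated system is less reliable than a single component.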
