Chapter # 11
Fault Tolerance
Fault tolerance is the system’s ability to continue functioning without interruption, even when one or more of its components fail. It ensures that service remains available, recovers from errors completely and transparently, and handles malfunctions in a way that users may not even notice.
However, fault tolerance comes with trade-offs:
- It may slow down the system.
- It may require more storage and hardware.
- It may increase operational costs.
Thus, fault tolerance is always a balance between cost and the desired level of reliability.
Failure vs Error
These terms are related but not the same:
- Error: A defect or flaw in the system's logic or state that might lead to a failure. Errors can exist without causing failure if they are caught or corrected in time.
- Failure: Happens when the system deviates from its expected behavior, such as becoming unreachable or giving wrong results. A failure is the visible result of one or more errors.
In summary:
Error → may lead to → Failure
Phases of Fault Tolerance
Fault-tolerant systems usually go through the following phases:
- Error Detection: The system must first identify that something is wrong.
- Damage Limitation: Prevent the error from spreading to other parts of the system.
- Error Recovery: Fix the error or restore the system to a stable state to avoid failure.
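
The three phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the `Component` class, its CRC32 checksum, and the snapshot used for rollback are all hypothetical names chosen for this example.

```python
import zlib


class Component:
    """Hypothetical component whose state is guarded by a CRC32 checksum."""

    def __init__(self, name, data):
        self.name = name
        self.data = data
        self.checksum = zlib.crc32(data)
        self.snapshot = data      # last known-good state
        self.isolated = False

    def detect_error(self):
        # Phase 1, error detection: the stored checksum no longer matches.
        return zlib.crc32(self.data) != self.checksum

    def limit_damage(self):
        # Phase 2, damage limitation: stop serving so bad data cannot spread.
        self.isolated = True

    def recover(self):
        # Phase 3, error recovery: roll back to the last known-good state.
        self.data = self.snapshot
        self.checksum = zlib.crc32(self.data)
        self.isolated = False


def handle(component):
    """Run the three phases in order; return True if a recovery happened."""
    if not component.detect_error():
        return False
    component.limit_damage()
    component.recover()
    return True
```

Calling `handle(component)` on a healthy component does nothing and returns `False`. If the data has been corrupted since the checksum was taken, the component is isolated, rolled back, and put back into service.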
Processor Faults
Processor faults occur when a processor behaves in an unexpected or incorrect manner. These faults can be categorized into three types:
- Fail-Stop
  - The processor completely stops functioning and never responds again.
  - Other processors can detect that it has failed.
- Slowdown
  - The processor continues to run, but in a degraded (slower) state.
  - It might eventually fail completely.
- Byzantine Fault
  - The processor may:
    - Fail intermittently,
    - Run slowly or incorrectly, or
    - Appear to function normally while giving incorrect outputs on purpose.
  - This is the most unpredictable and difficult type to detect.
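
Heartbeats with timeouts are one common way for other processors to notice fail-stop and slowdown faults. The sketch below is illustrative only: `HeartbeatMonitor` and its two thresholds are assumptions made for this example, and a timeout alone cannot tell a stopped processor from a very slow one or from a broken network link. Byzantine faults are out of scope here, since a Byzantine processor may keep sending heartbeats on time.

```python
class HeartbeatMonitor:
    """Classify peers by the age of their last heartbeat (illustrative thresholds)."""

    def __init__(self, slow_after=1.0, dead_after=5.0):
        self.slow_after = slow_after    # seconds before a peer is suspected slow
        self.dead_after = dead_after    # seconds before a peer is presumed stopped
        self.last_seen = {}

    def heartbeat(self, peer, now):
        # Record the time at which a heartbeat from `peer` arrived.
        self.last_seen[peer] = now

    def status(self, peer, now):
        if peer not in self.last_seen:
            return "unknown"
        age = now - self.last_seen[peer]
        if age >= self.dead_after:
            return "presumed-failed"    # consistent with a fail-stop fault
        if age >= self.slow_after:
            return "suspected-slow"     # consistent with a slowdown fault
        return "alive"
```

Time is passed in explicitly to keep the example deterministic. A real monitor would more likely read a monotonic clock such as `time.monotonic()`.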
Network Faults
Network faults occur when processors are unable to communicate properly with one another. Common types include:
- One-Way Link
  - A communication link that works in only one direction: one processor can send messages to the other, but messages sent the other way are never received.
- Network Partition
  - A situation where a portion of the network becomes isolated, meaning that some processors can no longer communicate with others in a different segment.
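
Both kinds of network fault can be spotted from a reachability map. The sketch below assumes a hypothetical dictionary `can_reach`, where `can_reach[a]` is the set of nodes that `a` can currently deliver messages to. How that map is collected (for example, by pings) is outside the example.

```python
def one_way_links(can_reach):
    """Return (a, b) pairs where a reaches b but b does not reach a."""
    return sorted(
        (a, b)
        for a, targets in can_reach.items()
        for b in targets
        if a not in can_reach.get(b, set())
    )


def partitions(can_reach):
    """Group nodes into sets connected by links that work in both directions."""
    nodes = set(can_reach) | {b for t in can_reach.values() for b in t}
    # Keep only links that work both ways.
    two_way = {
        n: {m for m in can_reach.get(n, set()) if n in can_reach.get(m, set())}
        for n in nodes
    }
    groups, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first search over two-way links, starting from `start`.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(two_way[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

If `partitions` returns more than one group, the network is partitioned. Every pair returned by `one_way_links` is a link that delivers messages in one direction only.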
Attributes of a Fault-Tolerant System
A fault-tolerant system is designed to operate reliably even when parts of it fail. Key attributes include:
- Availability
  - The system is ready and able to perform its functions at a given moment.
  - A highly available system is accessible and operational almost all the time.
- Reliability
  - The ability of the system to operate continuously without failure.
  - Measured over a period of time, not just at a single moment.
- Safety
  - Even if the system fails, it does so in a way that doesn’t cause major harm or dangerous outcomes.
  - It also prevents the fault from affecting other parts of the system.
- Maintainability
  - Faults or failures in the system can be easily detected, diagnosed, and fixed.
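
Availability and reliability are often quantified using the mean time to failure (MTTF) and the mean time to repair (MTTR). The sketch below uses the standard steady-state formula for availability and, as an explicit assumption, a constant failure rate for reliability. The function names and the numbers in the usage note are made up for illustration.

```python
import math


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(t_hours, mttf_hours):
    """Probability of running t_hours without failure (constant failure rate assumed)."""
    return math.exp(-t_hours / mttf_hours)


def downtime_per_year_hours(avail):
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24
```

For example, `availability(999.0, 1.0)` gives 0.999, which corresponds to roughly 8.76 hours of downtime per year. The same system could still have poor reliability if it fails often but recovers quickly, which is why the two attributes are listed separately.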
Types of Failure in Distributed Systems
| Type of Failure | Description |
|---|---|
| Crash Failure | The server completely halts, but all operations before the crash were performed correctly. |
| Omission Failure | The server fails to respond to requests. It can be broken down into two subtypes: |
| → Receive Omission | The server fails to receive incoming messages. |
| → Send Omission | The server fails to send outgoing messages. |
| Timing Failure | The server responds, but outside the expected time interval (either too late or too early). |
| Response Failure | The server responds, but the response is incorrect. This includes: |
| → Value Failure | The actual value returned is wrong. |
| → State Transition Failure | The server does not follow the correct control flow or state sequence. |
| Arbitrary Failure (Byzantine) | The server may produce arbitrary, unpredictable, or incorrect outputs at any time. This is the most severe failure type and the most difficult to handle. |
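
From a client's point of view, several rows of the table can be told apart by looking at a single request. The sketch below is a rough illustration. The `classify` function, its parameters, and the default time window are assumptions made for this example. A single observation cannot separate a crash from an omission, nor reveal a Byzantine server that sometimes answers correctly.

```python
def classify(response, elapsed, expected, min_time=0.0, max_time=1.0):
    """Map one observed request outcome to a failure type from the table."""
    if response is None:
        # No reply at all: a crash and an omission look identical from here.
        return "crash or omission failure"
    if not (min_time <= elapsed <= max_time):
        # A reply arrived, but too early or too late.
        return "timing failure"
    if response != expected:
        # A reply arrived on time, but its value is wrong.
        return "response failure (value)"
    return "ok"
```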
Fault Tolerance Mechanisms in Distributed Systems
Distributed systems use various techniques to tolerate and recover from faults:
1. Replication-Based Fault Tolerance Technique
- Data or services are replicated across multiple machines or servers.
- This ensures that if one server fails, another can take over without stopping the entire system.
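
A minimal sketch of the idea, assuming hypothetical in-memory `Replica` objects: writes go to every reachable replica, and reads fail over to the next replica when one is down. The sketch deliberately ignores resynchronizing a replica that missed writes while it was down, which a real system would have to handle.

```python
class Replica:
    """Hypothetical in-memory replica that can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.up = True

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.store[key] = value

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.store[key]


class ReplicatedStore:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Write to every reachable replica; fail only if none accepted it.
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                acks += 1
            except ConnectionError:
                pass
        if acks == 0:
            raise ConnectionError("no replica available")
        return acks

    def read(self, key):
        # Fail over: try replicas in order until one answers.
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue
        raise ConnectionError("no replica available")
```

With two replicas, the store keeps answering reads after one of them fails, and only raises an error once both are down.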
2. Process-Level Redundancy (PLR) Techniques
PLR means running extra copies of a process to catch and fix problems.
Main Points:
- Works best for temporary (transient) faults that fix themselves.
- Handles errors in software rather than relying on specialized hardware.
- The system takes regular checkpoints (saves its state), so it can go back if something goes wrong.
- PLR compares results from different process copies to make sure everything is running correctly.
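
The combination of redundant copies, result comparison, and checkpoints can be sketched as follows. This is a simplified illustration, not an actual PLR implementation: real PLR runs separate operating-system processes, whereas `run_redundant` (a hypothetical name) simply calls a function several times in one process. It also assumes that results are hashable so they can be counted.

```python
import copy
from collections import Counter


def run_redundant(step, state, copies=3, max_retries=2):
    """Run `step` on `copies` independent copies of `state` and majority-vote.

    If no majority is found, roll back to the checkpoint and retry, which
    is enough for transient faults that disappear on re-execution.
    """
    checkpoint = copy.deepcopy(state)    # saved state to roll back to
    for _ in range(max_retries + 1):
        # Each copy starts from its own fresh copy of the checkpoint.
        results = [step(copy.deepcopy(checkpoint)) for _ in range(copies)]
        value, votes = Counter(results).most_common(1)[0]
        if votes > copies // 2:
            return value                 # a majority agrees: accept the result
    raise RuntimeError("no majority after retries")
```

With three copies, a single corrupted result is outvoted by the other two. If every copy disagrees, the function retries from the checkpoint and gives up after `max_retries` further attempts.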
3. Fusion-Based Redundancy Technique
This method is a lower-cost alternative to full replication.
Main Points:
- Uses fewer backup machines: each fused backup covers several primary systems.
- Costs less than making full copies of everything.
- Good for systems where failures don't happen often.
- Recovery takes more time and effort, but it’s worth it when problems are rare.
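
The trade-off can be illustrated with a simple XOR parity scheme, in which one backup value covers several primaries. This is only an analogy chosen for this example: actual fusion-based techniques use more elaborate fused data structures and coding schemes. The `fuse` and `recover` functions are hypothetical names, and the sketch assumes that each primary's state is a single integer and that at most one primary fails.

```python
from functools import reduce


def fuse(values):
    """Build one fused backup by XOR-ing the integer states of all primaries."""
    return reduce(lambda a, b: a ^ b, values, 0)


def recover(fused, surviving):
    """Rebuild the single missing primary from the backup and the survivors."""
    return reduce(lambda a, b: a ^ b, surviving, fused)
```

A single backup value protects all the primaries instead of one full copy per primary, which is where the savings come from. Recovery, however, has to read the state of every surviving primary, which is why it takes more time and effort than switching to a ready-made replica.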