Chapter # 11

May 25, 2025

Fault Tolerance

Fault tolerance is the system’s ability to continue functioning without interruption, even when one or more of its components fail. It ensures that service remains available, recovers from errors completely and transparently, and handles malfunctions in a way that users may not even notice.
However, fault tolerance comes with trade-offs:

- It may slow down the system.
- It may require more storage and hardware.
- It may increase operational costs.

Thus, fault tolerance is always a balance between cost and the desired level of reliability.

Failure vs Error

These terms are related but not the same:

Error: An incorrect part of the system's state, usually caused by a defect or flaw (a fault) in its logic or hardware, that might lead to a failure. Errors can exist without causing a failure if they are caught or corrected in time.
Failure: Happens when the system deviates from its expected behavior, such as becoming unreachable or giving wrong results. A failure is the visible result of one or more errors.

In summary:
Error → may lead to → Failure
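To make the distinction concrete, here is a minimal sketch of an error that is caught before it turns into a failure. The toy checksum, the `read_block` helper, and the sample data are all assumptions made up for this illustration.

```python
def checksum(data: bytes) -> int:
    """Toy checksum (an assumption for this sketch): byte sum modulo 256."""
    return sum(data) % 256


def read_block(primary: bytes, stored_sum: int, backup: bytes) -> bytes:
    """Return a good copy of the block, masking an error in the primary copy."""
    if checksum(primary) == stored_sum:
        return primary  # no error present
    # Error detected: the primary copy does not match its checksum.
    # Serving the backup corrects it before any caller sees a wrong value,
    # so the error never becomes a visible failure.
    return backup


good = b"balance=100"
corrupted = b"balance=900"  # an error in the stored state

result = read_block(corrupted, checksum(good), backup=good)
print(result)  # b'balance=100'
```

Without the checksum test, the caller would have received `balance=900`, and the same error would have shown up as a failure.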

Phases of Fault Tolerance

Fault-tolerant systems usually go through the following phases:

Error Detection
- The system must first identify that something is wrong.
Damage Limitation
- Prevent the error from spreading to other parts of the system.
Error Recovery
- Fix the error or restore the system to a stable state to avoid failure.
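The three phases can be sketched in a few lines. This is only an illustration under simple assumptions: the `Component` class, the "counter must never be negative" invariant, and the checkpoint-based rollback are hypothetical choices, not a prescribed design.

```python
import copy


class Component:
    """Hypothetical component with a tiny state and a saved checkpoint."""

    def __init__(self, name):
        self.name = name
        self.state = {"counter": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.isolated = False

    def step(self, corrupt=False):
        self.state["counter"] += 1
        if corrupt:
            self.state["counter"] = -1  # simulated error in the state


def detect(c):
    """Phase 1, error detection: check an invariant on the state."""
    return c.state["counter"] < 0


def limit_damage(c):
    """Phase 2, damage limitation: isolate the component from the rest."""
    c.isolated = True


def recover(c):
    """Phase 3, error recovery: roll back to the last known-good state."""
    c.state = copy.deepcopy(c.checkpoint)
    c.isolated = False


c = Component("worker-1")
c.step()
c.checkpoint = copy.deepcopy(c.state)  # known-good state, counter == 1
c.step(corrupt=True)                   # the error appears here

if detect(c):
    limit_damage(c)
    recover(c)

print(c.state)  # {'counter': 1}
```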

Processor Faults

Processor faults occur when a processor behaves in an unexpected or incorrect manner. These faults can be categorized into three types:

Fail-Stop
- The processor completely stops functioning and never responds again.
- Other processors can detect that it has failed.
Slowdown
- The processor continues to run, but in a degraded (slower) state.
- It might eventually fail completely.
Byzantine Fault
- The processor may:
  - Fail intermittently,
  - Run slowly or incorrectly, or
  - Appear to function normally while giving incorrect outputs on purpose.
- This is the most unpredictable and difficult type to detect.
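A common way for other processors to notice a fail-stop fault is a heartbeat with a timeout. The sketch below assumes a hypothetical `HeartbeatMonitor` class and passes timestamps in by hand so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Suspects any processor whose last heartbeat is older than `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, proc, now):
        # Record the time at which `proc` last reported in.
        self.last_seen[proc] = now

    def suspected(self, now):
        # Processors that have been silent for longer than the timeout.
        return sorted(
            p for p, t in self.last_seen.items() if now - t > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("p1", now=0.0)
monitor.heartbeat("p2", now=0.0)
monitor.heartbeat("p1", now=2.0)  # p1 keeps reporting, p2 has stopped

suspects = monitor.suspected(now=4.0)
print(suspects)  # ['p2']
```

Note the limits of this approach: a timeout cannot tell a stopped processor from a very slow one, and a Byzantine processor can keep sending heartbeats while returning wrong results.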

Network Faults

Network faults occur when processors are unable to communicate properly with one another. Common types include:

One-Way Link
- A communication link that works in only one direction: one processor can send messages to the other, but the other cannot send messages back.
Network Partition
- A situation where a portion of the network becomes isolated, meaning that some processors can no longer communicate with others in a different segment.
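Both fault types can be illustrated by modelling the network as a set of directed links. The sketch below is a simplification with made-up node names. It also assumes that only links working in both directions count as usable when the nodes are grouped into partitions.

```python
def one_way_links(links):
    """Pairs (a, b) where a can send to b but b cannot send back to a."""
    return sorted((a, b) for (a, b) in links if (b, a) not in links)


def partitions(nodes, links):
    """Group nodes that can still exchange messages in both directions."""
    # Assumption of this sketch: only two-way links are treated as usable.
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        if (b, a) in links:
            neighbours[a].add(b)

    groups, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first search from `start`
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(neighbours[n] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


nodes = {"A", "B", "C", "D"}
# A<->B and C<->D work in both directions; B->C works in one direction only.
links = {("A", "B"), ("B", "A"), ("C", "D"), ("D", "C"), ("B", "C")}

print(one_way_links(links))      # [('B', 'C')]
print(partitions(nodes, links))  # [['A', 'B'], ['C', 'D']]
```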

Attributes of a Fault-Tolerant System

A fault-tolerant system is designed to operate reliably even when parts of it fail. Key attributes include:

Availability
- The system is ready and able to perform its functions at a given moment.
- A highly available system is accessible and operational almost all the time.
Reliability
- The ability of the system to operate continuously without failure.
- Measured over a period of time, not just at a single moment.
Safety
- Even if the system fails, it does so in a way that doesn’t cause major harm or dangerous outcomes.
- It also prevents the fault from affecting other parts of the system.
Maintainability
- Faults or failures in the system can be easily detected, diagnosed, and fixed.
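The difference between availability (a given moment) and reliability (a period of time) is easier to see with numbers. The formulas below are common textbook models, not something this chapter defines: steady-state availability as MTTF / (MTTF + MTTR), and reliability as exp(-λt), which assumes a constant failure rate λ. The function names and sample figures are illustrative.

```python
import math


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(failure_rate_per_hour, hours):
    """Probability of running `hours` without any failure.

    Assumes a constant failure rate (exponential model).
    """
    return math.exp(-failure_rate_per_hour * hours)


# Fails on average every 999 hours and takes 1 hour to repair.
a = availability(mttf_hours=999.0, mttr_hours=1.0)

# One failure per 1000 hours on average, observed over 100 hours.
r = reliability(failure_rate_per_hour=0.001, hours=100.0)

print(f"availability = {a:.3f}")  # availability = 0.999
print(f"reliability  = {r:.3f}")  # reliability  = 0.905
```

A system can score high on one and low on the other. One that fails briefly but often may be highly available yet unreliable.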

Types of Failure in Distributed Systems

Type of Failure	Description
Crash Failure	The server completely halts, but all operations before the crash were performed correctly.
Omission Failure	The server fails to respond to requests. It can be broken down into two subtypes:
→ Receive Omission	The server fails to receive incoming messages.
→ Send Omission	The server fails to send outgoing messages.
Timing Failure	The server responds, but outside the expected time interval (either too late or too early).
Response Failure	The server responds, but the response is incorrect. This includes:
→ Value Failure	The actual value returned is wrong.
→ State Transition Failure	The server does not follow the correct control flow or state sequence.
Arbitrary Failure (Byzantine)	The server may produce arbitrary, unpredictable, or incorrect outputs at any time. This is the most severe failure type and the hardest to handle.
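From a client's point of view, some of the failure types in the table can be told apart by checking a single reply. The sketch below is a rough illustration. The `Failure` enum, the `classify` function, and its parameters are hypothetical names. It assumes the client already knows the expected value, which is rarely true in practice.

```python
from enum import Enum


class Failure(Enum):
    NONE = "none"
    OMISSION = "omission"
    TIMING = "timing"
    VALUE = "value"


def classify(response, elapsed, expected, min_time, max_time):
    """Classify one request/response pair as seen by the client."""
    if response is None:
        # No reply at all. From a single request, a crashed server
        # looks the same as an omission failure.
        return Failure.OMISSION
    if not (min_time <= elapsed <= max_time):
        return Failure.TIMING  # reply arrived too early or too late
    if response != expected:
        return Failure.VALUE   # reply arrived on time but is wrong
    return Failure.NONE


print(classify("ok", 0.2, "ok", 0.0, 1.0).value)    # none
print(classify(None, 0.0, "ok", 0.0, 1.0).value)    # omission
print(classify("ok", 2.5, "ok", 0.0, 1.0).value)    # timing
print(classify("oops", 0.2, "ok", 0.0, 1.0).value)  # value
```

State transition failures and arbitrary (Byzantine) failures are left out on purpose, because they generally cannot be recognized from one reply.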

Fault Tolerance Mechanisms in Distributed Systems

Distributed systems use various techniques to tolerate and recover from faults:

1. Replication-Based Fault Tolerance Technique

Data or services are replicated across multiple machines or servers.
This ensures that if one server fails, another can take over without stopping the entire system.
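A minimal sketch of the idea, assuming a read-only workload and in-memory replicas. The `Replica` class, the `read_with_failover` function, and the sample key are hypothetical. Keeping replicas consistent when data changes is not covered here.

```python
class Replica:
    """Hypothetical server holding a full copy of the data."""

    def __init__(self, name, data, alive=True):
        self.name = name
        self.data = dict(data)
        self.alive = alive

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read."""
    for replica in replicas:
        try:
            return replica.read(key), replica.name
        except ConnectionError:
            continue  # this replica failed, so try the next one
    raise RuntimeError("all replicas failed")


data = {"user:1": "alice"}
replicas = [
    Replica("r1", data, alive=False),  # simulated crash
    Replica("r2", data),
    Replica("r3", data),
]

value, served_by = read_with_failover(replicas, "user:1")
print(value, served_by)  # alice r2
```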

2. Process-Level Redundancy (PLR) Techniques

PLR means running extra copies of a process to catch and fix problems.

Main Points:
- Works best for temporary (transient) faults that fix themselves.
- Uses software tools to handle errors.
- The system takes regular checkpoints (saves its state), so it can go back if something goes wrong.
- PLR compares results from different process copies to make sure everything is running correctly.
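The comparison step can be sketched as a majority vote over redundant copies. This is a simplified illustration: a real PLR setup would typically run the copies as separate operating-system processes, while the sketch uses plain function calls. The names `run_redundant`, `square`, and `faulty_square` are made up for the example. Checkpointing is not shown.

```python
from collections import Counter


def run_redundant(copies, x):
    """Run every copy on the same input and return the majority result."""
    results = [copy_fn(x) for copy_fn in copies]
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority, cannot mask the fault")
    return winner, results


def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1  # simulates a copy hit by a transient fault


result, raw = run_redundant([square, faulty_square, square], 7)
print(raw)     # [49, 50, 49]
print(result)  # 49
```

With three copies, one faulty result is outvoted. With only two copies, a mismatch can be detected but not corrected.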

3. Fusion-Based Redundancy Technique

This method is a cheaper alternative to keeping a full backup of every machine.
Main Points:
- Uses fewer backup machines: each backup holds combined (fused) data that covers several systems.
- Costs less than making full copies of everything.
- Good for systems where failures don't happen often.
- Recovery takes more time and effort, but it’s worth it when problems are rare.
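One simple way to picture a fused backup is XOR parity: a single backup stores the XOR of several primaries' states and can rebuild any one of them if it is lost. This is only an illustration with made-up integer states. Real fusion-based schemes may use different coding techniques, and this sketch tolerates just one failure at a time.

```python
from functools import reduce
from operator import xor


def fuse(states):
    """Combine several primary states into one fused backup value."""
    return reduce(xor, states, 0)


def recover(fused, surviving_states):
    """Rebuild the single missing state from the backup and the survivors."""
    return reduce(xor, surviving_states, fused)


primaries = {"p1": 11, "p2": 6, "p3": 42}
backup = fuse(primaries.values())  # one backup covers three primaries

# p2 crashes: recovery has to read every surviving primary plus the backup.
survivors = [v for name, v in primaries.items() if name != "p2"]
recovered = recover(backup, survivors)

print(backup)     # 39
print(recovered)  # 6
```

Full replication would need three backup machines here instead of one. The price is that recovery touches every surviving primary, which matches the point above about recovery taking more time and effort.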

Abdul_ahad
