As I am going to work on applications of wireless sensor networks, fault tolerance will be a major topic for me. In fact I learned a lot from listening to the podcast and I am really looking forward to reading the book, which I can do soon thanks to the computer science library of ETH Zurich.
Besides giving clear definitions of the terms "fault", "error", "failure", "fault tolerance" and others, the book deals with patterns for how fault-tolerant systems should be designed, how faults and errors can be detected, how the system can recover from errors, how errors can be mitigated and how faults should be treated. I really recommend it to everybody who cares about making systems robust.
Update: As searching in and reviewing audio files is so cumbersome, I decided to listen to the podcast once more and write down the gist of it. So here it comes, heavily quoted from the podcast.
part 1 - 4:00 term definitions
- fault: the basic, actual defect
- failure: observed deviation from the requirements or the specification
- error: incorrect state caused by a fault that can lead to a failure
- fault tolerance: designing systems to tolerate faults and continue operation. This is done by detecting, isolating and processing faults and errors to keep them from becoming failures.
- reliability: probability that the system will perform its desired functionality during a specified period of time
- availability: proportion of time that the system is available to perform work
- reliability and availability are means of measuring how well fault tolerance works.
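A standard way to quantify availability (the usual textbook formula, not something stated in the podcast) is the mean time to failure divided by the mean time to failure plus the mean time to repair. A quick Python sketch:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Proportion of time the system is up: uptime / (uptime + downtime)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A system that runs 1000 hours between failures and takes
# 1 hour to repair is available about 99.9% of the time.
print(round(availability(1000.0, 1.0), 4))  # 0.999
```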
part 1 - 12:02 fault-tolerant mindset
This section is about what you should have in mind when designing fault-tolerant systems. The major advice is to keep asking yourself: "what could go wrong?"
Fault tolerance means more expense to develop and more resource consumption at runtime. Although building in fault tolerance means adding additional code to the system, a major rule is to keep it simple.
Software quality is necessary in order to achieve fault tolerance, but it is not sufficient. This is because software quality is about how well software is designed and built, while fault tolerance is about what to put into the system. To achieve good quality, common engineering practices such as defensive programming and formal verification apply here, too. The book considers these practices as basics on which it builds.
In contrast, n-versioned programming is considered impractical in many cases because it is too expensive.
part 1 - 21:08 shared context for the fault tolerance patterns
By definition there are no failures without a specification. So what your system should look like depends on the reliability and availability requirements.
The patterns are meant for soft real-time systems, where nothing catastrophic happens if a deadline is missed.
There are patterns for both stateful and stateless systems. (I don't know why this is particularly important)
There is an underlying assumption of external observers, which means that there are people or other systems interested in how well a target system is behaving. Nevertheless fault tolerance is achieved by the system itself, because the time from detection of a fault or error to resumption of normal operation should be as short as possible.
Fault tolerant systems tend to be long-lived systems because the additional effort for making a system fault tolerant has to pay off.
part 1 - 28:40 architectural patterns
These patterns cover architectural questions in general.
- units of mitigation: Define the blocks and boundaries within which you will contain and quarantine errors and failures to avoid error propagation.
- correcting audits: The system continuously checks the data in order to automatically correct it in case of an error. For example, structural properties and known correlations can be checked.
- redundancy: Having some parts of the system where you want explicitly to have multiple copies doing the same thing, such as storing the same information or performing the same computation.
- recovery block: A way to provide software redundancy by means of n-versioned programming. Try executing a code block and check the result against some acceptability criteria. If it does not pass the test, use a different block of code.
- minimize human intervention: Once you detect the problem, fix it automatically without waiting for or relying on an operator.
- maximize human participation: Sometimes people do know best, so provide means for experts to make useful contributions to the recovery. But keep them out of the critical path.
- maintenance interface: Have a dedicated maintenance and error-reporting channel separated from the normal input/output stream.
- someone in charge: For everything you do to keep the system fault tolerant, you should know which other part of the system, including the operating personnel, is going to monitor it and initiate escalation in case it fails.
- escalation: Build a hierarchy of resolution strategies. For every resolution, predefine at design time what to do next if it fails, so that at execution time recovery can be as fast as possible.
- fault observer: Central part of the system that receives the error reports and passes them on to external observers or other interested parts of the system.
- software update: Think of a mechanism to provide software updates, as fault-tolerant systems tend to be long-lived.
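The recovery block pattern from the list above can be sketched in a few lines of Python. The sort functions and the acceptance test here are made-up stand-ins for illustration, not examples from the book:

```python
def recovery_block(alternatives, acceptable):
    """Try each alternative implementation in order; return the first
    result that passes the acceptance test (recovery block pattern)."""
    for attempt in alternatives:
        try:
            result = attempt()
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptable(result):
            return result
    raise RuntimeError("all alternatives failed the acceptance test")

# Hypothetical example: a fast sort that may misbehave, backed up
# by a trusted but slower routine.
def fast_sort():
    return [3, 1, 2]          # stand-in for a buggy optimized routine

def safe_sort():
    return sorted([3, 1, 2])  # trusted fallback

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

print(recovery_block([fast_sort, safe_sort], is_sorted))  # [1, 2, 3]
```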
part 2 - 1:35 detection patterns
These patterns are about detecting faults and errors.
- fault correlation: Classifying detection reports in order to find the best techniques to resolve the problems.
- error containment barrier: Firewalls between the units of mitigation in order to isolate errors.
- complete parameter checking: Similar to programming by contract, check every return code and all return values.
- system monitor: Create a part of the system that will watch the important things. This is a high-level pattern which is implemented, for instance, with a watchdog or a heartbeat.
- heartbeat: Frequently either sending or requesting "I am ok"-messages, e.g. specific messages built into the system.
- watchdog: Third party monitor watching existing message flow and other information in order to decide if the system is ok.
- realistic thresholds: How to pick appropriate threshold values to use with the heartbeats.
- metrics: Detecting resource acquisition problems based on metrics. Use existing information and things that are easy to compute, because the moment you most need information about overloads and performance problems is exactly when you have the fewest resources to spend computing it.
- checksumming: Compute some value that will tell you whether the value you computed it from is correct.
- voting: Given redundant results, decide which one is correct.
- routine maintenance, exercises and audits: During idle time run maintenance tasks, such as garbage collection or correcting audits.
- riding over transience: Don't waste time dealing with problems that clear up automatically.
- leaky bucket counter: A classic technique for implementing a counter to determine whether a problem is happening often enough that it should be treated, rather than so seldom that you can ignore it.
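To make the leaky bucket counter concrete, here is a minimal Python sketch; the threshold and leak rate are illustrative values I picked, not ones from the book:

```python
class LeakyBucketCounter:
    """Leaky bucket counter: errors fill the bucket, time drains it.
    Only errors arriving faster than the leak rate trip the alarm."""

    def __init__(self, threshold: int, leak_per_tick: int = 1):
        self.count = 0
        self.threshold = threshold
        self.leak = leak_per_tick

    def record_error(self) -> bool:
        """Count one error; return True when the problem needs treatment."""
        self.count += 1
        return self.count >= self.threshold

    def tick(self):
        """Called periodically (e.g. every second) to drain the bucket."""
        self.count = max(0, self.count - self.leak)

bucket = LeakyBucketCounter(threshold=3)
assert not bucket.record_error()  # one transient error: ignore it
bucket.tick()                     # it leaks away before the next one
assert not bucket.record_error()
assert not bucket.record_error()
assert bucket.record_error()      # three errors within one tick: act
```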
part 2 - 12:23 error recovery patterns
These patterns provide means to resume normal operation after an error by changing the internal state of the system.
- quarantine: Isolate the error quickly so that it does not propagate.
- concentrated recovery: Dedicate all the resources you put towards solving the problem exclusively to recovery, to make it as fast as possible.
- error handler: Concentrate the error-handling code in one place. This increases complexity slightly, but in return you have one place to implement the strategy.
- restart: Typically at the end of a sequence of escalation the only possibility left is to restart the system.
- roll back: When encountering an error, roll back and retry.
- roll forward: When encountering an error skip the task and go on with the next one.
- return to reference point: Getting the system synchronized when an error occurs during an audit. (I am not sure if I got that right)
- limit retries: The "someone in charge" puts a limit on how many times to retry and escalates if needed.
- fail over: If one component stops working fail over to another one of a redundant set.
- checkpoint: How storing information can help you recover faster.
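Checkpoint and roll back combine naturally; a small sketch, where the worker class and its state are hypothetical and purely for illustration:

```python
import copy

class CheckpointedWorker:
    """Sketch of checkpoint + roll back: save state at known-good
    points, restore the last checkpoint when a step goes wrong."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Record the current state as the point to return to."""
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        """Discard the current state and return to the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

worker = CheckpointedWorker({"processed": 0})
worker.state["processed"] = 10
worker.checkpoint()               # known-good point
worker.state["processed"] = 999   # a step corrupts the state...
worker.roll_back()                # ...so return to the checkpoint
print(worker.state["processed"])  # 10
```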
part 2 - 23:07 error mitigation patterns
These patterns deal with fixing the fault or error in place.
- overload toolboxes: If all you have is a hammer, everything looks like a nail. Therefore have different techniques, and thus different tools, for different kinds of problems.
- deferrable work: If the system is in overload, it must already be working pretty well if there are no other errors, so defer some of the work that would check whether it is working ok.
- reassess overload decision: Have the system periodically check that it is applying the correct techniques, because in the meantime more information might have become available.
- queue for resources: Queue requests during spikes to smooth the arrival rate to the point where the process can handle them more efficiently.
- shed load: If you get too much work, you have to discard some requests. Do this efficiently, because it happens precisely when you have few resources.
- fresh work before stale: If things do not happen fast enough, humans tend to resubmit their requests. So handle the most recent requests first and detect the older duplicates later.
- finish work in progress: Defer new requests instead of aborting older ones which already consumed resources. (I am not sure if I got that right)
- marked data: Mark incorrect data so that other parts of the system don't use it, like IEEE NaN for floating point values.
- error correction code: Give some redundant information so that you can reconstruct the correct value.
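Queue for resources and shed load fit together in one small sketch; the capacity here is an arbitrary illustrative value, and the class is my own, not an implementation from the book:

```python
from collections import deque

class BoundedWorkQueue:
    """Sketch of queue-for-resources plus shed-load: buffer requests
    up to a limit, then discard new ones cheaply instead of letting
    the overload take the whole system down."""

    def __init__(self, capacity: int):
        self.queue = deque()
        self.capacity = capacity
        self.shed = 0  # count of discarded requests, for the fault observer

    def submit(self, request) -> bool:
        """Accept the request if there is room; otherwise shed it."""
        if len(self.queue) >= self.capacity:
            self.shed += 1  # shedding is a cheap O(1) rejection
            return False
        self.queue.append(request)
        return True

q = BoundedWorkQueue(capacity=2)
print([q.submit(r) for r in ("a", "b", "c")])  # [True, True, False]
print(q.shed)                                  # 1
```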
part 2 - 32:55 fault treatment
These patterns are about treating the root cause of the fault after automatic recovery or mitigation.
- let sleeping dogs lie: Weigh the cost, the risk and the benefit of fixing a bug. If it is complicated to fix and does not cause a failure, you might be better off leaving it unfixed.
- reintegration: Have some predefined process or method for making fixes, for example test cases. Experience has shown that failures in systems are related to either the hardware, the software or the procedure. This pattern reduces failures of the latter kind.
- reproducible error: Try to really nail the problem down by reproducing the error before fixing something in a hurry.
- small patches: Smaller patches leave fewer hiding places for other faults, avoiding the fix-on-fix problem.
- root cause analysis: Keep asking "why did this happen?" until you think you have found the real root cause of the fault and the best level at which to fix the problem.
- revise procedure: Feel free to fix procedures and methods as well, to make them more robust.
part 2 - 39:48 epilogue
- How to work with patterns.
- Why Robert wrote the book.