Fault Tolerance
Goals:
Learn about:
- What reliability and availability are
- How device faults can result in failures
- What we can do to make a computer keep working even when parts of it have failed
Dependability: The quality of delivered service that justifies relying on the system to provide that service.
Specified Service: What the behavior of the system should look like.
Delivered Service: The actual behavior, that is, the behavior we actually get out of the system.
System has components (modules): Each module has an ideal specified behavior that we expect from it, but real modules may not always exhibit this ideal behavior.
So when we talk about things that make the system not dependable, we're really talking about modules deviating from their specified behavior and causing the system to deviate, so that the delivered service no longer matches the specified service.
Faults, Errors and Failures
When we say that something deviates from specified behavior, we're really talking about three different things: 1) faults, 2) errors, and 3) failures.
Fault: A fault is when something in the system deviates from specified behavior.
Error: An error is when the actual behavior somewhere within the system differs from the specified behavior, the behavior that should be happening within the system.
Failure: A failure occurs when the system as a whole deviates from the behavior specified for that system.
Example:
Things always start with a fault. Our example fault is a programming mistake in an Add function that we wrote for our program. It works fine in all cases except when we ask it to add 5 and 3, in which case it returns 7 instead of 8. This type of fault is also called a latent error. It is not really an error until we actually compute 5 + 3 = 7 (should be 8), but it is called a latent error because it is only a matter of time until it is activated. Once we actually execute 5 + 3 and get 7 in some register, we have an error.
If the error is the result of a latent error like this one, basically a programming error, we say that the fault has been activated, or that we now have an effective error as opposed to a latent one. In our case we get an effective error once we call the Add function with 5 and 3, get 7 instead of 8, and put that value in some variable. We get a failure when the system deviates from its specified behavior. For example, the value we were computing might be the time to schedule a meeting for, and now we schedule the meeting for 7am instead of 8am as expected. This is a failure of the system, because it did not schedule the meeting for when it was supposed to. It is important to note that you need a fault of some sort in order to get an error, but not every fault becomes an error.
For a fault of this type to become an error, it needs to be activated. We need to actually use the function in a way that makes it produce an incorrect result, even though the fault was always there. Similarly, we can have an error and never get a failure. For example, if the value 7 is never used, and programs often do this, then we have an error, because a variable has the wrong value, but we don't get a failure as a result.
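The chain from fault to error to failure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names faulty_add and schedule_meeting are hypothetical, and the bug is hard-coded to mirror the 5 + 3 = 7 example above.

```python
def faulty_add(a, b):
    """Hypothetical Add function containing a programming mistake (the fault)."""
    if (a, b) == (5, 3):
        return 7  # wrong on purpose: the specified result is 8
    return a + b


def schedule_meeting(start_hour, offset_hours):
    """Hypothetical delivered service: report the hour the meeting is booked for."""
    hour = faulty_add(start_hour, offset_hours)  # wrong value here is the error
    return f"Meeting at {hour}am"


# The fault is present but not activated, so it stays a latent error.
print(schedule_meeting(6, 3))  # Meeting at 9am

# The fault is activated: 'hour' holds 7 (effective error) and the system
# books the wrong slot, so delivered service differs from specified (failure).
print(schedule_meeting(5, 3))  # Meeting at 7am
```

The fault exists in both calls, but only the second one activates it and lets the wrong value reach the output.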
Another example would be something like this:
if(ADD(5,3) > 0)
If we check, for example, whether this function returns something greater than zero, then when it returns 7 instead of 8 and we store that value in a register and compare it to zero, the result is still greater than zero. We have an effective error, because a register held the wrong value, but the only thing we did with that value still let the program function normally. So in this case we have an error, but we don't have a failure.
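A minimal Python sketch of an error that never becomes a failure follows. The function names are hypothetical, and faulty_add again hard-codes the 5 + 3 = 7 mistake as an assumption for illustration.

```python
def faulty_add(a, b):
    """Hypothetical Add function that is wrong only for 5 + 3."""
    if (a, b) == (5, 3):
        return 7  # wrong on purpose: the specified result is 8
    return a + b


def is_positive_sum(a, b):
    """Uses the sum only in a comparison against zero."""
    total = faulty_add(a, b)  # for (5, 3) this holds 7: an effective error
    return total > 0          # 7 > 0 and 8 > 0 agree, so the error is masked


def ignores_sum(a, b):
    """Computes the sum but never reads it again."""
    unused = faulty_add(a, b)  # wrong value is stored, then dropped
    return "done"


print(is_positive_sum(5, 3))  # True, same as a correct Add would give
print(ignores_sum(5, 3))      # done
```

In both functions the wrong value 7 exists inside the program, yet the visible result matches what a correct Add would have produced.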
Quiz: Laptop Falls Down
Laptop
(1) falls out of my bag and
(2) hits the pavement. ==> Fault
Pavement
(3) develops a crack, then ==> First error
(4) the crack expands during winter, so the pavement
(5) breaks, and ==> Failure event
(6) needs to be replaced.
Remember:
Fault: a module deviates from specified behavior.
Error: actual behavior within the system deviates from specified behavior.
Failure: the system deviates from the specified behavior.
Reliability:
There are several other properties in fault tolerance in addition to dependability. One of them is reliability. Unlike dependability, which describes whether we can trust the system to perform its function, reliability is something that we can measure.
To measure reliability we consider