Last update PvD
Redundant Systems
Redundant systems are supposed to keep running even when a major failure occurs. They do so by having extra components in the system, that is, components which would be superfluous (functionally redundant) were it not for fault tolerance. There are several ways to achieve this, providing fault tolerance in varying degrees. Major characteristics of redundant configurations are:
Methods to replace a failing unit or function are:
Note that
The main ways to couple functionally identical units –the processing relationship– are:
You also have to take the complexity of the (management) software into account: redundancy may have made the hardware more reliable, but the added complexity may have made the software less reliable (it will take longer to find enough bugs for software reliability to exceed hardware reliability). Software complexity increases with the coupling: development/acquisition costs will be higher for tightly coupled components than for loosely coupled components.
From the point of view of a computer, there are only 3 important numbers: 0 (no other system), 1 (a single other system, don't have to think which one), and n (there are multiple other systems). From the architectural point of view this means 1 unit, 2 units, or n units.
Popular configurations with respect to redundancy are:
Note that one can apply a mix of the above configurations. It is also possible to apply distinct replacement strategies at various levels. Typically, in a load-sharing strategy, one can have an additional (hot) spare (dimension 2n+1), and even a (cold) spare for various other units (dimension n+1). However, limit yourself to one or two redundancy mechanisms to restrict software complexity.
Example: the space shuttle contains 5 identical computers: 3 have identical software and are operated in a majority vote configuration; the fourth computer runs distinct software and checks whether the answers from the majority vote are within reasonable bounds (software fall-back). The fifth computer is a hardware spare (cold standby).
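The voting step in the example above can be sketched in a few lines. This is a minimal illustration, not the actual flight logic: the function names `majority_vote` and `within_bounds`, the sample answers and the bounds are all assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the answer given by a strict majority of units, or None if there is none."""
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    return value if count > len(answers) // 2 else None


def within_bounds(value: float, low: float, high: float) -> bool:
    """Plausibility check, standing in for the independently written checking software."""
    return low <= value <= high


# Three units run identical software; one of them delivers a deviating answer.
answers = [42.0, 42.0, 17.5]
voted = majority_vote(answers)

# The voted answer is only accepted if it also passes the (hypothetical) bounds check.
accepted = voted is not None and within_bounds(voted, 0.0, 100.0)
```

With three units a single deviating answer is outvoted. If all three disagree, `majority_vote` returns `None` and the caller has to fall back on some other strategy.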
Also note that master/slave or front-end/back-end configurations are not redundant configurations (on the contrary).
Replacement | Coupling: uncoupled | Coupling: loosely | Coupling: tightly |
---|---|---|---|
None | Cold stand-by | – | – |
Hard swap | Cold stand-by | – | – |
Stand-by | Cold stand-by | Warm stand-by, Buddies | Hot stand-by, Buddies, Majority voting |
Redistribution | Cold stand-by, Load sharing | Warm stand-by, Buddies, Load sharing | Hot stand-by, Buddies |
Degrade | Fall-back | Fall-back | – |
The cost/performance ratio for the various configurations can only be indicated in a relative way, as the number of units and the costs per unit are unknown. So it is more or less an indicator of efficiency: the effective processing power of the configuration, expressed in the power of a single unit, divided by (the costs for) the total number of units in the configuration. The costs for other modules (e.g. management) are ignored, as is software development (which may well be the major cost factor).
Configuration | Efficiency | Redundancy |
---|---|---|
Fall-back | 1 / 1 | minor |
Cold standby | N / (N+1) | reasonable |
Hot standby | N / 2N | good |
Majority voting | N / 3N | extremely good |
Load sharing | N / (N+1) | very good |
Buddies | 2N / 2N | very good |
The performance efficiency is not simply inversely proportional to the redundancy.
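The efficiency column can be reproduced numerically. The sketch below takes the formulas directly from the table; the function name `efficiency` and the choice of N = 4 are assumptions made for illustration.

```python
from fractions import Fraction


def efficiency(configuration: str, n: int) -> Fraction:
    """Effective processing power (in single-unit equivalents) divided by total units."""
    table = {
        "fall-back": Fraction(1, 1),
        "cold standby": Fraction(n, n + 1),
        "hot standby": Fraction(n, 2 * n),
        "majority voting": Fraction(n, 3 * n),
        "load sharing": Fraction(n, n + 1),
        "buddies": Fraction(2 * n, 2 * n),
    }
    return table[configuration]


# Print the efficiency of a few configurations for N = 4 working units.
for name in ("cold standby", "hot standby", "majority voting", "load sharing"):
    print(f"{name:16s} {float(efficiency(name, 4)):.2f}")
```

For N = 4 this gives 0.80 for cold standby and load sharing, 0.50 for hot standby and 0.33 for majority voting. Cold standby and load sharing have the same efficiency but a different redundancy rating in the table, which is one way to see that the two columns are not simply inversely related.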
The appropriate configuration depends heavily on the characteristics of the service the system is supposed to deliver (and of course on the required reliability). When a service request is context free (i.e. any unit may handle the request), load can be distributed in a straightforward way (e.g. an n+1 cluster configuration). Requests are preferably distributed in a cyclic way; when a previously failed request is re-issued, it will automatically be assigned to some other unit. The failing unit has to be flagged for test/repair.
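The cyclic distribution of context-free requests can be sketched as follows. This is a simplified illustration under stated assumptions: the class name `CyclicDispatcher`, the `send` callback and the use of `ConnectionError` to signal a failing unit are all hypothetical, and a real system would probably not flag a unit on a single failed request.

```python
from typing import Callable, List, Set


class CyclicDispatcher:
    """Hands context-free requests to units in turn, skipping units flagged as failed."""

    def __init__(self, units: List[str]) -> None:
        self.units = list(units)
        self.flagged: Set[str] = set()  # units awaiting test/repair
        self._next = 0

    def _pick(self) -> str:
        # Walk the cycle at most once, looking for a unit that is not flagged.
        for _ in range(len(self.units)):
            unit = self.units[self._next]
            self._next = (self._next + 1) % len(self.units)
            if unit not in self.flagged:
                return unit
        raise RuntimeError("no healthy unit left")

    def handle(self, request: str, send: Callable[[str, str], str]) -> str:
        """Issue the request; on failure flag the unit and re-issue to the next one."""
        while True:
            unit = self._pick()
            try:
                return send(unit, request)
            except ConnectionError:
                self.flagged.add(unit)


def send(unit: str, request: str) -> str:
    """Stand-in transport for the example: unit B is broken."""
    if unit == "B":
        raise ConnectionError(unit)
    return f"{request} handled by {unit}"


dispatcher = CyclicDispatcher(["A", "B", "C"])
results = [dispatcher.handle(f"req{i}", send) for i in range(4)]
```

Because the cycle simply moves on, the re-issued request lands on another unit without any special retry logic. In the example the second request fails on B, is re-issued to C, and B stays in `dispatcher.flagged` until it is repaired.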
However, when there is a context (e.g. relevant data not available everywhere), the picture is more complex. In the extreme case that the context is divided into n subcontexts (e.g. transactions on a partitioned database), a request can only be handled by a single unit (1 out of n). When immediate backup must be available (hot standby), this leads to a 2n configuration.
When requests have a context, you may split the context into many small subcontexts and have multiple subcontexts per unit, and each subcontext in multiple units. This allows a request to be handled by m units (say 2 or 3) out of n. However, transaction assignment will be elaborate and the distribution of the subcontexts to ensure good dynamic load balance is not simple.
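A small sketch may make the m-out-of-n idea concrete. It assumes a simple rotating placement and a least-loaded assignment rule; both are illustrative choices, not the only (or necessarily a good) way to balance load, and the names `place_subcontexts` and `assign` are hypothetical.

```python
from typing import Dict, List


def place_subcontexts(num_subcontexts: int, units: List[str], copies: int) -> Dict[int, List[str]]:
    """Place each subcontext on `copies` distinct units, rotating the starting unit."""
    if copies > len(units):
        raise ValueError("cannot place more copies than there are units")
    return {
        s: [units[(s + k) % len(units)] for k in range(copies)]
        for s in range(num_subcontexts)
    }


def assign(subcontext: int, placement: Dict[int, List[str]], load: Dict[str, int]) -> str:
    """Send the request to the least loaded unit that holds the subcontext."""
    unit = min(placement[subcontext], key=lambda u: load[u])
    load[unit] += 1
    return unit


# 8 subcontexts spread over 4 units, each subcontext held by m = 2 units.
units = ["U0", "U1", "U2", "U3"]
placement = place_subcontexts(8, units, copies=2)
load = {u: 0 for u in units}

# Requests for subcontexts 0, 0, 1 and 5 arrive in that order.
chosen = [assign(s, placement, load) for s in (0, 0, 1, 5)]
```

Even in this toy version the assignment needs a placement table and per-unit load figures, which hints at why transaction assignment becomes elaborate once subcontexts are moved or units fail.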
See Centralisation versus Distribution below.
A redundant system does require some kind of management ('system defense', Fault Management) over all components. The general strategy to survive is:
The main issue for a redundant system is a so-called Single Point of Failure (SPoF): some component or service which is not redundant, so when it fails the whole system fails. Often it is a minor component or service which has been overlooked as being critical. In particular, low-level infrastructure (utilities: power, cooling, data bus, …) is prone to such mistakes. Be sure to investigate the redundancy of all services/facilities required for the system (or accept the risk if it is too expensive to solve).
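A first, mechanical check for single points of failure is to list every service the system depends on together with the independent units that provide it. The sketch below assumes such an inventory exists; the function name and the inventory contents are hypothetical.

```python
from typing import Dict, List


def single_points_of_failure(providers: Dict[str, List[str]]) -> List[str]:
    """Return the services backed by fewer than two distinct providers."""
    return sorted(s for s, p in providers.items() if len(set(p)) < 2)


# Hypothetical inventory: service -> the units that provide it.
inventory = {
    "compute": ["node-a", "node-b"],
    "storage": ["array-1", "array-2"],
    "power": ["feed-1"],
    "cooling": ["chiller-1", "chiller-1"],  # listed twice, but really one unit
}
spofs = single_points_of_failure(inventory)
```

Such a check only finds what is in the inventory. The overlooked minor component is, by definition, the one nobody listed.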
In organisations a single specialist can be the vital vulnerability (he may quit, get ill, …).
If you are relying on physical separation between two parts of a redundant system to avoid common causes of failure, be sure that there is a considerable geographic distance between the two sites. Otherwise both sites share common risks like power failures, flooding, storms, riots, etc. And don't think that you won't get any flooding on a top floor; a leaky roof or burst water pipe is sufficient. See also Risk Management.
Redundancy doesn't come cheap; it requires extra hardware and much more complex software. So if you don't really need it, avoid it.
Therefore the first question concerns the availability requirements for the system in the client's application: what are the consequences of failure? Usually there are strict availability requirements for only some essential functions, so redundancy is needed for only a small number of vital components. Is redundancy useful in this system, or are there more vulnerable common parts outside the system (typically power, environmental)? It is extremely difficult to avoid all single points of failure! There is a lot of reliability to be gained by methods other than replication.
The next question is how to achieve sufficient availability in your design.
What is the estimated availability of a non-redundant solution? (System availability should be part of any design and carefully controlled; careless design modifications may have significant impact.)
Perform a simple system availability calculation. Carefully assess the (estimated) component failure rates and the failure interdependency, and calculate the availability. Apply limits to (sub)system availability. What if the component with the poorest availability figures is significantly improved?
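A simple availability calculation of this kind can be sketched as follows. It uses the standard series/parallel formulas and assumes that failures are independent, which is exactly the assumption that failure interdependency may break. The component figures are invented for illustration only.

```python
from math import prod
from typing import Iterable


def series(availabilities: Iterable[float]) -> float:
    """All components are needed: the availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities: Iterable[float]) -> float:
    """Any one component suffices: the group fails only if all fail (independence assumed)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical component availabilities.
power, cpu, disk = 0.999, 0.995, 0.99

# Non-redundant system: everything in series.
single = series([power, cpu, disk])

# What if only the weakest component (the disk) is duplicated?
duplicated_disk = series([power, cpu, parallel([disk, disk])])

# What if the whole cpu+disk chain is duplicated, but power stays shared?
duplicated_all_but_power = series([power, parallel([series([cpu, disk])] * 2)])
```

Under these assumed figures the non-redundant system comes out at about 0.984, duplicating only the disk gives about 0.994, and duplicating everything except power gives about 0.9988. The last result cannot exceed the availability of the shared power supply, however many units are added behind it.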
Note that
Are redundant units sufficiently separated (physical/geographical diversity of power, cables, equipment)? Duplicating a system at a single site won't protect against flood or fire destroying both copies.
Software which has not been in use for at least 2 years has a worse error rate than hardware, so do not introduce hardware redundancy at the expense of software complexity if you are in a hurry. The trade-off is the risk of lost business versus acquisition and operational costs (hardware, software development). Assuming that the system is well designed and built, robustness against a single point of failure will substantially improve reliability.
Why redundancy fails (likely order):
Redundancy is implicitly also a distribution issue (there is a striking parallel with Centralised versus Distributed organisations). This section compares distributed systems to a centralised one on some general characteristics/aspects:
Aspect | Central | Distributed |
---|---|---|
Acquisition costs | normal | more expensive |
Acquisition time | normal | longer due to complexity |
Operational costs | normal | (slightly) more expensive |
Performance | average | usually less |
Utilisation | normal | less |
Availability | average | very good |
Scalability | limited | much better |
Note that the above list is true in general, but your specific case may differ on some points.
Note that we talk about 'replication': multiple identical units. 'Duplication' is the common case, but replication with a high number of small units (i.e. the load sharing cluster) may be more effective.
Note that there are perfectly good reasons for (physically) distributing a system apart from redundancy; usually the reason is (total) costs. Example: a city does not have a single huge telephone exchange, but multiple medium-sized exchanges, because this reduces the subscriber line length (which is a major cost). Scalability can be a reason as well.
See also Central versus Distributed organisations/architectures.
A good alternative to a redundant system is often a 'robust' system: a system which can survive adverse conditions but is (essentially) not redundant. Use a reliable platform and spend effort on making the software more robust (see Design for Survivability).