Basic Characteristics

Redundant systems are meant to keep running when a major failure occurs by including extra components, that is, components which would be superfluous (functionally redundant) were it not for fault tolerance. There are several ways to achieve this, each providing a different degree of fault tolerance. The major characteristics of redundant configurations are:

replacement strategy: what is the response when a unit/function fails; and
coupling: what is the relationship between units.

Replacement Strategies

Methods to replace a failing unit or function are:

None: the function or capacity of the failing unit has become unavailable and will remain so (until it is repaired in due time).
Restart/Reload: if the failure is not hardware related but likely to be a software problem, a restart (software is kept but reinitialized) or reload (a fresh copy of the software is loaded, the component is rebooted) may solve the problem. Some systems may have several restart capabilities to try to preserve ongoing transactions. In general, restart/reload is the first action to try.
Hard swap: a failing unit will be (manually) reassigned or replaced by a spare unit in a relatively short time (which is not always achievable). Reactivation will be significantly delayed.
Standby: an idle spare unit is available to take over from a failing active unit (potentially for several units).
Redistribution: redistribute the load to other active and fully functional units (reconfigure & retry).
Degrade: fall back to a degraded function (a unit with lower performance and/or quality).

Note that

A basic assumption is that we are considering real-time systems (otherwise there would be ample time to manually replace or repair units except for systems which are out of reach such as satellites). So the response time to a failure is considered relevant here.
The Replacement Strategy has consequences for system management: when a unit has to be replaced because a part on that unit has failed, all other functions which are still operating on that unit will temporarily be taken out of service as well.
Any (persistent) failure will increase system vulnerability until the failure is corrected (repaired).

Coupling

The main ways to couple functionally identical units –the processing relationship– are:

Uncoupled: the units work independently (considering their main function); coupling is non-existent or indirect (via another component, e.g. a management function).
Failure of a single unit will lead to loss of service (data), but have no consequences for the other units. A requestor (or management function) may restart the request by re-issuing it to another unit. Response will be significantly delayed.
Such a configuration offers good performance (negligible overhead); the performance will decrease with each failure.
Loosely coupled: units work independently but inform each other (on the major steps) during the processing; they exchange limited data.
Failure of a single unit will only lead to a partial loss of service (data), and minor delay. Activation is commonly done by some management function (status unit); take-over delays are critical.
Performance is usually acceptable (significant overhead); it will decrease with each failure.
Tightly coupled: multiple units process the same data in parallel, nearly or fully synchronized. Output of the units can be applied (active), ignored (stand-by), compared, or used in majority vote.
Failure of a unit does not cause any loss of service (data) or delay, only increased vulnerability. Activation of take-over should be instantaneous and automatic.
Such a configuration offers the performance of a single unit only (not impacted by a failure, but never very efficient).
Note that with synchronous software processing, problem conditions will be identical; that is, on a bug the units will crash in perfect synchrony. With intermittent problems the behavior is undefined.

You also have to take the complexity of (management) software into account: redundancy may make the hardware more reliable, but –due to the added complexity– the software may become less reliable (it will take longer to find all bugs before software reliability gets better than hardware reliability). The (software) complexity increases with the coupling: development/acquisition costs will be higher for tightly coupled components than for loosely coupled components.

From the point of view of a computer, there are only 3 important numbers: 0 (no other system), 1 (a single other system, don't have to think which one), and n (there are multiple other systems). From the architectural point of view this means 1 unit, 2 units, or n units.

Common Configurations

Popular configurations with respect to redundancy are:

Fall-back (degrade): relatively simple and cheap; suited to cases where it is not sensible (e.g. too costly) to fully replicate units. If software is the vulnerable spot, this is a very good solution (the degraded function may use other software).
Typically, if a subsystem doesn't get any response from its management system, it has to fall back to managing itself for the time being.
Cold standby: the spare unit (uncoupled) is either idle or performing some low priority function. It has to be reconfigured/loaded to fulfill the failing function. The spare unit may act as spare for many others with distinct functions. Dimension (costs): one extra spare unit (n+1 units, or preferably ⌈n+10%⌉ units).
Warm standby: the spare unit is already loaded with the right software but is not kept updated on service progress (i.e. uncoupled or loosely coupled). Dimension (costs): one extra spare unit per functional type (n+1 units). The characteristics of this configuration are very similar to Cold standby, so this case is not discussed any further.
Hot standby: each unit has a tightly coupled spare unit ready and waiting to take over (i.e. an active/standby-pair). Dimension as 2n units.
Majority voting: three tightly coupled units (commonly synchronised) where results are subjected to majority voting. Only applied for mission-critical parts (there is a basic problem if 1 unit fails and the 2 remaining units disagree). Dimension as 3n units.
Load sharing cluster: multiple uncoupled or loosely coupled units. Load is distributed over many units; when a unit fails the workload is distributed over the remaining units. Such a configuration offers good performance (acceptable overhead); the performance will decrease with each failure. Management however is not trivial (it can be a single point of failure); delays are critical. The nice thing about this configuration is its scalability (with potentially decreasing effectiveness due to overhead). Dimension as n+1, or (e.g.) ⌈n+10%⌉ units.
Buddies: pairs of loosely or tightly coupled units using mutual stand-by or load redistribution (can be considered a special mix of Stand-by & Load sharing). Specifically suited when a unit(-pair) controls other hardware. Scalability and Performance are also good. Dimension as n units (n even).
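
As an illustration of the majority voting configuration listed above, here is a minimal sketch of a voter. The function name `majority_vote` and the choice to return `None` when no strict majority exists are assumptions made for this example; a real system would vote in hardware or in a dedicated management function.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of units.

    `outputs` holds one (hashable) result per unit that is still in service.
    Returns None when no value has more than half of the votes
    (hypothetical convention for this sketch).
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None
```

With three units, `majority_vote([7, 7, 9])` masks the single deviating unit. Once one unit has been taken out of service and the two remaining units disagree, as in `majority_vote([7, 9])`, the voter cannot decide and returns `None`, which is the basic problem mentioned in the list.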

Note that one can apply a mix of the above configurations. It is also possible to apply distinct replacement strategies at various levels. Typically, in a load-sharing strategy, one can have an additional (hot) spare (dimension 2n+1), and even a (cold) spare for various other units (dimension n+1). However, limit yourself to one or two redundancy mechanisms to restrict software complexity.
Example: the Space Shuttle carried 5 identical general-purpose computers: during critical flight phases 4 of them ran identical software as a redundant set whose outputs were compared and voted on; the fifth computer ran independently developed software (the Backup Flight System) that could take over if the primary software failed (software fall-back).

Also note that master/slave or front-end/back-end configurations are not redundant configurations (on the contrary).

Overview

Replacement	Uncoupled	Loosely coupled	Tightly coupled
None	Cold stand-by	–	–
Hard swap	Cold stand-by	–	–
Stand-by	Cold stand-by	Warm stand-by, Buddies	Hot stand-by, Buddies, Majority voting
Redistribution	Cold stand-by, Load sharing	Warm stand-by, Buddies, Load sharing	Hot stand-by, Buddies
Degrade	Fall-back	Fall-back	–

Performance

The cost/performance ratio for the various configurations can only be indicated in a relative way, as the number of units and the costs per unit are unknown. It is therefore more or less an indicator of efficiency: the effective processing power of the configuration, expressed in the power of a single unit, divided by (the costs for) the total number of units in the configuration. The costs of other modules (e.g. management) are ignored, and so is software development (which may well be the major cost factor).

Configuration	Efficiency	Redundancy
Fall-back	1 / 1	minor
Cold standby	N / (N+1)	reasonable
Hot standby	N / 2N	good
Majority voting	N / 3N	extremely good
Load sharing	N / (N+1)	very good
Buddies	2N / 2N	very good

The performance efficiency is not simply inversely proportional to the redundancy.
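
The efficiency column can be evaluated for a concrete N with a few lines of code. This is only a sketch of the table's ratios: the function names `efficiency` and `units_with_spares` and the lower-case configuration keys are assumptions for this example, and the 10% rule is taken from the ⌈n+10%⌉ dimensioning mentioned under Common Configurations.

```python
from fractions import Fraction


def efficiency(configuration, n):
    """Effective units divided by total units, as in the table above."""
    effective, total = {
        "fall-back": (1, 1),
        "cold standby": (n, n + 1),
        "hot standby": (n, 2 * n),
        "majority voting": (n, 3 * n),
        "load sharing": (n, n + 1),
        "buddies": (2 * n, 2 * n),
    }[configuration]
    return Fraction(effective, total)


def units_with_spares(n):
    """Total units for n working units plus 10% spares, rounded up.

    Integer arithmetic avoids floating-point surprises such as
    10 * 1.1 being slightly larger than 11.
    """
    return n + -(-n // 10)  # n + ceil(n / 10)
```

For N = 4, cold standby and load sharing both give 4/5, hot standby 1/2 and majority voting 1/3, while the redundancy ratings in the table do not follow the same order.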

Load Distribution & Context

The appropriate configuration depends heavily on the characteristics of the service the system is supposed to deliver (and of course the required reliability). When a service request is context free (i.e. any unit may handle the request), load can be distributed in a straightforward way (e.g. an n+1 cluster configuration). Requests are preferably distributed in a cyclic way; when a previously failed request is re-issued, it will automatically be assigned to some other unit. The failing unit has to be flagged for test/repair.
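
The cyclic distribution of context-free requests can be sketched as follows. The class name `CyclicDispatcher`, the representation of a unit as a Python callable, and the use of an exception to signal a unit failure are all assumptions made for this example.

```python
class CyclicDispatcher:
    """Distribute context-free requests over units in a cyclic way."""

    def __init__(self, units):
        self.units = list(units)      # each unit is a callable taking a request
        self.out_of_service = set()   # indices flagged for test/repair
        self.next_index = 0

    def dispatch(self, request):
        # Try each unit at most once, starting where the cycle left off.
        for _ in range(len(self.units)):
            index = self.next_index
            self.next_index = (self.next_index + 1) % len(self.units)
            if index in self.out_of_service:
                continue
            try:
                return self.units[index](request)
            except Exception:
                # Flag the failing unit; the request is re-issued to the
                # next unit in the cycle on the following loop iteration.
                self.out_of_service.add(index)
        raise RuntimeError("no unit available to handle the request")
```

Because the cycle has already advanced when a unit fails, the re-issued request lands on a different unit without any extra bookkeeping. Returning a repaired unit to service would simply remove its index from `out_of_service`.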

However, when there is a context (e.g. relevant data not available everywhere), the picture is more complex. In the extreme case that the context is divided into n subcontexts (e.g. transactions on a partitioned database), a request can only be handled by a single unit (1 out of n). When immediate backup must be available (hot standby), this leads to a 2n configuration.
When requests have a context, you may split the context into many small subcontexts and have multiple subcontexts per unit, and each subcontext in multiple units. This allows a request to be handled by m units (say 2 or 3) out of n. However, transaction assignment will be elaborate and the distribution of the subcontexts to ensure good dynamic load balance is not simple.
See Centralisation versus Distribution below.
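
To make the idea of "each subcontext in multiple units" concrete, here is a minimal sketch of one possible placement. The rotating assignment used here is just one simple choice assumed for this example (it ignores load balance, which the text notes is the hard part), and the names `place_subcontexts` and `candidates` are hypothetical.

```python
def place_subcontexts(num_subcontexts, num_units, copies):
    """Map each subcontext to `copies` distinct units out of `num_units`.

    Subcontext s is placed on units s, s+1, ... (modulo num_units).
    """
    if not 1 <= copies <= num_units:
        raise ValueError("copies must be between 1 and num_units")
    return {
        s: [(s + k) % num_units for k in range(copies)]
        for s in range(num_subcontexts)
    }


def candidates(placement, subcontext, failed=()):
    """Units that can still handle a request for the given subcontext."""
    return [unit for unit in placement[subcontext] if unit not in failed]
```

With 6 subcontexts on 3 units and 2 copies each, every request can be handled by 2 out of 3 units, and any single unit failure still leaves one candidate for every subcontext.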

System Management

A redundant system requires some kind of management ('system defense', Fault Management) over all components. The general strategy to survive is:

Detect problems as soon as possible;
Confine the consequences;
Identify the problem component (diagnose);
Take component out of service;
Redistribute workload;
Repair faulty component (replace, reload);
Take component into service (after successful test), and redistribute load.
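
One way to make the order of these steps explicit is a small state table per component. The states and event names below are hypothetical and chosen only for this sketch; the point is that a component can return to service only via repair and a successful test.

```python
from enum import Enum


class State(Enum):
    IN_SERVICE = "in service"
    SUSPECT = "suspect"                # problem detected, being diagnosed
    OUT_OF_SERVICE = "out of service"  # workload has been redistributed
    REPAIRED = "repaired"              # replaced or reloaded, awaiting test


# (current state, event) -> next state; anything else is rejected.
TRANSITIONS = {
    (State.IN_SERVICE, "problem_detected"): State.SUSPECT,
    (State.SUSPECT, "diagnosed_healthy"): State.IN_SERVICE,
    (State.SUSPECT, "diagnosed_faulty"): State.OUT_OF_SERVICE,
    (State.OUT_OF_SERVICE, "repaired"): State.REPAIRED,
    (State.REPAIRED, "test_passed"): State.IN_SERVICE,
    (State.REPAIRED, "test_failed"): State.OUT_OF_SERVICE,
}


def next_state(state, event):
    """Return the state after `event`, or raise ValueError if not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state.name}") from None
```

A management function would redistribute the workload whenever a component enters or leaves `OUT_OF_SERVICE`.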

Pitfalls

The main issue for a redundant system is a so-called Single Point of Failure (SPoF): some component or service which is not redundant, so when it fails the whole system fails. Often it is a minor component or service which has been overlooked as being critical. In particular low-level infrastructure (utilities: power, cooling, data bus, …) is prone to such mistakes. Be sure to investigate the redundancy of all services/facilities required for the system (or accept the risk if it is too expensive to solve).
In organisations a single specialist can be the vital vulnerability (they may quit, fall ill, …).

If you are relying on physical separation between two parts of a redundant system to avoid common causes of failure, be sure that there is a considerable geographic distance between the two sites. Otherwise both sites share common risks like power failures, flooding, storms, riots, etc. And don't think that you won't get any flooding on a top floor; a leaky roof or burst water pipe is sufficient. See also Risk Management.

Considerations

Redundancy doesn't come cheap; it requires extra hardware and much more complex software. So if you don't really need it, avoid it.

Therefore the first question concerns the availability requirements for the system in the client's application: what are the consequences of failure ? Usually there are strict availability requirements only for some essential functions, so redundancy is needed only for a small number of vital components. Is redundancy useful in this system, or are there more vulnerable common parts outside the system (typically power, environmental) ? It is extremely difficult to avoid all single points of failure ! There is a lot of reliability to be gained by methods other than replication.

The next question is how to achieve sufficient availability in your design. What is the estimated availability of a non-redundant solution (system availability should be part of any design and carefully controlled; careless design modifications may have significant impact) ?
Perform a simple system availability calculation. Carefully assess component failure rates (estimates) and failure interdependency, and calculate the availability. Apply limits to (sub)system availability. What if the component with the poorest availability figures is significantly improved ?
Note that

availability figures for new equipment have to be calculated according to worst-case theoretical figures, and are much worse than later field experience will show.
in a proper redundant design, the availability calculation normally only has to consider the single points of failure.
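
A simple availability calculation of the kind described above can be sketched as follows. The formulas are the standard ones for steady-state availability and assume that failures are independent, which is exactly the interdependency assumption that has to be checked in a real design; the function names and the example figures are hypothetical.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*parts):
    """All parts are needed: the availabilities multiply."""
    return math.prod(parts)


def parallel(*parts):
    """Any one part suffices (independent failures assumed):
    the system is down only when every part is down."""
    return 1.0 - math.prod(1.0 - a for a in parts)


# Hypothetical figures: two redundant units behind one shared power supply.
unit = availability(mtbf_hours=990, mttr_hours=10)  # 0.99
pair = parallel(unit, unit)                          # about 0.9999
system = series(pair, 0.999)                         # about 0.9989
```

In this example the duplicated pair is far more available than a single unit, but the overall figure is dominated by the non-redundant power supply, which illustrates why the calculation for a redundant design mainly has to consider the single points of failure.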

Are redundant units sufficiently separated (physical/geographical diversity of power, cables, equipment) ? Duplicating a system at a single site won't protect against a flood or fire destroying both copies.

Software which has not been in use for at least 2 years has a worse error rate than hardware, so do not introduce hardware redundancy at the expense of software complexity if you are in a hurry. The trade-off is the risk of lost business versus acquisition & operational costs (hardware, software development). Assuming that the system is well designed and built, robustness against a single point of failure will substantially improve reliability.

Why redundancy fails (likely order):

No management (no disaster recovery plan, no training);
Insufficient maintenance (major unit not yet replaced and standby unit now failing; all spares used);
Single point of failure in the system (bus, …) or infrastructure (power supply, …);
Reduced availability (instead of enhanced) due to complexity;
Insufficient spare capacity, lengthy take-over procedure;
Insufficient physical separation (the same earthquake/storm/flood/fire/riot/power-cut takes out all replicated units).

Centralisation versus Distribution

Redundancy is implicitly also a distribution issue (there is a striking parallel with Centralised versus Distributed organisations). This section compares distributed systems to a centralized one on some general characteristics/aspects:

Aspect	Central	Distributed
Acquisition costs¹	normal	more expensive
Acquisition time	normal	longer due to complexity
Operational costs²	normal	(slightly) more expensive
Performance³	average	usually less
Utilisation⁴	normal	less
Availability⁵	average	very good
Scalability	limited	much better

Notes:

¹ More hardware is required, but individual components (developed or bought in) should be cheaper. Software –in particular system management– will be more complex and therefore likely more expensive and initially error-prone.
² Operations & Maintenance is probably more expensive (more hardware, potentially geographically dispersed; however, due to redundancy, maintenance/replacement can be delayed).
Data management may be more complex (concurrent updates, consistency).
³ Price/performance ratio is in general less favourable for a distributed solution. Grosch's law states that a performance increase by a factor n can be achieved at √n times the cost. This holds for most systems in the performance range, except for high-end and low-end systems (e.g. based on high-volume & cheap microprocessors).
Overhead in a distributed solution may seem more than in a centralised system, but that is theoretical; in practice it is often the reverse (in particular with organisations).
⁴ Utilisation will be less in a distributed solution: idle resources (e.g. processing power) cannot be redistributed.
In a centralised solution, specialisation (of equipment or functions) may pay off; in a distributed solution these would be underutilised (i.e. not cost-effective).
⁵ Overall system availability in a distributed system should be much better (in a proper design), but the equipment failure rate will increase (more components). Initial software reliability will be poor.

Note that the above list is true in general, but your specific case may differ on some points.

Note that we talk about 'replication': multiple identical units. 'Duplication' is the common case, but replication with a high number of small units (i.e. the load sharing cluster) may be more effective.

Note that there are perfectly good reasons for (physically) distributing a system apart from redundancy; usually it is (total) costs. Example: a city does not have a single huge telephone exchange, but multiple medium-sized exchanges: this reduces subscriber line length (which is a major cost). Scalability can be a reason as well.

A good alternative to a redundant system is often a 'robust' system: a system which can survive adverse conditions but is (essentially) not redundant. Use a reliable platform and spend effort on making the software more robust (see Design for Survivability).
