Last update   PvD

Risk Management

Risk management is about reducing or eliminating the negative impact of (potential) incidents.  It is not passively waiting for events to respond to, but proactive:  preparing and acting in advance.  The main axiom for risk management is be prepared !

In this document the overall Risk Management is split into 4 areas (each should result in a document):

  1. Risk Analysis: describing what the estimated risks and consequences are;
  2. Counter Strategy: describing how you plan to deal with risks;
  3. Incident Management Plan: describing how incidents will be handled;  and
  4. Disaster Recovery Plan: describing how to recover from (major) incidents.

Often a Risk Management Plan is used for Risk Analysis, Counter Strategy and Incident Management, but these are very dissimilar aspects of the same topic with a completely different focus:  Risk Analysis is a calm, academic-like exercise and Incident Management is a highly strenuous activity.
Sometimes a Contingency Plan or a Disaster Recovery Plan is used for both Incident Management and Disaster Recovery, but these two are also dissimilar activities:  fighting an incident, versus how to get back up and running again.  The time-scale and urgency of these activities are very different.  Here we treat the direct action of the Incident Management ('so much going on, no time to think') and the composed Disaster Recovery Plan ('everything takes so much time') as separate topics.


Risk Analysis

Identify and Analyze Risks

Analysis can best be done using the main aspects of any risk:

First of all one needs to identify and analyze the risks and their consequences.

Identify
To identify risks, start by making a context diagram of the subject, or at least a list of the components and/or services that are to be managed for risks and those that are not.  Get confirmation on the coverage (be aware of what is left out).
Failure Analysis
Carefully dissect the essential service(s) into the underlying equipment and subservices (e.g. infrastructure facilities, dependencies, human actions).  You can repeat this process for each of the components down to simple (physical and human) parts, and build a tree of failure causes:  the failure analysis tree.
This entails a lot of work and is not easy;  you need to do this only for vital (and potentially dangerous) systems.  If possible –again it won't be easy– you can attach probabilities to each branch;  even if the figures are rough estimates, it helps you to indicate priorities.
More likely you will use a simpler scheme and attach a likelihood such as very unlikely, unlikely, possible, likely or very likely.  Even with such a simple scheme it will not be easy to apply more or less objective criteria.
Consequence Analysis
For each of the main components under consideration, specify the consequences for the organisation when they fail.  Be sure to include the –less obvious– consequential failures (i.e. secondary failures caused by primary failures).  Use actual cases of (minor) failures –like a power failure– to obtain information on failure impact.
Here you will probably use a scheme to grade the consequences, such as limited, considerable, serious, very serious and catastrophic.
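The failure analysis tree with probabilities attached to its branches can be sketched in a few lines of Python.  This is only an illustration under stated assumptions:  the node names and figures are invented, the `Node` class and `failure_probability` function are hypothetical helpers, and failures are assumed to be independent of each other (which in practice they often are not).

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Node:
    """One node in a failure analysis tree (hypothetical helper)."""
    name: str
    gate: str = "LEAF"        # "LEAF", "OR" (any child fails) or "AND" (all children fail)
    probability: float = 0.0  # used for leaves only: a rough yearly estimate
    children: list = field(default_factory=list)


def failure_probability(node: Node) -> float:
    """Combine the estimates from the leaves upwards, assuming independent failures."""
    if node.gate == "LEAF":
        return node.probability
    child_probs = [failure_probability(child) for child in node.children]
    if node.gate == "AND":
        return prod(child_probs)
    if node.gate == "OR":
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate: {node.gate}")


# Invented figures:  power is lost when the mains AND the generator fail;
# the service is lost when power OR the network link fails.
power = Node("power", "AND", children=[
    Node("mains outage", probability=0.10),
    Node("generator fails to start", probability=0.05),
])
service = Node("essential service", "OR", children=[
    power,
    Node("network link down", probability=0.02),
])

print(f"{service.name}: {failure_probability(service):.4f}")
```

Even with rough figures such a tree shows which branch dominates (here the network link rather than the power supply), which is what you need to set priorities.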

The aviation safety approach provides a very good example:  aircraft accidents and near-accidents are meticulously investigated, not to blame somebody but to uncover the cause and to recommend changes to avoid repetition in the future.  And these include human error !

James Reason hypothesizes that most accidents can be traced to one or more of four levels of failure:

Actual risks are fully dependent on your specific business and your specific circumstances, but at a general –more abstract– level you may consider:

Risks:


Consequences:

Apart from the above mentioned direct consequences there are the indirect consequences ('consequential damages'), which usually come at considerably higher costs.  Typical are loss of production, loss of good name (e.g. image as a reliable supplier), and loss of market share.
Regarding production (including all working processes in an organisation), there is a similarity between risk management and quality assurance, in particular for the reduction counter strategy mentioned below.  Good management is to exclude random mishaps.

Some risks/threats may ultimately be unsurvivable;  however, being prepared may buy you sufficient time to survive until the threat subsides (possibly causing victims elsewhere) or outside help comes to the rescue.

For example:  nowadays, nearly all business processes run on computers.  A serious IT malfunction may have a tremendous impact on your business.  Companies that are not prepared for a prolonged IT failure may not survive when such a failure occurs.

'Black Swans':  Black swans are very rare (in fact people long believed swans were always white).  But they do exist, and when you see them they are usually in a flock (i.e. many at once).
The likelihood of an incident usually decreases with its severity (e.g. a hurricane is less likely than a severe storm, which is less likely than a light storm).  However, that is just on average !  Severe incidents may occur shortly after one another, seemingly contradicting the average likelihood (e.g. two 100-year storms in consecutive years).  Such series are called Black Swans;  they are rare but they do exist.
In statistical analysis, the 'black swan' effect is also called the 'fat tail':  the likelihood of an incident decreases with its severity (i.e. it approaches zero asymptotically), but for extreme incidents it drops off more slowly than expected.
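A small calculation helps to put a figure like 'a 100-year storm' into perspective.  The sketch below assumes that years are independent and that the yearly probability is exactly 1 in 100;  both are simplifications, and the function name is merely illustrative.  Clustering of incidents (the black swan effect) makes reality worse than this, not better.

```python
def chance_of_at_least_one(yearly_probability: float, years: int) -> float:
    """Chance that an incident occurs at least once within `years` years.

    Assumes every year is independent and has the same probability.
    """
    return 1.0 - (1.0 - yearly_probability) ** years


p = 1 / 100  # a '100-year storm'
for horizon in (1, 10, 30, 100):
    print(f"{horizon:>3} years: {chance_of_at_least_one(p, horizon):.1%}")
```

Over a planning horizon of 30 years the chance of meeting at least one such storm is roughly one in four, so 'once in 100 years' should not be read as 'not in our time'.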

'Grey Rhinos':  The underestimated consequences of an incident that is highly probable and has massive impact.  A rhino is in general not aggressive, but when it attacks you it will trample you to death.


Risk Analysis document

The Risk Analysis document should describe:

The Risk Analysis document should be reviewed every year, or after major changes (in coverage or circumstances).  Check all assumptions.


Counter Strategy

When the likelihood of incidents and their consequences have been identified, it is time to plan how to deal with each risk:  risk mitigation.  It involves the planning of all the preventive measures to reduce the likelihood of an incident, and/or reduce the consequences of an incident as identified in the Risk Analysis document.  Preferably you want to prevent a risk from occurring;  that would be the ultimate strategy, but it is not always achievable or cost-effective.

As all measures take time and effort (money), you must be effective.  The following table –though very simplistic– demonstrates such an approach.
Risk                  Consequences
                      High           Low
Likelihood   High     Prevent (1)    Counter (2)
             Low      Contain (3)    Neglect (4)

Notes:

  1. As Likelihood and Consequences are both high, such incidents should really be prevented from happening.  Here you will have to spend most effort.
  2. Likelihood is high but Consequences are low, so one should prepare with the emphasis on detecting the incident (and detecting it early) and activate counter measures.  If you detect the incident late it is not good, but consequences still remain limited.
  3. Likelihood is low but Consequences are high, so one should prepare with the emphasis on reducing the consequences (contain consequences & recover using an Incident Management Plan).
  4. As Likelihood and Consequences are low, this category can be ignored.  However, unlikely incidents may have greater impact than expected ('grey rhino').  One should investigate whether likelihood or consequences can be decreased without much cost.  Preferably one may extend the measures required for one of the other cases to cover these cases (i.e. more as an afterthought).
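The table and its notes amount to a simple lookup, which can be sketched as follows.  The `Level` enumeration, the `strategy_for` function and the example register entries are hypothetical and serve only as illustration;  a real register would probably use the five-step grading scales from the Risk Analysis rather than just high and low.

```python
from enum import Enum


class Level(Enum):
    HIGH = "high"
    LOW = "low"


# (likelihood, consequences) -> strategy, following the table above
STRATEGY = {
    (Level.HIGH, Level.HIGH): "Prevent",
    (Level.HIGH, Level.LOW): "Counter",
    (Level.LOW, Level.HIGH): "Contain",
    (Level.LOW, Level.LOW): "Neglect",
}


def strategy_for(likelihood: Level, consequences: Level) -> str:
    """Return the main strategy for a risk with the given grading."""
    return STRATEGY[(likelihood, consequences)]


# Invented register entries: name -> (likelihood, consequences)
register = {
    "disk failure in file server": (Level.HIGH, Level.HIGH),
    "short power dip": (Level.HIGH, Level.LOW),
    "fire in warehouse": (Level.LOW, Level.HIGH),
    "coffee machine breaks down": (Level.LOW, Level.LOW),
}

for risk, (likelihood, consequences) in register.items():
    print(f"{risk}: {strategy_for(likelihood, consequences)}")
```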

The counter measures broadly fall into the following categories:

The above strategies are not mutually exclusive.  In many cases you will use both Contain and Counter (e.g. you can use fire walls, but also fire fighters & sprinklers).  Also all counter measures have their own risk of failure, typically in the area of maintenance:  people tend to get less vigilant when nothing has happened for a long time, and (detection) equipment that has long been unused is likely to fail.

Swiss Cheese Model:  measures are intended to filter out risks, but the filters are not perfect (i.e. they leave some holes).  By having measures at various levels/steps, filtering is enhanced.  However, in some cases or under some conditions an occurrence may still get through (i.e. the holes in the Swiss cheese line up).  This applies in particular if the risk is of human origin:  you then have an intelligent opponent (extremely so in the security domain).
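The effect of stacking imperfect measures can be illustrated with a short calculation.  The sketch assumes that the layers fail independently of each other, and the layer names and figures are invented.  With an intelligent opponent who actively looks for holes that line up, this independence assumption does not hold and the result is too optimistic.

```python
from math import prod


def chance_of_getting_through(hole_sizes) -> float:
    """Chance that an occurrence passes every layer.

    `hole_sizes` holds, per layer, the chance that this layer misses the
    occurrence.  Assumes the layers fail independently.
    """
    return prod(hole_sizes)


# Invented figures per layer
layers = {
    "fire-resistant walls": 0.2,
    "smoke detectors": 0.1,
    "sprinklers": 0.1,
}

print(f"single layer at best: {min(layers.values()):.3f}")
print(f"all layers together:  {chance_of_getting_through(layers.values()):.3f}")
```

Three mediocre layers together filter far better than the best single layer, provided each of them is maintained.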

Insurance is a specific measure to limit the financial consequences of a risk.  However, some remarks are applicable:

As all measures cost money (but so do the consequences), execution of the counter measures requires a business decision.  Initially Counter Measures will be a proposal where parts may or may not be implemented (but the consequences are flagged).
The Counter Strategy document should also be reviewed regularly, and at least when the Risk Analysis changes or after major changes in circumstances.


Incident Management

Incident Management is only concerned with 'what to do when disaster strikes'.
Incident Management does not consider the likelihood of risks, but assumes that an incident has occurred (it may even consider incidents deemed improbable or not addressed by the Risk Analysis).  It is assumed that the (preventive) measures by design have been taken (i.e. they are not part of the Incident Management plan).  It involves picking up an early signal (of an anomaly), detecting (= identifying) the incident and isolating (= containing) it.

A similar term to Incident Management is Problem Management, but that term is much broader and also used for non-vital operational incidents (e.g. SLA not met, product not up to specs).  Both terms have in common that they try to eliminate recurring incidents and minimise the effects of incidents that cannot be eliminated.

Considering that after a major incident there will be a lot of confusion (lacking information or worse:  rumours, misinformation) and a lot of people 'helping' and/or fleeing, the general picture will be of disarray at best, and panic at worst.  To combat that, an Incident Management plan must be like a war plan:  providing structure, effective focus and immediate action.  This is the reason why Incident Management is separated from the Risk Analysis document and the Recovery Plan.
Of course, the likelier a risk the more a predefined counter scenario ('combat plan') should be available.  But an Incident Management Plan should be more abstract than the discrete counter measures determined during the risk analysis:  incidents never occur exactly as foreseen and there will be unexpected circumstances, so the counter actions must have a broader scope.  This also implies that Incident Team members must be trained to act wisely when the incident is not unrolling 'according to the book'.

Most important for effective Incident Management is:

Obviously, communication is of vital importance to execute the above points.


 

Incident Management Plan

The Incident Management Plan must:

The general strategy for action (i.e. apart from reducing the likelihood for an incident) is:

Often combating the cause and containing the consequences are distinct activities requiring different skills, so it will be possible to work with two separate groups without a lot of coordination.


Training

It is important that during an incident no time is wasted on discussions like 'how/why can this happen' or disagreement on the best counter action (that is for later evaluation):  do what you are told and fight the incident.  A short discussion initiated by the responsible person on potential cause and best counter actions is good, but then the responsible person must make a decision, and the rest must carry that out (e.g. don't discuss whether a fire is caused by a fallen candle or an electrical short:  fight the fire !  On the other hand, to extinguish the fire it is important to know whether electricity is involved, or e.g. some flammable fluid).  Similarly, no discussions now on who is responsible for the cause, but support the person who is now responsible for counter measures.
All people involved must be aware of the above, and must have the Incident Management Plan at hand.  It is a good idea to also have the Risk Analysis document and its appendices close at hand as it may provide useful background information (e.g. construction and wiring diagrams, incident consequence schematics).

To achieve swift and effective counter action, these actions must be trained.  Although the training cases will differ from actual incidents, it is the only way to learn how to be effective in a very short time (which is vital).
Start slowly, with 'modest' incidents, and gradually make it more complex.
Regularly –at least once a year– repeat the training, and vary the nature of the incident.

And always evaluate afterwards (what should be improved;  what if something had turned out slightly different).  Avoid any internal 'politics' as that will backfire.

Though the above focuses on physical disasters (as they are common for most businesses), it is similarly applicable to other incidents like product recalls:  you need at least the client/supplier team and the media team.


Remain Vigilant and Pro-active

All plans for any duration require maintenance.  That sounds trivial, but proves to be cumbersome.

As said before, some incidents may ultimately be unsurvivable.  However, being prepared may buy you sufficient time to survive until the threat subsides or outside help comes to the rescue.

Disasters are rarely caused by a single major incident;
usually it is coinciding cases of minor bad luck.

Murphy's Law

When disaster strikes losers have an excuse, but winners have a plan.


Selective Analysis

A nice anecdote about selective analysis, which might have resulted in the opposite of the desired effect.
During World War II the British Air Force had a team analysing the damage to the bomber planes when these returned to base.  They assumed that the German anti-aircraft gunners targeted the weak spots of the bomber, so the Air Force team ordered strengthening of these often hit spots.  Sounds totally sensible, doesn't it ?
In reality, the German anti-aircraft gunners were not targeting any particular spots on the bomber;  they were happy if they hit the bomber at all.  And the bombers hit at any vital spot did not return to base but went down (and weren't analysed).  So what the Air Force team did was strengthen the bombers at non-vital places (making them heavier, less manoeuvrable and able to carry less payload).


Recovery Plan

The Recovery Plan describes the actions to get back to normal operations as quickly as possible after a major incident.  So it is not concerned with fighting the incident but with the reduction of consequential (business) damages:  after the fire has been put out and the flooded areas have been drained, how do you get back to normal business life ?

Not knowing your particular business, it is impossible to provide an appropriate description of what to do.  But the Risk Analysis document should give sufficient leads.  Below are just some examples.

Office
Offices are probably the easiest problem to solve;  usually you can rent office space nearby at a reasonable price.  Office equipment may take some more time but shouldn't be difficult either.  The main problem is that nobody knows where everybody is located now.  So you need some directories, maps, diagrams and directional signs.
And you may have lost a lot of information (e.g. about 'work on hand').
Telecommunications
For telephony & fax the phone company should be able to reroute the traffic to a new address (while keeping the old number).  The PABX and local loops will present more of a problem.
Data communications is slightly more difficult, and is actually part of the ICT topic.
IT
Assuming you have saved a recent backup of all computer data (at a remote site), you could restart the computer centre, except that you have no suitable room and no computers at all.  There are fall-back centres, but you should have a contract with them before the incident occurs, and you should have had trial runs there.  It requires compatible hardware & software, and a lot of parameter settings (e.g. think of the rules in the internet firewall).  This is certainly an area where you must do an exercise:  restore a backup on foreign computers and test your major applications, including telecommunications.
Manufacturing
This will probably take the longest to get operational.  As replacement for a building, tents may be used (they are big and can be set up surprisingly quickly).  But the machinery will be a real problem.  And you won't have any stock.

In general you may have (mutual) agreements with neighbouring businesses to use or share part of their facilities.  Also most customers and suppliers will try to help you.  And there are specialised companies to clean up the debris and recover usable components.

The above already suggests that you probably don't make a single Recovery Plan but one (scenario) for each major domain.  Similarly to the other documents, they must be reviewed and adapted at least once a year (to verify that conditions haven't changed and assumptions still hold).


=O=