Risk Management
Risk management is about reducing or eliminating the negative impact of (potential) incidents. It is not a matter of passively waiting for events and then responding, but of being proactive: preparing and acting in advance. The main axiom of risk management is: be prepared !
In this document the overall Risk Management is split into 4 areas (each should result in a document):
- Risk Analysis: describing what the estimated risks and their consequences are;
- Counter Strategy: describing how you plan to deal with those risks;
- Incident Management Plan: describing how incidents will be handled; and
- Disaster Recovery Plan: describing how to recover from (major) incidents.
Often a single Risk Management Plan is used to cover Risk Analysis, Counter Strategy and Incident Management, but these are very dissimilar aspects of the same topic with a completely different focus: Risk Analysis is a calm, almost academic exercise, while Incident Management is a highly strenuous activity.
Sometimes a Contingency Plan or a Disaster Recovery Plan is used for both Incident Management and Disaster Recovery, but these two are also dissimilar activities: fighting an incident versus getting back up and running again. The time-scale and urgency of these activities are very different.
Here we treat the direct action of the Incident Management ('so much going on, no time to think') and the composed Disaster Recovery Plan ('everything takes so much time') as separate topics.
Risk Analysis
Identify and Analyze Risks
Analysis is best done along the two main aspects of any risk:
- likelihood of an incident: the probability of occurrence of a particular risk; and
- consequences of that particular incident: the effects of an occurring risk, in our context specifically the business impact.
First of all one needs to identify and analyze the risks and their consequences.
- Identify
- To identify risks, start by trying to make a context diagram of the subject, or at least a list of the components and/or services to be managed for risks, and of what is excluded. Get confirmation on the coverage (be aware of what is left out).
- Failure Analysis
- Carefully dissect the essential service(s) into the underlying equipment and subservices (e.g. infrastructure facilities, dependencies, human actions). You can repeat this process for each of the components down to simple (physical and human) parts, and build a tree of failure causes: the failure analysis tree.
This entails a lot of work and is not easy; you need to do this only for vital (and potentially dangerous) systems. If possible –again, it won't be easy– you can attach probabilities to each branch; even if the figures are rough estimates, they help you to set priorities (a minimal sketch of such a tree follows after this list).
More likely you will use a simpler scheme and attach a likelihood such as very unlikely, unlikely, possible, likely or very likely. Even with such a simple scheme it will not be easy to apply more or less objective criteria.
- Consequence Analysis
- For each of the main components under consideration, specify the consequences for the organisation when it fails. Be sure to include the –less obvious– consequential failures (i.e. secondary failures caused by primary failures). Use actual cases of (minor) failures –like a power failure– to obtain information on failure impact.
Here you will probably use a scheme to grade the consequences, such as limited, considerable, serious, very serious and catastrophic.
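As an illustration of attaching rough probabilities to a failure tree, here is a minimal sketch. The component names and figures are invented, and the independence of failures is an assumption that rarely holds exactly; treat it as a way to rank causes, not as an exact calculation.

```python
# Minimal failure-tree sketch (illustrative only): rough, assumed probabilities
# are combined through OR/AND gates to rank failure causes.

from dataclasses import dataclass

@dataclass
class Basic:
    name: str
    p: float           # rough yearly probability of this basic failure

@dataclass
class Gate:
    name: str
    kind: str          # "OR": any child failing suffices; "AND": all children must fail
    children: list     # Basic and/or Gate nodes

def probability(node) -> float:
    """Combine child probabilities, assuming independent failures."""
    if isinstance(node, Basic):
        return node.p
    child_ps = [probability(c) for c in node.children]
    if node.kind == "AND":
        result = 1.0
        for p in child_ps:
            result *= p
        return result
    # OR gate: 1 minus the probability that none of the children fail
    none_fail = 1.0
    for p in child_ps:
        none_fail *= (1.0 - p)
    return 1.0 - none_fail

# Hypothetical example: the service is down if the power fails (grid outage AND
# the UPS fails), or if the single application server crashes.
power = Gate("power loss", "AND", [Basic("grid outage", 0.10), Basic("UPS fails", 0.05)])
tree = Gate("service down", "OR", [power, Basic("app server crash", 0.02)])
print(f"estimated yearly probability of '{tree.name}': {probability(tree):.3f}")
```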
The aviation safety approach provides a very good example: aircraft accidents and near-accidents are meticulously investigated, not to blame somebody but to uncover the cause and to recommend changes to avoid repetition in the future. And these include human error !
James Reason hypothesizes that most accidents can be traced to one or more of four levels of failure:
- active failures
- the unsafe acts themselves;
e.g. human error.
- and latent failures (contributing factors in the system that may have lain dormant for a long time):
- organizational influences;
e.g. reduction in training, focus on costs and not on safety.
- unsafe supervision;
e.g. teaming up inexperienced operators during adverse conditions, novices on the night shift.
- preconditions for unsafe acts;
e.g. a fatigued crew, improper communication procedures (I thought 'the bridge is open' meant...), strange exceptions (one switch in a row operates just the opposite way), etc.
Actual risks are fully dependent on your specific business and your specific circumstances, but at a general –more abstract– level you may consider:
Risks:
- flooding, fire, earthquakes, contamination, explosion, etc.
Don't think that the above list is not applicable to you: flooding may be caused by an overflowing river but also by a burst water pipe, and an explosion may be caused by a gas leak. And instead of an earthquake, you may be confronted with a structural weakness in a building (maybe after it was hit by a heavy truck), causing it to collapse.
- prolonged failure of infrastructure: transport (blocking of access roads); absence of water, electricity, gas, telecommunications (voice & data), etc.
- simultaneous breakdown of vital equipment or computer systems (e.g. due to a disastrous update);
- lack of particular material, loss of a vital supplier;
- strikes, riots;
- serious product (quality) problems;
- spilling of (toxic/inflammable/environmental hazardous) gas, liquids or particles;
These do not necessarily originate in your own organisation; they may for example come from a neighbouring plant. But the consequences apply to you as well.
- improper behaviour by personnel, financial irregularities, fraud, corruption/bribery, nepotism;
Wherever money goes around (i.e. in every business), there is a risk.
- key personnel not available (holiday, illness/epidemic, death, accident/killed/suicide on the premises, …), leaving the company (to a competitor/customer/supplier), on strike, …;
- cyber crimes, hacks, theft of sensitive data (privacy, trade secrets, …), DOS attacks, ransomware;
- confidential information/documents missing/stolen/leaking to the public, or to competitors;
- loss or bankruptcy of major customers (i.e. bad debt), take-over of supplier/partner/customer by competitor;
- extreme low share prices, lack of credit, hostile take-over;
- complex (and risky) operations with ignorant management (e.g. products that management doesn't understand);
- charismatic leader surrounded by yes-men (tunnel vision).
Consequences:
- an area or (part of) a building inaccessible (factory, office wing, vital space like central hall, store/boiler/airco/computer room).
- no power for equipment, no lighting, no computers, no cooling/heating, etc.
- major delay or inability to deliver products (publicly visible, high-profile);
- product recall;
- serious PR problems;
- corporate melt-down.
Apart from the above-mentioned direct consequences there are the indirect consequences ('consequential damages'), which usually come at considerably higher costs. Typical are loss of production, loss of good name (e.g. the image as a reliable supplier), and loss of market share.
Regarding production (including all working processes in an organisation), there is a similarity between risk management and quality assurance, in particular for the reduction counter strategy mentioned below. Good management aims to exclude random mishaps.
Some risks/threats may ultimately be unsurvivable; however, being prepared may buy you sufficient time to survive until the threat subsides (possibly causing victims elsewhere) or outside help comes to the rescue.
For example: nowadays, all business processes run on computers. A serious IT malfunction may have a tremendous impact on your business. History shows that companies not prepared for a prolonged IT failure do not survive when such a failure occurs.
'Black Swans': Black swans are very rare (in fact, people long believed all swans were white). But they do exist, and when you see one there are usually more around (they come in a flock).
The likelihood of an incident is usually inversely proportional to its severity (e.g. a hurricane is less likely than a severe storm, which in turn is less likely than a light storm). However, that is just on average ! Severe incidents may occur shortly after one another, seemingly contradicting the average likelihood (e.g. two 100-year storms in subsequent years). Such series are called Black Swans; they are rare, but they do exist.
In statistical analysis the 'black swan' effect is also called the 'fat tail': the likelihood of an incident is inversely proportional to its severity (i.e. asymptotically approaching zero), but apparently there is a bump in the tail.
'Grey Rhinos': the underestimated consequences of an incident.
Seriously probable, and with massive impact.
A rhino is in general not aggressive, but when it does attack, it will trample you to death.
Risk Analysis document
The Risk Analysis document should describe:
- its coverage: what is the extent of the analysis (and what is not covered; context diagram);
- all identified risks, their estimated likelihood and their consequences, in a quantitative manner (it is not an exact science, but neither is any business undertaking: cover the business impact, so ultimately express the consequences in money; a minimal prioritisation sketch follows below). It is probably practical to use some kind of categorisation. Pay extra attention to infrastructure, business and PR.
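One common way to make this quantitative is to multiply the estimated yearly likelihood by the estimated impact in money and rank the risks by that expected loss. A minimal sketch, assuming such per-year estimates can be made; all risk names and figures below are invented illustrations, not real estimates.

```python
# Rough prioritisation sketch: expected yearly loss = occurrences per year x impact in money.
# All names and numbers are invented illustrations.

risks = [
    # (risk, estimated occurrences per year, estimated impact in EUR)
    ("burst water pipe floods computer room", 0.1, 250_000),
    ("prolonged power failure",               0.5,  40_000),
    ("ransomware on the office network",      0.2, 500_000),
]

for name, per_year, impact in sorted(risks, key=lambda r: r[1] * r[2], reverse=True):
    print(f"{name:42s} expected yearly loss: EUR {per_year * impact:>10,.0f}")
```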
The Risk Analysis document should be reviewed every year, or after major changes (in coverage or circumstances). Check all assumptions.
Counter Strategy
When the likelihood of incidents and their consequences have been identified, it is time to plan how to deal with each risk: risk mitigation.
It involves planning all the preventive measures to reduce the likelihood of an incident, and/or to reduce the consequences of an incident, as identified in the Risk Analysis document.
Preferably you want to prevent a risk from occurring at all; that would be the ultimate strategy, but it is not always achievable or cost-effective.
As all measures take time and effort (money), you must be effective. The following table –though very simplistic– demonstrates such an approach.
| Risk | Consequences: High | Consequences: Low |
|---|---|---|
| Likelihood: High | Prevent (1) | Counter (2) |
| Likelihood: Low | Contain (3) | Neglect (4) |
Notes:
- (1) Prevent: as both Likelihood and Consequences are high, such incidents should really be prevented from happening. Here you will have to spend most of the effort.
- (2) Counter: Likelihood is high but Consequences are low, so prepare with the emphasis on detecting the incident (and detecting it early) and on activating counter measures. Detecting the incident late is not good, but the consequences still remain limited.
- (3) Contain: Likelihood is low but Consequences are high, so prepare with the emphasis on reducing the consequences (contain the consequences & recover, using an Incident Management Plan).
- (4) Neglect: as both Likelihood and Consequences are low, this category can be ignored. However, unlikely incidents may have a greater impact than expected ('grey rhino'). Investigate whether the likelihood or the consequences can be decreased without much cost. Preferably, extend the measures required for one of the other cases to cover these cases as well (i.e. more as an afterthought).
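The table above can be read as a simple decision rule. Below is a minimal sketch of that mapping; the boundary between 'high' and 'low' is a judgement call you have to make per risk.

```python
# Minimal sketch of the strategy table above: map (likelihood, consequences) to
# a counter strategy. What counts as "high" is a per-risk judgement call.

def strategy(likelihood_high: bool, consequences_high: bool) -> str:
    if likelihood_high and consequences_high:
        return "Prevent (1): spend most effort here"
    if likelihood_high:
        return "Counter (2): detect early, activate counter measures"
    if consequences_high:
        return "Contain (3): limit consequences, prepare an Incident Management Plan"
    return "Neglect (4): but check for cheap reductions ('grey rhino')"

print(strategy(likelihood_high=False, consequences_high=True))
```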
The counter measures broadly fall into the following categories:
- The counter measures to reduce the likelihood of a risk by design (i.e. try to prevent the incident from happening):
- install fences and gates at the outer boundary, both physical and logical (e.g. a network firewall);
use non-flammable materials to avoid fire, etc.
- adapt rules, protocols, procedures and human behaviour (make them aware what is at stake).
If you set rules for people to reduce risks (e.g. 'no smoking', 'keep fireproof/watertight doors closed'), you also need surveillance/policing to enforce these rules.
- The counter measures to reduce the consequences of an incident by design (i.e. given the incident, contain the consequences).
The operative word here is compartmentalize: define compartments and create inner barriers (physical and logical) between them.
E.g. use fire walls to restrict fire and watertight doors to restrict flooding to a single compartment, and use geographic separation (a remote site as back-up).
There is probably some overlap with reducing the likelihood of a risk.
The problem is that you have to assure that no single incident can impact multiple compartments.
That is actually much harder to achieve than it seems.
For example, it implies that if you use a separate back-up centre, you need a considerable geographical distance between them to avoid a common risk (e.g. power loss, flooding, storm lock-out, riots).
Have backups (at least 2 versions, stored at 2 distinct locations) and sufficient capacity, in particular for infrastructure (if you have an extra (power) source, make sure each of the sources can provide all that is required).
Efficiency is the enemy of robustness.
If you rely on Redundancy, please check over there for pitfalls.
A special form of containing (financial) consequences is Insurance.
- The counter measures to reduce the consequences of an incident by action (i.e. combat serious consequences).
Accept the incident but organize counter actions: damage control.
To be effective, the incident must be detected early, and counter measures must be largely pre-planned (fire & smoke detectors and/or patrolling, extinguishing equipment, fire-fighting teams).
This is a more flexible approach, but it poses a considerable risk in itself: human performance. You have to make sure that the detection means are maintained and the counter measures are regularly trained, and you have to consider poor behaviour (a lazy/lax attitude, 'alarm fatigue' due to false alarms, etc.). This subject is further elaborated in the section Incident Management Plan.
This will become the Incident Management Plan (see next section).
The above strategies are not mutually exclusive.
In many cases you will use both Contain and Counter (e.g. you can use fire walls, but also fire fighters & sprinklers). Also, all counter measures have their own risk of failure, typically in the area of maintenance: people tend to get less vigilant when nothing has happened for a long time, and (detection) equipment that has long been unused is likely to fail.
Swiss Cheese Model: measures are intended to filter out risks, but the filters are not perfect (i.e. they have holes). By having measures at various levels/steps, the filtering is enhanced. However, in some cases or under some conditions an incident may still get through (i.e. the holes in the Swiss cheese line up), in particular if the risk is of human origin: you then have an intelligent opponent (extremely so in the security domain).
Insurance is a specific measure to limit the financial consequences of a risk. However, some remarks are applicable:
- commonly only the direct consequences of a risk can be insured, not the consequential damages (or only at exorbitant rates);
- insurance is not cheap; if you can bear the consequences of a risk it is often cheaper not to insure but to accept the costs yourself;
- insurance companies may require measures (such as the Reduce/Contain/Counter measures mentioned above) to limit potential claims.
As all measures cost money (but so do the consequences), execution of the counter measures requires a business decision. Initially the counter measures will be a proposal of which parts may or may not be implemented (but the consequences of leaving parts out must be flagged).
Here too, the Counter Strategy document should be reviewed regularly, and at least when the Risk Analysis changes or after major changes in circumstances.
Incident Management
Incident Management is only concerned with 'what to do when disaster strikes'.
Incident Management is not considering the likelihood of risks, but assumes that an incident has occurred (it may even consider incidents deemed improbable or not addressed by the Risk Analysis).
It is assumed that the (preventive) measures by design have been taken (i.e. not part of the Incident Management plan).
It involves early signalling (of an anomaly), detection (identification) and isolation (containment).
A similar term to Incident Management is Problem Management, but that term is much broader and also used for non-vital operational incidents (e.g. SLA not met, product not up to specs).
Both terms have in common that they try to eliminate recurring incidents and minimise the effects of incidents that cannot be eliminated.
Considering that after a major incident there will be a lot of confusion (lacking information or worse: rumours, misinformation) and a lot of people 'helping' and/or fleeing, the general picture will be one of disarray at best and panic at worst. To combat that, an Incident Management Plan must be like a war plan: providing structure, effective focus and immediate action.
This is the reason why Incident Management is separated from the Risk Analysis document and the Recovery Plan.
Of course, the likelier a risk the more a predefined counter scenario ('combat plan') should be available. But an Incident Management Plan should be more abstract than the discrete counter measures determined during the risk analysis: incidents never occur exactly as foreseen and there will be unexpected circumstances, so the counter actions must have a broader scope.
This also implies that Incident Team members must be trained to act wisely when the incident does not unfold 'according to the book'.
Most important for effective Incident Management is:
- Get an overview of what is going on (what has happened, and what the immediate consequences and further risks are);
However, if you wait for accurate or confirmed information, you will be too late with your actions.
- Activate and coordinate counter measures.
The first priority is to contain the problem; prefer taking more radical decisions (and lifting some restrictions later) over taking less thorough measures (make at least 'no regret' decisions).
Obviously, communication is of vital importance to execute the above points.
Incident Management Plan
The Incident Management Plan must:
- Clearly define domains of responsibilities both in physical/geographical way and in logical/functional way. For a part these domains have already been defined by the Risk Analysis. Hopefully, an incident will be limited to a single domain. Public Relations (PR) is a separate domain (see below).
- Define a command hierarchy for each of the domains, and for the overall responsibility (military style).
Define a single responsible person and a second in command for all the positions, and for key positions a third in command. The backup for any position must be clear (e.g. if the responsible person, his second and his third in command are not available, the command is transferred to the next higher level in the hierarchy which may appoint a person).
These persons should be recognisable, e.g. by a coloured helmet, cap or vest. Though that may sound childish, it provides visibility and authority to the person in command during the initial confusion, and helps to get coherent action and avoids panic.
During the selection of these responsible persons, be aware that key personnel in the normal hierarchy are often barely capable of handling normal operations, and will fall short in handling a crisis due to lack of expertise and experience. A change in the normal organisation hierarchy does not necessarily change the responsible person.
Identify key personnel, and make sure they are part of the incident management team. The maintenance man who knows where all the pipes and cables go and where the valves and switches are is more valuable than some director.
- Define an information/communication network connecting the domains of responsibility and the levels within the hierarchy of each domain: which position communicates with which other position outside its own hierarchy (within a hierarchy the communication obviously follows the hierarchy).
Define which communication mechanism (and which backup) is going to be used.
Take into account that normal communication channels may be out of service or overloaded due to the incident.
The availability of a public address (PA) system can be very useful and may even be required (e.g. for a quick evacuation).
Good communication is required to fight the confusion and is vital for effective counter actions.
Be clear in all communications: tell what you know for a fact; what you suspect, consider or assume; and what you are already doing and about to do with all that. Don't use PR as 'Perception Remanipulation' if you want to look professional and stay credible.
Recommunicate the available information at regular intervals (e.g. every 2 hours), or when new facts become available or new major actions will be taken.
Consider loss of communication, regarding both technical communication means and unavailable key personnel: people should not wait for commands to put them into action, but do what (they think) is urgently required (if you are afraid people may do the wrong thing, have them trained).
- Communication with the outside world, and specifically with the media, is a separate domain: have
- a team/person to communicate with the business contacts (clients and suppliers; you may need to agree with them on contact persons beforehand);
- a person/team for public services (police, fire brigade, the mayor, hospitals and ambulances, …); and
- a team for the media/public (acknowledge that the perception by the public is vital; this is a subject on its own).
Of course there should be coordination between the above teams, but nobody talks to the outside world except the designated team(s), in particular if a PR-sensitive incident has occurred.
- Define the locations where the people involved in Incident Management should gather, and how other people should be evacuated if required. You can reserve an incident room with facilities (PCs, documentation, etc.), but keep in mind that the planned room may not be available due to the incident. If such an incident room is very important to fight a disaster, you may need a backup at another location on-site, or even at a site off-premises.
For a petrochemical plant, the control room is in a bunker.
- Consider the use of remote sensors like fire detectors, water/gas pressure and voltage/current gauges, door/ventilation-closed indicators, cameras, etc. If presented in a schematic way, they help to provide overview and avoid mistakes. For some infrastructure you may even want remote control (like remotely controlled valves).
- Make an 'escalation plan', describing which actions are taken at which level of threat or (potential) impact, like evacuation of a building, involvement of or communication with higher management and/or other parties (ultimately public parties like police, fire brigade, mayor, media), etc. An illustrative sketch of such an escalation table is given after this list.
- Consider the case that countering an incident takes a prolonged time: people need to eat and rest, stocks run out, etc. You may want to have special facilities and a minimum stock of various items for emergency purposes.
- Prepare several major scenarios to reduce discussions on what to do and to avoid poor decisions: they act as templates for a pre-planned response. Scenarios are also useful as exercises. Of course, with actual incidents such scenarios must be adapted on-the-fly (by the responsible person). Have at least one scenario for each of the major incident types and areas/domains, and have backup plans (at least a plan B) for when things do not work out as planned (they usually don't).
- Have checklists for the people involved. They help them to take the right actions and not make a 'stupid mistake' like forgetting to switch off utilities (e.g. the gas flow feeding a fire). This is not always unambiguous, as turning off the power to a building will leave everybody there in the dark and disable potentially useful electric equipment (sensors & actuators, controls, computers, …).
- Have a documentation set including drawings of buildings, construction diagrams and utility schematics. Make sure that they are kept updated. Consider a documentation set off-site (so not impacted by any local disaster).
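As an illustration of the escalation plan mentioned above, here is a minimal sketch; the levels, thresholds and actions are made-up placeholders, not a recommendation for any specific organisation.

```python
# Illustrative escalation table: given an assessed threat level, list the
# pre-planned actions. All levels and actions are made-up placeholders.

ESCALATION = {
    1: ["log the incident", "notify the domain responsible"],
    2: ["activate the domain incident team", "inform site management"],
    3: ["evacuate the affected building", "inform higher management", "alert fire brigade/police"],
    4: ["activate the full Incident Management Plan", "inform the authorities", "activate the media team"],
}

def actions_for(level: int) -> list:
    """Return all actions up to and including the given escalation level."""
    return [action for lvl in sorted(ESCALATION) if lvl <= level for action in ESCALATION[lvl]]

print(actions_for(3))
```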
The general strategy for action (i.e. apart from reducing the likelihood for an incident) is:
- prepare for incidents: make Incident Management scenarios and do exercises, in particular with variants on the foreseen scenarios;
- detect an incident as soon as possible (by detectors, surveillance & patrols, …). Be aware that detectors (e.g. video cameras) have 'blind spots', which form a risk (in particular when dealing with human adversaries: security). Test the detectors and the vigilance of the patrols regularly.
- sound the alarm, which will activate the people involved in the Incident Management Plan and provide time to evacuate others;
- contain the consequences of an incident ('isolate', damage control);
- combat the cause (fire fighters).
Often combating the cause and containing the consequences are distinct activities requiring different skills, so it is possible to work with two separate groups without a lot of coordination.
Training
It is important that during an incident no time is wasted on discussions like 'how/why could this happen' or on disagreement about the best counter action (that is for the evaluation afterwards): do what you are told and fight the incident. A short discussion initiated by the responsible person on the potential cause and the best counter actions is good, but then the responsible person must make a decision, and the rest must carry it out (e.g. don't discuss whether a fire was caused by a fallen candle or an electrical short: fight the fire ! On the other hand, to extinguish the fire it is important to know whether electricity is involved, or e.g. some flammable fluid). Similarly, no discussions now about who is responsible for the cause; support the person who is now responsible for the counter measures.
All people involved must be aware of the above, and must have the Incident Management Plan at hand. It is a good idea to also have the Risk Analysis document and its appendices close at hand as it may provide useful background information (e.g. construction and wiring diagrams, incident consequence schematics).
To achieve swift and effective counter action, these actions must be trained. Although the training cases will differ from actual incidents, it is the only way to learn how to be effective in a very short time (which is vital).
Start slowly, with 'modest' incidents, and gradually make it more complex.
Regularly –at least once a year– repeat the training, and vary the nature of the incident.
And always evaluate afterwards (what should be improved; what if something had turned out slightly differently). Avoid any internal 'politics', as that will backfire.
Though the above focuses on physical disasters (as they are common to most businesses), it is similarly applicable to other incidents like product recalls: there you need at least the client/supplier team and the media team.
Remain Vigilant and Pro-active
All plans that are to remain valid for any length of time require maintenance. That sounds trivial, but it proves to be cumbersome.
- Redo the Risk Analysis and the Counter Strategy whenever there are major changes, and verify at least once a year whether conditions have changed and assumptions still hold.
- Review the Incident Management Plan similarly.
- Assure that counter measures remain active by testing and training (detectors working, counter measures regularly trained, e.g. by trial alarms).
Have a yearly inspection/audit of the 'counter measures by design' (the containment measures, category 3 in the table above) to see whether the measures are still adequate and well maintained. Test them.
- Have an 'incident exercise' at least once a year (and initially more) to smooth out any shortcomings and provide the opportunity for all (newly) involved to learn their role. What is rarely done, is rarely done well.
- Keep track of 'minor incidents', analyse them and learn from them. Also use incidents occurring elsewhere as examples. What if several of such incidents had turned out less lucky and had coincided ?
Always try to find the 'root cause': why did this incident happen in this way ? And –in particular for human errors– what can you do to avoid it happening again ?
- Don't keep relevant documents on the central computer only (if IT or power fails, you have nothing). Have hardcopies and/or copies on some laptops (with extra batteries).
As said before, some incidents may ultimately be unsurvivable. However, being prepared may buy you sufficient time to survive until the threat subsides or outside help comes to the rescue.
Disasters are rarely caused by a single major incident;
usually it is coinciding cases of minor bad luck.
Murphy's Law
- When something can go wrong, it will.
- It will do so at the most inconvenient moment.
When disaster strikes losers have an excuse, but winners have a plan.
Selective Analysis
A nice anecdote about selective analysis, which might have resulted in the opposite of the desired effect.
During World War II the British Air Force had a team analysing the damage to the bomber planes when these returned to base.
They assumed that the German anti-aircraft gunners targeted the weak spots of the bomber, so the Air Force team ordered strengthening of these often hit spots.
Sounds totally sensible, doesn't it ?
In reality, the German anti-aircraft gunners were not targeting any particular spots on the bomber; they were happy if they hit the bomber at all.
And the bombers hit at any vital spot didn't return to base but went down (and weren't analysed).
So what the Air Force team did was strengthen the bombers at non-vital places (making them heavier and less manoeuvrable, and allowing less payload).
Recovery Plan
The Recovery Plan describes the actions needed to get back to normal operations as quickly as possible after a major incident. So it is not concerned with fighting the incident, but with the reduction of consequential (business) damages: after the fire has been put out and the flooded areas have been drained, how to get back to normal business life.
Not knowing your particular business, it is impossible to provide an appropriate description of what to do, but the Risk Analysis document should give sufficient leads. Below are just some examples.
- Office
- Offices are probably the easiest problem to solve; usually you can rent office space nearby at a reasonable price. Replacing office equipment may take some more time, but shouldn't be difficult either. The main problem is that nobody knows where everybody is located now, so you need directories, maps, diagrams and directional signs.
And you may have lost a lot of information (e.g. about 'work on hand').
- Telecommunications
- For telephony & fax the phone company should be able to reroute the traffic to a new address (while keeping the old number). The PABX and local loops will present more of a problem.
Data communications is slightly more difficult, and is actually part of the ICT topic.
- IT
- Assuming you have a recent backup of all computer data (saved at a remote site), you could restart the computer centre, except that you have no suitable room and no computers at all. There are fall-back centres, but you should have a contract with them from before the incident occurred, and you should have had trial runs there. It requires compatible hardware & software, and a lot of parameter settings (e.g. think of the rules in the internet firewall). This is certainly an area where you must do an exercise: restore a backup on foreign computers and test your major applications, including telecommunications (a minimal restore-verification sketch is given after this list).
- Manufacturing
- This will probably take the longest to get operational again. As a replacement for a building, tents may be used (they are big and can be set up surprisingly quickly). But the machinery will be a real problem. And you won't have any stock.
In general you may have (mutual) agreements with neighbouring businesses to use or share part of their facilities. Also, most customers and suppliers will try to help you. And there are specialised companies to clean up the debris and recover usable components.
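One small, concrete piece of such a restore exercise is verifying that the restored data is complete and intact. Below is a minimal sketch, assuming a manifest of SHA-256 checksums written at backup time; the paths and the manifest format are illustrative assumptions, not a prescribed layout.

```python
# Minimal restore-verification sketch: compare SHA-256 checksums of restored files
# against a manifest written at backup time. Paths and manifest format are
# illustrative assumptions.

import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(restore_root: Path, manifest: Path) -> list:
    """Return the files that are missing or whose checksum differs from the manifest."""
    problems = []
    for line in manifest.read_text().splitlines():
        expected, rel = line.split(maxsplit=1)   # assumed format: "<sha256> <relative path>"
        target = restore_root / rel
        if not target.exists() or sha256(target) != expected:
            problems.append(rel)
    return problems

# Usage (hypothetical paths):
# print(verify_restore(Path("/mnt/restore-test"), Path("backup-manifest.txt")))
```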
The above already suggests that you probably don't make a single Recovery Plan but one (scenario) for each major domain. Similarly to the other documents, they must be reviewed and adapted at least once a year (to verify that conditions haven't changed and assumptions still hold).
=O=