
A for Availability – BCP and DR failures

DR/BCP effectiveness

A look into the A part of C-I-A, namely Availability.

Often overlooked and not always tested with rigour. Closely coupled with change management, resource management and long-term strategy.

Although this article talks about a power outage, the shortcomings of the recovery and the lessons learnt can be applied to all types of outages.

Recent BCP failures

Delta Air Lines suffered an outage in August 2016 that cost it 150M USD. This was followed by another two-day outage on the 29th and 30th January 2017, with 280 flights cancelled.

British Airways (BA) had a serious outage starting on 27th May 2017, over the late May bank holiday weekend. Power problems at the data centre resulted in delays and cancellations over the following three days, affecting around 75,000 travellers. Shares in BA's parent company IAG (International Airlines Group) fell 4%, and compensation is expected to cost in the region of 150M GBP.

[Image: Woman waiting for her flight at Heathrow T5, London, UK. (c) Reuters]

BCP/DR failure in general

I’ve seen quite a few similar failures in various enterprises, large and small. They range from poor application of the disaster recovery (DR) process to the use of out-of-date processes.

The situation is usually exacerbated by poor communication between operational staff and decision makers. Often people on the ground are afraid of invoking business continuity plans (BCP) or DR plans.

There are many more contributing factors: unfamiliarity with the restore process, key people being away, changes (such as system updates) made whilst key people are away, unauthorised changes, and poor configuration of the failover or synchronisation mechanisms of production systems.

Wonder what happened at BA?

This is reading between the lines and speculation on my part, so take it with a pinch of salt; I don't have an inside view of IT operations there.

How it started

The catalyst for the outage was a power failure that was recovered from within minutes. Assuming a standard robust architecture, the operation should have N+1 resilience: one more parallel site or system than is needed to carry the load, so that the loss of any single one can be withstood.

Resilience and Diversity

This type of operation will have geographical and service diversity, so another instance of the database is running elsewhere and should be run in a failsafe synchronised mode: both instances are updated at the same time and a change is committed only when both agree.
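
As a rough illustration of that "commit only when both agree" behaviour, here is a minimal Python sketch. The `Replica`, `prepare` and `synchronous_write` names are hypothetical, made up for this example; in a real deployment this would be the database's own synchronous replication or two-phase commit machinery.

```python
# Minimal sketch of a synchronous two-site commit: a change is acknowledged
# only once both the local and the remote instance have accepted it.
# All names here are hypothetical and for illustration only.

class ReplicaUnavailable(Exception):
    """Raised when a replica cannot accept a write."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []          # stands in for a durable write-ahead log
        self.online = True

    def prepare(self, record):
        if not self.online:
            raise ReplicaUnavailable(self.name)
        self.log.append(("prepared", record))

    def commit(self, record):
        self.log.append(("committed", record))


def synchronous_write(primary, secondary, record):
    """Commit a change only when both instances have accepted it."""
    primary.prepare(record)
    secondary.prepare(record)      # the remote site must also accept
    primary.commit(record)
    secondary.commit(record)


if __name__ == "__main__":
    site_a, site_b = Replica("site-a"), Replica("site-b")
    synchronous_write(site_a, site_b, {"booking": "BA0123", "seat": "14A"})
    assert site_a.log == site_b.log    # both sites hold the same history
```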

When one of the databases goes down, due to communication or power issues, the other should detect this, run independently and become the sole primary. When the failed database comes back online, it should recognise that it has been offline and start re-synchronising with its peer.
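
Continuing the same sketch (it reuses the hypothetical `Replica` and `ReplicaUnavailable` classes above), the failover and catch-up behaviour might look roughly like this; real systems add quorum, fencing and log shipping on top.

```python
# Continuation of the sketch above: the surviving site carries on as sole
# primary when its peer cannot acknowledge a write, and a returning replica
# replays whatever it missed before rejoining. Illustrative only.

def write_with_failover(primary, secondary, record):
    """Synchronous write, degrading to sole-primary mode if the peer is down."""
    primary.prepare(record)
    try:
        secondary.prepare(record)
        secondary.commit(record)
    except ReplicaUnavailable:
        secondary.online = False       # peer is now known to be out of sync
    primary.commit(record)


def resynchronise(returning, current_primary):
    """On rejoin, replay the log entries the returning replica missed."""
    missed = current_primary.log[len(returning.log):]
    returning.log.extend(missed)
    returning.online = True


if __name__ == "__main__":
    site_a, site_b = Replica("site-a"), Replica("site-b")
    write_with_failover(site_a, site_b, {"booking": "BA0001"})
    site_b.online = False                      # power or network loss at site B
    write_with_failover(site_a, site_b, {"booking": "BA0002"})
    resynchronise(site_b, site_a)              # site B returns and catches up
    assert site_a.log == site_b.log
```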

Trust issues

Given that they couldn't trust the backup instance, the implication is that the synchronisation wasn't working or had been switched off. They had to rely on a full restore, which was likely to be at least a day old, so all the changes made after that backup were lost, which was guaranteed to cause subsequent operational issues.
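
To put a rough number on what "at least a day old" means, here is a back-of-the-envelope calculation of the data-loss window (the recovery point, in BCP terms). The timestamps are invented purely for illustration and are not BA's actual backup schedule.

```python
# Back-of-the-envelope recovery point calculation: everything written between
# the last good backup and the failure is lost on a full restore.
# Both timestamps below are assumptions made up for this illustration.

from datetime import datetime

last_good_backup = datetime(2017, 5, 26, 2, 0)   # assumed nightly backup run
failure_time     = datetime(2017, 5, 27, 9, 30)  # assumed start of the outage

loss_window = failure_time - last_good_backup
print(f"Data lost on a full restore: everything from the last {loss_window}")
# With a healthy synchronous replica, this window would be close to zero.
```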

Was it a loss of corporate memory?

Support systems and corporate memory have a bearing on the effectiveness of non-BAU (business as usual) tasks. When there are no problems, almost anyone or any team can look after things; it's a different matter when things don't go to plan. This is when in-depth knowledge and experience count. Outsourcing has a reputation for not being able to capture or retain such valuable information, and in reality only a proportion of knowledge is ever reflected in documentation.

When people leave, they always take some of that experience and knowledge with them.

Of course there are the black-swan type events, the one-in-a-million incidents; for those, recovery will still take longer than expected.

(For the power engineers: I've seen the root cause analysis of a power failure where the windings of one of the DRUPS units shorted, causing a cascade failure of the PDUs downstream. Although the other DRUPS units were operating fine, the protection units in the PDUs detected the short as a fault condition and disconnected the rest of the supplies. This was a situation the client couldn't really plan for, but the recovery was hampered by organisational shortcomings, even though the DR had been tested less than 12 months prior. As a result, what should have been a 15-minute outage (the faulty circuit was isolated and the other PDUs brought back online manually) lasted half a day.)

How to mitigate these potential risks?
With robust BCP and DR:

  • Well defined and up-to-date business and operation processes
  • Appropriate level of resilience built into the human resources, infrastructure, processes and culture
  • Clear channels of decision making which allow for decisions to be made on the ground
  • Careful and considered outsourcing
  • Current and accurate business impact analysis (BIA)
  • Properly rehearsed and updated playbooks (comprehensive and regular testing)

Further reading and sources on the British Airways IT outage

https://www.ft.com/content/15cab698-4372-11e7-8519-9f94ee97d996

http://www.independent.co.uk/travel/news-and-advice/british-airways-system-shutdown-heathrow-gatwick-your-riights-a7760536.html

http://www.japantimes.co.jp/news/2017/05/28/business/global-british-airways-systems-failure-creates-travel-chaos-power-issue-blamed/#.WS1SqTOZNE4

https://www.theguardian.com/business/2017/may/30/british-airways-ba-owner-drops-value-it-meltdown