Massive Attack

There have been some very public examples of cyber attacks affecting organisations on a global scale. Despite the prominence given to recent outbreaks, such as WannaCry in May 2017 and NotPetya in June 2017, the first recorded malware outbreaks happened much earlier.

Some history

John von Neumann’s “Theory of Self-Reproducing Automata”, published posthumously in 1966, described self-replicating artificial forms: how they would spread, mutate and be self-determining.

The first examples

1981 saw one of the first computer viruses, written by Richard Skrenta to infect Apple II computers. Five years later, in 1986, came one of the first PC viruses, known as Brain among other monikers, written by Basit Alvi.

Why ?

There have been many more pieces of malware in the intervening years. The main difference between these and current malware infections is that the earlier instances were programmers showcasing their skill and ingenuity, mainly for bragging rights and inter-peer competition. They were often created as proofs of concept and were sometimes released by mistake.

Interconnected

Another aspect worth mentioning is that the world has moved on since the 1970s. Personal computers then were not generally interconnected; they were standalone and strictly the domain of hobbyists.

The Internet

The internet originated in the 1960s as a USA government project and became a commercial prospect in 1983, with few private users. By 1995 there were 16 million users, and by the turn of the new century more than 300 million. A decade later there were 2000 million users, and we are on course for 4000 million in 2017. This is exponential growth in action: the effective network proximity means that something that happens on the other side of the world can affect you milliseconds later.

There are not many commercial or governmental organisations which are not internet connected. Domestic connectivity has also mirrored this growth, which has been taken into the mobile and IoT space as well.

Ransomware

The first recorded instance of ransomware was the AIDS Info Disk, written for PCs in 1989 by Dr. Joseph Popp. It was malware that demanded 189 USD to be paid as a licence fee.

Ransomware is now a commercial enterprise; organised crime has seen the potential for great ROI (return on investment) for little risk.

There have also been rumours of nation state involvement in malware, which have been loosely substantiated by leaks, revelations and evidence from whistleblowers. These have been carefully crafted and targeted attacks. One such example is Stuxnet, designed to damage centrifuges used by Iran in a uranium enrichment programme.

SWIFT

Attacks on the global interbank transfer service, SWIFT, netted more than 80M USD in 2016. A similar heist was reported in Ecuador, and there was an attempt to defraud a bank in India this year.

More recently came WannaCry and NotPetya, in May and June 2017. These last two leveraged stolen exploit code, allegedly originating from the USA’s National Security Agency (NSA).

So we are beginning to see a muddying of the waters between what are likely to be nation state campaigns and what is used by organised crime for its money-raising efforts. Even the lines between nation state and organised crime may be blurred, as the two most recent global ransomware events have been attributed to various countries.

Who did it ?

Be mindful that attribution is not an exact science. Clues may be left to confuse and misdirect, and this is definitely an area where plausible deniability reigns.

So what does this all mean? Apart from plenty of mystery, intrigue, 007 and general dodginess all round.

How does it affect me and you ?

For the population at large and for commerce, it means further disruption to our digital environment from a myriad of sources. Whether an attack is made to demonstrate technical control for political purposes or for monetary gain, the fallout or collateral damage is likely to affect the rest of us.

What can we do about it ?

Many of these exploits take advantage of poor cyber hygiene. If basic guidelines on the use of internet-based services, system maintenance and configuration were followed, organisations would be significantly less susceptible to these attacks. Even if an organisation were to succumb to a cyber attack, recovery would be significantly quicker and less damaging to operations.

Follow-up – “How to protect yourself from malware”

References

List of viri from Comodo

Wannacry – Symantec

Wannacry – The Independent

Petya or not ???? – Reuters

Destructionware, not ransomware – The Verge

More on NotPetya – TechCrunch

 

A for Availability – BCP and DR failures

DR/BCP effectiveness

A look into the A part of CIA (confidentiality, integrity, availability), namely A for Availability.

Availability is often overlooked and not always tested with rigour. It is closely coupled with change management, resource management and long term strategy.

Although this article talks about a power outage, the shortcomings of the recovery and the lessons learnt can be applied to all types of outages.

Recent BCP failures

Delta Air Lines suffered an outage in August 2016 that cost them 150M USD. This was followed by another two-day outage on the 29th and 30th January 2017, with 280 flights cancelled.

British Airways (BA) had a serious outage which coincided with the late May public holiday, 28th May 2017. Power problems at the data centre resulted in delays and cancellations that were still being felt 3 days later, affecting 75000 travellers. Shares in BA’s parent company, IAG (International Airlines Group), fell 4%. Compensation is expected to cost in the region of 150M GBP.

Image: Woman waiting for her flight at Heathrow T5, London, UK. (c) Reuters

BCP/DR failure in general

I’ve seen quite a few similar failures in various enterprises, large and small. They range from poor application of the disaster recovery (DR) process to the use of out-of-date processes.

The situation is usually exacerbated by poor communication between operational staff and decision makers. Often people on the ground are afraid of invoking business continuity plans (BCP) or DR plans.

There are many more contributing factors, such as unfamiliarity with the restore process, key people being away, changes made (such as system updates) whilst key people are away, and unauthorised changes or poor configuration made to the failover or synchronisation mechanism of production systems.

Wonder what happened at BA ?

What follows is reading between the lines and speculation on my part. Take it with a pinch of salt, as I don’t have an inside view of IT operations there.

How it started

The catalyst for the outage was the power failure, which was recovered from in minutes. Assuming a standard robust architecture, it should have the resilience to withstand system outages to N+1, N being the number of sites or systems needed to carry the load and the +1 a spare that can take over if one fails.
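To make the N+1 idea concrete, here is a minimal sketch of a capacity check. It treats N as the minimum number of units needed to carry the peak load, which is my assumption about how the term is being used, and the function names and figures are made up for illustration. It is not a description of any real airline's infrastructure.

```python
def units_needed(unit_capacity, peak_load):
    """N: the minimum number of units needed to carry the peak load."""
    # Ceiling division without importing math.
    return -(-peak_load // unit_capacity)


def survives_failures(unit_capacity, units_installed, peak_load, failures=1):
    """Return True if the units left after `failures` can still carry the load."""
    remaining = units_installed - failures
    return remaining > 0 and remaining * unit_capacity >= peak_load


def is_n_plus_one(unit_capacity, units_installed, peak_load):
    """An installation is N+1 if it has at least one unit more than N."""
    return units_installed >= units_needed(unit_capacity, peak_load) + 1
```

With hypothetical units of capacity 100 and a peak load of 250, N is 3, so four installed units survive the loss of one while three do not.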

Resilience and Diversity

This type of operation will have geographical and service diversity, so another instance of the database is running elsewhere and **should** be run in a failsafe synchronised mode, so that both instances are updated at the same time and a change is committed only when both agree.
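The "commit only when both agree" behaviour can be sketched as a much simplified two-phase commit. The Replica class and synchronous_write function below are hypothetical names of my own, and real database replication handles timeouts, retries and durability that this toy version ignores.

```python
class Replica:
    """A toy database instance that can stage a change before committing it."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}
        self.pending = None

    def prepare(self, key, value):
        """Stage the change and vote yes, or vote no if this site is down."""
        if not self.online:
            return False
        self.pending = (key, value)
        return True

    def commit(self):
        """Make the staged change permanent."""
        key, value = self.pending
        self.data[key] = value
        self.pending = None

    def abort(self):
        """Throw away the staged change."""
        self.pending = None


def synchronous_write(replicas, key, value):
    """Commit on every replica, or on none if any replica cannot prepare."""
    votes = [replica.prepare(key, value) for replica in replicas]
    if all(votes):
        for replica in replicas:
            replica.commit()
        return True
    for replica in replicas:
        replica.abort()
    return False
```

The trade-off is visible even in this sketch: while one site is down, no write succeeds at all unless the surviving site is allowed to carry on alone.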

When one of the databases goes down, due to communication or power issues, the other should be able to detect this and run independently as the sole primary. When the failed database comes back online, it should be able to determine that it has been offline and start a re-synchronisation with the other one.
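As a rough sketch of that failover and re-synchronisation, assume each instance keeps an ordered change log with sequence numbers, so a returning instance can ask for everything after the last entry it has. The Node class and the write and resynchronise functions are illustrative names only, and the sketch deliberately ignores harder problems such as both sites believing they are the primary.

```python
class Node:
    """A toy database instance with an ordered change log."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.log = []   # ordered (sequence, key, value) entries
        self.data = {}

    def last_sequence(self):
        """Sequence number of the newest change this node has applied."""
        return self.log[-1][0] if self.log else 0

    def apply(self, entry):
        """Record a change in the log and apply it to the data."""
        _sequence, key, value = entry
        self.log.append(entry)
        self.data[key] = value


def write(nodes, key, value):
    """Apply a change to every node that is currently online."""
    live = [node for node in nodes if node.online]
    if not live:
        raise RuntimeError("no node available")
    sequence = max(node.last_sequence() for node in live) + 1
    for node in live:
        node.apply((sequence, key, value))
    return sequence


def resynchronise(returning, survivor):
    """Replay the changes the returning node missed, then bring it online."""
    last_seen = returning.last_sequence()
    missed = [entry for entry in survivor.log if entry[0] > last_seen]
    for entry in missed:
        returning.apply(entry)
    returning.online = True
    return len(missed)
```

If the returning node cannot work out what it missed, or the log on the survivor cannot be trusted, this replay is not possible and the only option left is a restore from backup.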

Trust issues

Now they couldn’t trust the backup instance, which implies that the synchronisation wasn’t working or was switched off. They had to rely on a full restore from a backup that was likely to be at least 1 day old, so all the changes made after the backup were lost, which was guaranteed to cause subsequent operational issues.
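The cost of falling back to a full restore can be illustrated with a small calculation of the gap between the last backup and the failure. The timestamps and change descriptions below are entirely made up, and the function names are my own.

```python
from datetime import datetime


def recovery_point_gap(last_backup, failure_time):
    """Return the window of time whose changes a restore would lose."""
    return failure_time - last_backup


def lost_changes(changes, last_backup, failure_time):
    """Return the changes made after the backup and up to the failure.

    `changes` is a list of (timestamp, description) tuples.
    """
    return [c for c in changes if last_backup < c[0] <= failure_time]


# Hypothetical example: a nightly backup and a failure the next morning.
backup_taken = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 2, 9, 30)
example_changes = [
    (datetime(2024, 1, 1, 1, 0), "booking made before the backup"),
    (datetime(2024, 1, 1, 12, 0), "booking made after the backup"),
    (datetime(2024, 1, 2, 9, 0), "seat change just before the failure"),
]
print(recovery_point_gap(backup_taken, failure))
print(lost_changes(example_changes, backup_taken, failure))
```

Everything in that window has to be re-entered, reconciled or written off, which is why a working synchronised instance matters so much more than the backup.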

Was it a loss of corporate memory ?

Support systems and corporate memory do have a bearing on the effectiveness of non-BAU (business as usual) tasks. When there are no problems, anyone or any team can look after a system. It’s a different matter when things don’t go to plan; this is when in-depth knowledge and experience count. Outsourcing has a reputation for not being able to capture or retain such valuable information, and in reality only a proportion of knowledge is reflected in documentation.

When people leave, they always take some of that experience and knowledge with them.

Of course there are the black-swan type events, the one in a million incidents that occur and for those, recovery will still take longer than expected.

(For the power engineers: I’ve seen the root cause analysis of a power failure where the windings of one of the DRUPS units shorted, causing a cascade failure of the PDUs downstream. Although the other DRUPS units were operating fine, the protection units in the PDUs detected a fault condition and disconnected the rest of the supplies. This was a situation that the client couldn’t really plan for, but the recovery was hampered by organisational shortcomings, even though the DR had been tested less than 12 months prior. As a result, what should have been a 15 minute outage (the faulty circuit was isolated and the other PDUs put online manually) lasted half a day.)

How to mitigate these potential risks ?

With robust BCP and DR:

  • Well defined and up-to-date business and operation processes
  • Appropriate level of resilience built into the human resources, infrastructure, processes and culture
  • Clear channels of decision making which allow for decisions to be made on the ground
  • Careful and considered outsourcing
  • Current and accurate business impact analysis (BIA)
  • Properly rehearsed and updated playbooks (comprehensive and regular testing)

Further reading and sources on the British Airways IT outage

https://www.ft.com/content/15cab698-4372-11e7-8519-9f94ee97d996

http://www.independent.co.uk/travel/news-and-advice/british-airways-system-shutdown-heathrow-gatwick-your-riights-a7760536.html

http://www.japantimes.co.jp/news/2017/05/28/business/global-british-airways-systems-failure-creates-travel-chaos-power-issue-blamed/#.WS1SqTOZNE4

https://www.theguardian.com/business/2017/may/30/british-airways-ba-owner-drops-value-it-meltdown