By Yaniv Valik, VP Product Management & Customer Success, Continuity Software:
Like the name says, the “information technology” business thrives on – information. IT people, by definition, are supposed to know what keeps the IT system in an organization going, and what to do if there are disruptions. If they don’t know, who does?
A dangerous question, to be sure, because the numbers don’t reflect the confidence most organizations have in their IT teams. It turns out that in many cases, IT professionals are almost as much in the dark as everyone else in an organization when things don’t work as they are supposed to, and the result is that organizations lose time and, often, large amounts of money as the IT staff desperately tries to fix things or at least find a workaround. According to the Disaster Recovery Preparedness Council, in fact, three out of four companies surveyed believed they were not prepared to deal with an outage. That unpreparedness, according to the study, can cost anywhere from a few thousand dollars to hundreds of millions, with nearly 20% of the companies polled reporting losses ranging from more than $50,000 to over $5 million.
The evidence for this state of affairs, anecdotal and otherwise, abounds. A simple Google search for “IT outages” yields ample evidence that they are an ongoing and expensive problem, and many of these outages are due to a lack of knowledge and understanding on the part of IT personnel.
An October 2016 outage at King’s College London illustrates this perfectly. According to The Register, “hardware failure caused an HP 3PAR storage system, which was propping up the entirety of the UK university’s IT estate, to implode, taking out everything from payroll to shared drive access.” It took some two weeks to restore data and services, but “departments across the university found themselves facing ‘irretrievable data loss’ including archived research material as well as original data which had been funded with public money.”
In a subsequent review of the event, leaked to the newspaper, the consulting firm that the university brought in to figure out what went wrong concluded that “the cause of the backup failure was due to the IT technical team not fully understanding the importance of the tape back ups within the overall backup system and not following the back up procedures completely.” The review also noted that “hand-offs and decisions not properly understood by team.” As for the hardware failure itself, it was traced to a firmware issue for which HP had issued a fix; that fix was never installed.
A facile takeaway from this story is “the IT team messed up,” but that would be the wrong lesson. A closer examination of the report indicates that it was the complexity of the system, including the backup mechanism, that was to blame: “The complex architecture at the time of the incident meant the complete failure of the data storage system required restoration from a variety of sources.” In addition, the report says, “the technology roadmap has a large number of initiatives within it. The volume of IT initiatives is overwhelming and the business stakeholders have found it difficult to help IT to prioritize appropriately.”
The KCL report actually echoes another study, this one by the University of Chicago, which examined hundreds of IT outages that occurred over a seven-year span from 2009 to 2015. The study, which analyzed over 1,000 articles and reports discussing those outages, sought to determine the most common causes of IT outages. The causes included network and power failures (which IT teams probably could not have anticipated), failures due to configuration and upgrade issues, failures due to security and hacking issues, and more.
And while all those causes featured prominently as reasons for failure, the biggest reason was – the unknown. No fewer than 355 of the 597 outages studied had an “unknown” cause, which means that even post-mortem, recovery teams still had no idea of what the real cause of the problem was.
“Post-mortem” is a key term here. When the outage occurred, the cause was most likely “unknown” as well; otherwise, it is reasonable to assume, the IT team would have intervened to correct the issue before it turned into an outage. The conclusions about what caused each outage in the U of C report, just as in the KCL incident, were reached only after the damage was already done.
Seen from that perspective, the issue of outages becomes far more complex. The KCL review is full of phrases like “didn’t know,” “weren’t familiar with,” and “did not understand,” indicating that what’s really causing these outages is a lack of information. But the truth is that fully understanding modern IT systems requires a superhuman effort; it is simply beyond the ken of any individual. According to the review, the KCL IT teams “follow processes mechanically with a narrow focus on their own work,” while “too many initiatives compete for attention, and priorities are not clear.” In this case, according to the review, “modifications were made to the original design (e.g. hardware redundancy configuration) without understanding of consequences.” Given how many things can go wrong, I would contend that the companies that never experience outages are the ones that should be studied, to see how they avoid the countless pitfalls that could cause one.
That’s why essentially all outages are caused by unknown factors. The factor, even if it can eventually be tracked down, will not be “known” until it is far too late, and there is no reason to assume that another unknown factor won’t strike even after the first problem is resolved. The trick is to get ahead of these unknowns before they can impact a system. Had the KCL team been able to anticipate the impact of modifying the backup system, for example, they would most likely have either found a fix to shore up the new weakness their action inadvertently caused, or undone the changes, which were made, no doubt, for good reasons.
That is the whole point: upgrading hardware or software, expanding compute and storage capacity, adding new applications, or modernizing methodologies and technology stacks are all done for “good reasons,” but teams cannot always be aware of the full consequences of their actions. There really is no way for them to be, given the complexity of modern systems. That’s why I would recommend taking the job away from the IT team and giving it to a robot.
The robot in this case is not a Hollywood humanoid with metal skin, but a computer system whose job is to keep an eye on all processes and analyze the impact of change on them. An intelligent analysis system can see three moves ahead, for example when we perform a configuration change in IT. Will this break a critical vendor data protection best practice? Will it concentrate too many resources in a single availability zone that could lead to a service meltdown? Was it tuned incorrectly, leading to prolonged service disruption when real-life workloads increase? Will fixing one issue cause another dozen to crop up, as the compensatory “patches” applied over the years to surrounding components, to work around the then-unknown root cause, now break? The only way to do this is with an outside force, preferably a very intelligent one that can handle lots of data (i.e., an AI/big data solution) and can assess quality and predict the impact of changes, alerting IT teams as to what will happen to C if A and B happen.
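To make that idea concrete, here is a minimal sketch of the kind of analysis such a system performs. This is not Continuity Software’s engine or any vendor’s actual product; the component names, dependency graph, and best-practice rules below are hypothetical, and a real system would discover them automatically rather than hard-code them.

```python
# Minimal sketch of automated change-impact analysis (hypothetical example):
# propagate a proposed configuration change through a dependency graph and
# flag the downstream services and best-practice rules it would affect,
# *before* the change is applied.
from collections import deque

# Hypothetical dependency graph: component -> components that depend on it.
DEPENDENTS = {
    "3par-array-01":  ["backup-service", "shared-drive", "payroll-db"],
    "backup-service": ["tape-library"],
    "payroll-db":     ["payroll-app"],
    "shared-drive":   [],
    "tape-library":   [],
    "payroll-app":    [],
}

# Hypothetical best-practice rules: (rule name, components the rule protects).
RULES = [
    ("storage-firmware-current", {"3par-array-01"}),
    ("backup-chain-intact",      {"backup-service", "tape-library"}),
]

def impacted_components(changed: str) -> set[str]:
    """Breadth-first walk of everything downstream of the changed component."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dep in DEPENDENTS.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def violated_rules(changed: str) -> list[str]:
    """Rules whose protected components fall inside the blast radius."""
    blast = impacted_components(changed) | {changed}
    return [name for name, protected in RULES if protected & blast]

if __name__ == "__main__":
    change = "3par-array-01"  # e.g. a firmware or redundancy-configuration change
    print("Blast radius:", sorted(impacted_components(change)))
    print("Rules to re-check before applying:", violated_rules(change))
```

Even this toy version makes the point: a change to a single storage array reaches payroll, shared drives, and the entire backup chain, which is exactly the kind of consequence the KCL team could not see in advance.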
Any organization responsible for a significant amount of data is going to need a solution like this, and failing to install one means taking huge risks. An anecdote from the days of Hurricane Sandy illustrates the point quite well: IT teams, unsure whether the storm would hit land or veer off into the sea (its path remained uncertain in the days before it made landfall in late October 2012), had a major decision to make. Do we do the smart thing and proactively move workloads over to our disaster recovery site, ensuring that our data and equipment don’t get fried, or do we take a chance that the storm will miss us or not be as bad as anticipated? The same thinking prevails in more recent anticipated disasters, such as the flooding in Sydney, Australia. There are no hard numbers on these, but the fact that major outages were reported indicates that many companies chose the latter, “it can’t be that bad” path. Does that mean they were more comfortable betting on the weather than on their ability to quickly recover from a major problem? If the answer is yes, as it appears to be, then IT has a big problem that needs immediate attention.