DR in the Cloud: Guide to Choosing the Right DR Solution

By Zeb Ahmed, CLOUD, XaaS & BCDR Leader – IBM:

Although service providers have offered offsite disaster recovery (DR) solutions for a while now, “Disaster Recovery in the Cloud” is still a relatively new term for some of us. What does it actually mean? In most scenarios, it is a hosting or cloud provider using various replication technologies to offer disaster recovery services on its hosted platform. It may be as simple as a pure IaaS play, where all you get is infrastructure on which to restore your workloads, or it could be a robust end-to-end managed solution where the provider manages not only the replication and failover process but also the DR runbook, and provides application-layer management services as well. In either case, there are a few key considerations when looking for a DR solution and a cloud provider:

Who manages the disaster recovery solution? Do you have IT staff capable of managing the failover and failback process, or would you rather rely on the provider to do so?

This is an important question because the decision may directly affect your RTO (recovery time objective). Beyond the obvious question of whether you have the technical expertise in house, there is another consideration. Service providers are normally hesitant to offer SLAs around aggressive RTOs for a provider-managed DR solution. The reason is simple: if a disaster were to impact multiple customers, they want to ensure they have enough time to fail over all of them. So, if you have the technical expertise, go for a solution that puts you in control of declaring and initiating failover and testing.

In case of a major natural disaster, does the provider have enough resource capacity on hand to meet its obligations to all of its customers?

It is becoming common for providers to offer an elastic or “pay-as-you-go” DR model where you pay for CPU, memory and storage only when you use them. Since DR is compliance driven in many cases, it may make sense to sign up for such a solution. However, it does not make financial sense for service providers to keep 100% of resources reserved for all “pay-as-you-go” customers. This means that in the rare case where all of their customers fail over at once, they may not be able to accommodate everyone. This is something you want to discuss and clarify with the service provider before moving forward.
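
To make that conversation concrete, here is a minimal sketch of the arithmetic behind the question. The function names and the figures in the usage note are hypothetical and purely illustrative; a real provider would have to supply the actual reserved and subscribed capacity.

```python
def reserved_fraction(reserved_vcpus: int, subscribed_vcpus: int) -> float:
    """Fraction of subscribed pay-as-you-go capacity that is physically reserved."""
    if subscribed_vcpus <= 0:
        raise ValueError("subscribed_vcpus must be positive")
    return reserved_vcpus / subscribed_vcpus


def can_absorb(reserved_vcpus: int, subscribed_vcpus: int,
               share_failing_over: float) -> bool:
    """True if the reserve covers the given share of subscribed capacity
    failing over at the same time (share_failing_over is between 0 and 1)."""
    if not 0.0 <= share_failing_over <= 1.0:
        raise ValueError("share_failing_over must be between 0 and 1")
    return subscribed_vcpus * share_failing_over <= reserved_vcpus
```

With made-up numbers, a provider holding 4,000 vCPUs in reserve against 10,000 subscribed has a reserved fraction of 0.4: `can_absorb(4000, 10000, 0.25)` is true, while `can_absorb(4000, 10000, 1.0)` is false. The same check applies to memory and storage.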

SAN, Hypervisor or Agent based replication?

There are solutions available today where you can have SAN based, hypervisor based or agent based replication to the DR site. They all have pros and cons so make sure to weigh them when choosing a service provider and a solution.

Is your environment physical or virtual? If virtual, which hypervisor?

Most of the latest DR replication products focus on providing DR for virtual workloads. There are a few good solutions for physical workloads as well, but in general DR for physical workloads can be more complicated than for virtual ones. DR solutions for physical environments also tend to be more expensive, so my suggestion would be to virtualize as much as possible and, in areas where virtualization isn’t an option, look into native application-based DR functionality such as DAG for Exchange, Log Shipping or Always On for SQL Server, and Data Guard for Oracle.

Moreover, not all hypervisors are supported by all solutions. Make sure to tell your provider which hypervisor you run so they can confirm that it is supported.

RPO/RTO. How old can you afford your data to be? And how much downtime can you really afford during a disaster?

Thanks to innovation, DR solutions have come a long way. With certain solutions, you can even achieve near-zero RPOs. What is most important to HR or Finance may not be the most important workload to the business, so make sure to have those internal business conversations to define your RPO and RTO requirements. This will help your service provider define the right solution for you.

Do you and your provider have adequate bandwidth available to be able to replicate data as quickly as possible to the disaster recovery site?

Make sure to look at your network connectivity reports and see how much bandwidth can be allocated to replicating data to a service provider. To ensure your data is replicated continuously, look at the total amount of data that changes on a daily basis and base your bandwidth capacity on that. This data can be found in your daily backup differential reports. Bandwidth is also crucial for failback, since with some solutions you might have to fail back the whole VMDK instead of just the differential data.
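
As a rough starting point, the sizing described above can be sketched as follows. This is only a back-of-the-envelope estimate: the function names are hypothetical, sizes are in decimal gigabytes, and the 0.7 link efficiency is an assumed placeholder for protocol overhead and competing traffic that you should replace with your own measurements.

```python
def required_mbps(daily_change_gb: float, window_hours: float = 24.0) -> float:
    """Average bandwidth (megabits per second) needed to ship one day's
    changed data within the given replication window."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    bits = daily_change_gb * 8 * 1000**3  # decimal GB to bits
    return bits / (window_hours * 3600) / 1e6


def transfer_hours(size_gb: float, link_mbps: float,
                   efficiency: float = 0.7) -> float:
    """Hours needed to move size_gb over a link, e.g. a full-disk failback.
    efficiency is an assumed usable share of the nominal link speed."""
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits = size_gb * 8 * 1000**3
    return bits / (link_mbps * 1e6 * efficiency) / 3600
```

For example, 200 GB of daily change spread evenly over 24 hours works out to roughly 18.5 Mbps, and failing back 2,000 GB of full disks over a 100 Mbps link at 70% efficiency takes about 63.5 hours. These are averages; change rates are rarely even across the day, so leave headroom for peaks.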

Is the solution capable of failback once your production site is up and running again?

Not all solutions are capable of automated failback. Moreover, some solutions require a considerable amount of downtime while failing back to the production site.

In addition to a full production site failover, are you also trying to protect against data corruption or a partial site failure? Do you need to keep multiple restore points so that you can fail over to a previous point in time?

In my experience in the DR space, I have seen only one or two instances where a customer had to do a full site failover. In most cases, DR failover is initiated because of data corruption, file deletion and the like. In these scenarios it is crucial to have “rollback” functionality that lets you restore to a previous point in time. Make sure to ask for this option when looking into a solution.
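
The idea behind rollback can be illustrated with a small sketch: given the timestamps of the restore points a solution retains and the time at which the corruption is believed to have occurred, pick the most recent restore point taken before it. The function name and the timestamps in the usage note are hypothetical; actual products expose this through their own consoles or APIs.

```python
from bisect import bisect_left
from datetime import datetime
from typing import Optional, Sequence


def latest_clean_restore_point(restore_points: Sequence[datetime],
                               corrupted_at: datetime) -> Optional[datetime]:
    """Return the most recent restore point taken strictly before
    corrupted_at, or None if every retained point is too recent."""
    points = sorted(restore_points)
    # Index of the first restore point at or after the corruption time.
    idx = bisect_left(points, corrupted_at)
    return points[idx - 1] if idx > 0 else None
```

With restore points at 08:00, 12:00 and 16:00 and corruption detected as starting at 13:30, the function returns the 12:00 point. A `None` result means the retention window was too short to reach back past the corruption, which is exactly why the number of restore points kept matters.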

Would you like to be able to test your disaster recovery plan frequently?

Having a DR plan is not enough. It is imperative that you carry out tests to ensure the integrity of the data on the DR site. Look for a solution that offers a non-disruptive testing mechanism so that you can test frequently.

Do you need your replication process to be application aware to ensure application consistency when failed over?

In many cases, companies have applications that need application-aware replication to ensure application consistency. If this applies to you, look into VSS and LVM based replication options.

How will the environment be accessed in the event of a failover? DNS, Networking etc.

This is probably the one question that I get asked almost every time, and for the right reasons. Many IT folks are concerned about how networking will be handled once the environment is failed over. With certain solutions, you can preconfigure all the subnets and the IP schema in the cloud so that your virtual machines keep the same IP addresses when failed over. Also, make sure to have multiple connectivity options available, such as Site-to-Site VPN and Client-to-Site VPN for your remote users. Keep in mind that you may not have access to your office during a disaster, so the ability to connect to the DR site over SSL VPN is crucial. Many customers also like to have a Domain Controller, DNS and Active Directory preconfigured and running live on the DR site so they do not have to worry about hostname resolution and authentication during a live failover event.
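
If you plan to keep the same IP addresses after failover, one simple sanity check is to confirm that every protected virtual machine falls inside a subnet that has been preconfigured at the DR site. The sketch below uses Python's standard `ipaddress` module; the function name, VM names and address ranges are hypothetical examples, not output from any particular product.

```python
import ipaddress
from typing import Dict, List


def unmapped_vms(vm_ips: Dict[str, str], dr_subnets: List[str]) -> List[str]:
    """Return the names of VMs whose IP address is not covered by any
    subnet preconfigured at the DR site."""
    networks = [ipaddress.ip_network(subnet) for subnet in dr_subnets]
    missing = []
    for name, ip in vm_ips.items():
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            missing.append(name)
    return sorted(missing)
```

For instance, with DR subnets `10.10.1.0/24` and `10.10.2.0/24`, a VM at `10.20.5.7` would be reported as unmapped and would need either a new subnet at the DR site or a new address on failover.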

The list of questions goes on, but the goal should really be to spend some time understanding and defining your business and IT disaster recovery requirements. Your disaster recovery solution will only be as good as the solution requirements you pass on to your service provider.

Zeb Ahmed is a Senior Manager, Product Management at IBM, with responsibility for overseeing and managing the Backup and Disaster Recovery portfolio and partner ecosystem for IBM Cloud.