Unless you were living under a rock yesterday (February 28), you probably know that Amazon Web Services had a severe outage in one of its Regions. This Amazon outage knocked down some of the largest websites that people use every day. I can imagine there are a lot of IT managers with sore throats this morning from all the yelling they did yesterday. Was the yelling justified? Well, in terms of frustration, yes, but in terms of following guidelines, maybe not. Whether or not this outage affected you or your business directly, it is a great reminder that even with the best infrastructure, things can go wrong. And when they inevitably do, how do you plan to react and handle each part of your lead ecosystem?
This post digs into the specific Amazon outage and touches on what lead companies should do if the services their business relies on go down. If you want to skip the technical stuff, and you are a lead company, skip to the last section.
I am writing this before Amazon has given its full explanation of the outage, so there may be a bit of guesswork here. All that we know right now is that Amazon's S3 storage system died and took many of Amazon's other services along with it. Amazon eats its own dog food, so many of its other services rely on S3 to function. The true disaster here was not that S3 went down, but that S3 went down in all the Availability Zones in the Virginia Region.
Amazon has data centers all over the world. These data centers are organized into Regions. In the United States there are Regions in Northern California, Oregon, Virginia and Ohio. Each Region is made up of multiple Availability Zones. These Availability Zones are basically independent data centers located in different areas of the Region. They are far enough apart to not be taken down by a single disaster (a power failure, for example) but close enough that communication between the Availability Zones is incredibly fast. One Availability Zone going down is manageable. All of them going down at once can be devastating.
Though not yet for our core software, boberdoo.com has been using Amazon Web Services in one way or another since AWS first opened in 2006. For years now, Amazon has been preaching that you can build a fault-tolerant infrastructure by spreading your software across Availability Zones within a single Region. When you launch web servers, you launch them in different Availability Zones. Databases, load balancers, DNS and more are set up the same way. If you follow Amazon's guidelines, one of the Availability Zones can go totally offline and your infrastructure should be able to keep on ticking. It might take some scaling up and reconfiguration, but that can be automated and you should survive.
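To put a rough number on the "scaling up" part, here is a back-of-the-envelope sketch in Python. It assumes load is spread evenly across Availability Zones and that every server is interchangeable; the function name and the server counts are made up for illustration and are not from Amazon's guidelines.

```python
import math


def servers_per_zone(peak_servers_needed: int, zones: int, zones_lost: int = 1) -> int:
    """Servers to run in each zone so the surviving zones still cover peak load.

    Assumes traffic is balanced evenly and servers are interchangeable.
    """
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    # The surviving zones have to carry the whole peak load between them.
    return math.ceil(peak_servers_needed / surviving)


# Hypothetical site that needs 12 servers at peak:
print(servers_per_zone(12, zones=3))  # 6 per zone, 18 total
print(servers_per_zone(12, zones=4))  # 4 per zone, 16 total
```

Under these assumptions, more zones means less idle headroom in each one, which is part of why spreading across zones is cheaper than it first sounds.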
The problem yesterday may be similar to a problem Amazon had a couple of years back when they did a software update and pushed it to all of the Availability Zones at the same time. Simply put, it went badly. Having been to many Amazon conferences and having learned much about the way they do business, I would be a little surprised if this was the same situation, because Amazon wants us all to believe that they will not do things that could take down multiple Availability Zones. They are supposed to be independent. Going back to the screaming IT managers, then: it is possible their employees did things “right” in terms of what Amazon has taught us, but unfortunately sometimes things still go south.
I am not here to say there is a magical way to build a perfectly fault-tolerant system. It is just not possible. If you think yours is, I am sorry, but you are wrong. Virtually every system, app or website depends on so many other companies now for bits and pieces of its infrastructure that you just cannot assume even the best design will not have problems.
Lead companies need to prepare for the day when all or part of their lead business goes down. Things do break, which is why every service provider has no-liability clauses for lost revenue. Neither Amazon nor anyone else is going to be writing you a check for leads a vendor sent you when your system was down or for the money you lost on Google AdWords while your lead capture website was dead. You should budget for business interruption insurance to help cover losses during an outage.
Some applications that are transactional in nature are very hard to balance across different Regions. If you have an application doing thousands of transactions a second, you've got a lot of data to send back and forth across the country if you are using data centers in Oregon and Virginia, for example. However, Amazon does offer cross-Region replication now for most of its services. After Amazon had its issue in 2015 and then again yesterday, I think this is something everyone will reexamine starting today.
For lower-volume systems and non-transactional systems, the options are vast. Your lead gathering website may be an example of this. If your website is fairly static in nature, you could set the website up in multiple Regions and even use multiple providers such as both Amazon Web Services and Microsoft Azure. DNS can be used to balance the traffic regionally, and most of the advanced DNS providers now have features for running health checks that detect down locations and stop sending traffic to them (Amazon Route 53, dnsmadeeasy.com, Cloudflare.com).
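The idea behind those health checks can be sketched in a few lines of Python. This is not how any of the providers above implement it, just an illustration of the logic: probe each copy of the site and only hand out the ones that answer. The URLs, the function names, and the choice to fall back to every location when all of them look down are assumptions for the example.

```python
import urllib.request
from typing import Callable, Iterable, List


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Any failure (DNS error, timeout, refused connection, 5xx) counts as down.
        return False


def pick_targets(endpoints: Iterable[str],
                 check: Callable[[str], bool] = http_ok) -> List[str]:
    """Return the endpoints that pass the health check.

    If nothing passes, return every endpoint rather than an empty list,
    on the theory that the checker itself may be what is broken.
    """
    candidates = list(endpoints)
    healthy = [url for url in candidates if check(url)]
    return healthy or candidates


# Hypothetical copies of the same landing page in two Regions:
# pick_targets(["https://east.example.com/health",
#               "https://west.example.com/health"])
```

A managed DNS service runs checks like this from several locations and updates the records for you, which is usually a better option than running your own checker.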
If your website is up but your lead system is down, your forms should save the lead data locally so it can be reinserted into your lead system at a later time. Does this mess things up with click tracking, conversion pixels, thank you page content, etc.? Yes, of course it does, but it is better than nothing.
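Here is a minimal sketch of that save-and-reinsert idea in Python, assuming your form handler can call a function that posts a lead to your lead system and raises an exception when the system is unreachable. The file name, the function names, and the `send` callback are all hypothetical; this is not boberdoo's API.

```python
import json
from pathlib import Path
from typing import Callable

SPOOL = Path("lead_spool.jsonl")  # hypothetical local file, one JSON lead per line


def submit_lead(lead: dict, send: Callable[[dict], None], spool: Path = SPOOL) -> bool:
    """Try to deliver a lead; if delivery fails, append it to the local spool.

    Returns True if the lead was delivered right away.
    """
    try:
        send(lead)
        return True
    except Exception:
        with spool.open("a", encoding="utf-8") as f:
            f.write(json.dumps(lead) + "\n")
        return False


def replay_spool(send: Callable[[dict], None], spool: Path = SPOOL) -> int:
    """Resend spooled leads and keep any that still fail. Returns the number delivered."""
    if not spool.exists():
        return 0
    remaining = []
    delivered = 0
    for line in spool.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            send(json.loads(line))
            delivered += 1
        except Exception:
            remaining.append(line)
    if remaining:
        spool.write_text("\n".join(remaining) + "\n", encoding="utf-8")
    else:
        spool.unlink()
    return delivered
```

Once the lead system is back, running `replay_spool` on a schedule would drain the file. On a busy site you would also want file locking or a small local database, which this sketch leaves out.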
Finally, you should build a checklist of what to do when X part of your ecosystem goes down. If your website is down, who at your company needs to do what, item by item, so you are minimizing the damage? Which employee is tasked to shut off paid search? Are there vendors sending traffic who need to be called? Is it a day when lead buyers normally do returns and now they can’t? Start your list off with what is going to cost you the most money and work down. If this list is going to cover a lot of people, make sure a vacation policy is set so that a person on that list must have a backup in place while they are gone. While yesterday’s Amazon outage is a somewhat rare event, it does happen and will continue to happen. Plan now for what you will do when it does.
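If you would rather keep that checklist somewhere more structured than a document, one possible shape for it is sketched below in Python. The steps, names, and dollar figures are invented for the example; the point is only that each item has an owner, a backup, and a rough cost so the list can be ordered with the most expensive problem first.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class Step:
    action: str
    owner: str
    backup: str
    cost_per_hour: float  # rough estimate of money lost while this is not done


def runbook(steps: Iterable[Step], on_vacation: Set[str] = frozenset()) -> List[Tuple[str, str]]:
    """Order steps by cost, most expensive first, and assign whoever is available."""
    ordered = sorted(steps, key=lambda s: s.cost_per_hour, reverse=True)
    return [
        (s.action, s.backup if s.owner in on_vacation else s.owner)
        for s in ordered
    ]


# Made-up example for "website is down":
steps = [
    Step("Call vendors sending traffic", "Dana", "Lee", 300.0),
    Step("Pause paid search campaigns", "Sam", "Dana", 800.0),
    Step("Tell lead buyers returns are delayed", "Lee", "Sam", 50.0),
]
for action, person in runbook(steps, on_vacation={"Sam"}):
    print(f"{person}: {action}")
```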
boberdoo.com is a lead distribution software provider for the lead generation industry. We have been building solutions for lead generators, online marketers and advanced publishers for over 17 years.