Amazon Web Services (AWS) offers Service Level Agreements (SLAs) for each of their services, which typically promise very high availability, often upwards of 99.9% uptime. That’s pretty good, and probably more than enough assurance for the average customer. And yet, Amazon is not immune to natural disasters or catastrophic human errors. In February 2017, one such human error led to a widespread S3 outage in the us-east-1 region, which then led to the cascading failure of many other AWS services that depend on S3. While this outage only lasted several hours, it isn’t hard to imagine a different scenario (targeted attack or natural disaster) that could lead to a much longer time to recovery. Again, depending on your tolerance for total downtime, it may or may not be worth the time and expense to prepare for such an eventuality.
One of Simple Thread’s larger clients operates multiple mission-critical systems that can afford only minimal downtime and absolutely no data loss. In response to this, we recently decided to move one such AWS-hosted system to a multi-region, pilot light architecture, with a standby system at the ready in the case of a prolonged outage in the primary region. The upgrade took a while to build and test, and I won’t go into all of the details here. However, there were a few key design decisions, as well as implementation tips and gotchas that might be useful to others looking to build out a similar failover system.
Modules Are Your Friend
At the time, the project existed only as a single-region system, deployed in a staging environment and maintained in a Terraform IaC code repository. The goal of this initiative was to augment the codebase in order to build a new production system, which would include a complete mirror of existing resources in a separate AWS region. Thus, most of our Terraform resources just needed to be cloned to the new failover region. At first glance, it may seem easiest to simply copy the bulk of the Terraform code and change the resources’ region settings. But that’s a lot of code duplication, and not something we wanted to sign up to maintain. Instead, we moved to using submodules that could each be called with their own AWS provider. For the resources that needed to be cloned, we created a reusable multi-region module. Then we configured AWS providers for each region and called the module twice: once with the primary provider, and again with the failover provider.
provider "aws" {
region = var.primary_region
alias = "primary"
}
provider "aws" {
region = var.failover_region
alias = "failover"
}
module "primary-region" {
source = "./multi-region-module"
…
providers = {
aws = aws.primary
}
}
module "failover-region" {
source = "./multi-region-module"
…
providers = {
aws = aws.failover
}
}
Global Services
Of course, some AWS services are global and not associated with a particular region. One popular misconception is that S3 is a global service. It is true that the bucket namespace is global; an S3 bucket’s name must be unique across all existing S3 buckets in all regions. However, when creating a new S3 bucket, you do specify a region, and that is where all of the bucket’s objects will be held. This has implications both for data availability in the event of an outage, and also data retrieval latency, no different from any other regional service. So if your application has critical data in one or more S3 buckets, be sure to also clone these resources to the failover region.
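The article doesn’t show its bucket definitions, but as a rough sketch (the uploads name, the region_suffix variable, and the file path are all hypothetical), a bucket declared inside the multi-region module gets created once per region simply because the module is called with both providers. (Buckets that replicate to one another got special treatment; see Data Replication below.)

multi-region-module/storage.tf
resource "aws_s3_bucket" "uploads" {
  # Bucket names are globally unique, so include something region-specific
  bucket = "${var.environment}-uploads-${var.region_suffix}"
}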
In our case, our single-region module consisted primarily of IAM and CloudFront resources. This module created only the global resources, and so it was only called once, using the primary region’s provider.
module "single-region" {
source = "./single-region-module"
…
providers = {
aws = aws.primary
}
}
Easy enough, right? But these global resources don’t exist in a vacuum. In most instances, they are created so that they can be referenced by other resources, and many of those resources were in our multi-region module. Resources in one module can’t be referenced directly from another module, so how do you use them? The answer lies in module input variables and output values.
Inter-Module Communication
This point is probably best explained with an example. Let’s say we want to create an aws_ecs_task_definition resource in our multi-region module. This task definition requires an execution_role_arn attribute, and that role is an IAM resource that exists in our single-region module.
single-region-module/ecs-iam.tf
resource "aws_iam_role" "ecs_execution_role" {
assume_role_policy = <<EOF
{
Policy text…
}
EOF
}
First we have to “export” the information we need from the single-region module. Note that we are only outputting the ARN attribute of the resource, since that’s all we need. If you need multiple attributes, you can also output the entire resource and reference individual attributes in the downstream module:
single-region-module/outputs.tf
output "aws_iam_role_ecs_execution_role_arn" {
value = aws_iam_role.ecs_execution_role.arn
}
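As a minimal sketch of that whole-resource option (the output name here is my own), the downstream module could then reference individual attributes such as var.aws_iam_role_ecs_execution_role.arn or .name:

output "aws_iam_role_ecs_execution_role" {
  # Export the entire role object rather than a single attribute
  value = aws_iam_role.ecs_execution_role
}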
Then in the root module, we need to “catch” this output value and pass it into the multi-region module:
main.tf (root module)
module "primary-region" {
source = "./multi-region-module"
# Value from single region module
aws_iam_role_ecs_execution_role_arn = module.single-region.aws_iam_role_ecs_execution_role_arn
…
}
You can also pass that same value into the failover region module the same way. Next we need to “import” this value into the multi-region module:
multi-region-module/main.tf
# Input from single region module
variable "aws_iam_role_ecs_execution_role_arn" {}
Finally, we can reference this variable in the multi-region task definition:
multi-region-module/application.tf
resource "aws_ecs_task_definition" "app_server" {
family = "${var.environment}-app-server"
execution_role_arn = var.aws_iam_role_ecs_execution_role_arn
…
}
And it works the same way going the other direction. You can output values from both the primary and failover modules and pass both into the single-region module:
main.tf (root module)
module "single-region" {
source = "./single-region-module"
# Value from primary region module
aws_s3_bucket_uploads_primary = module.primary-region.aws_s3_bucket_uploads
# Value from failover region module
aws_s3_bucket_uploads_failover = module.failover-region.aws_s3_bucket_uploads
…
}
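For completeness, here’s roughly what the other side of that hand-off could look like. The bucket resource name is hypothetical, but the output and variable names match the root module call above:

multi-region-module/outputs.tf
output "aws_s3_bucket_uploads" {
  value = aws_s3_bucket.uploads
}

single-region-module/main.tf
# Inputs from the primary and failover region modules
variable "aws_s3_bucket_uploads_primary" {}
variable "aws_s3_bucket_uploads_failover" {}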
Clearly, Terraform is not just applying your configuration one module at a time, since we can have data dependencies between modules in both directions. In my experience, Terraform does an excellent job of sorting out the order of dependencies and handling them seamlessly, but on occasion you may need to add a depends_on hint to give Terraform a helping nudge. Just keep an eye out for anything that is going to cause a blatant cyclic dependency issue.
Data Replication
There was one area where these inter-module dependencies became a little harder to overcome, and that was in the realm of data replication. Our application holds data in three places: an Aurora Postgres RDS cluster, an ElastiCache replication group, and a few S3 buckets. Each of these resources needed to be cloned to the failover region, so it seemed that the resources belonged in the multi-region module. But they also needed to replicate to one another (the primary region resource replicating data to its failover region counterpart), and Terraform seemed to have issues when dealing with these tightly coupled resources that were generated from the same module code but with different providers. So instead, we moved these resources and their replication strategies into the root module, configuring each with its own region-specific provider.
For the Postgres database, since we were already configured for an Aurora cluster, it was a natural fit to use Aurora Global Database for replication. The most important pieces of this configuration are shown below:
- The aws_rds_global_cluster doesn’t have the source_db_cluster_identifier specified.
- The primary aws_rds_cluster has its global_cluster_identifier pointed at the global cluster ID.
- The failover aws_rds_cluster also has its global_cluster_identifier pointed at the global cluster ID.
- The failover aws_rds_cluster has its replication_source_identifier pointed at the primary cluster.
- Finally, the failover aws_rds_cluster depends on the primary cluster instance.
## Global Database
resource "aws_rds_global_cluster" "api_db_global" {
  provider = aws.primary
  …
}

## Primary Cluster
resource "aws_rds_cluster" "api_db" {
  provider                  = aws.primary
  global_cluster_identifier = aws_rds_global_cluster.api_db_global.id
  …
}

resource "aws_rds_cluster_instance" "api_db" {
  provider = aws.primary
  …
}

## Failover Cluster
resource "aws_rds_cluster" "api_db_failover" {
  provider                      = aws.failover
  global_cluster_identifier     = aws_rds_global_cluster.api_db_global.id
  replication_source_identifier = aws_rds_cluster.api_db.arn
  …

  depends_on = [
    aws_rds_cluster_instance.api_db
  ]
}

resource "aws_rds_cluster_instance" "api_db_failover" {
  provider = aws.failover
  …
}
Keep in mind that this configuration was designed to build a new production system from scratch, with no existing database to begin with. If instead you wish to update a deployed system with an existing database cluster, or create the primary database from an existing snapshot, the configuration would be slightly different, as shown below:
- The aws_rds_global_cluster would have its source_db_cluster_identifier pointed at the primary cluster ID.
- The primary aws_rds_cluster wouldn’t have the global_cluster_identifier specified.
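We didn’t need this variant ourselves, but as a rough sketch (reusing the resource names from above; note that the provider expects the existing primary cluster’s ARN for source_db_cluster_identifier), it would look something like this:

## Global Database (starting from an existing primary cluster)
resource "aws_rds_global_cluster" "api_db_global" {
  provider                     = aws.primary
  # The existing cluster becomes the primary member of the global cluster
  source_db_cluster_identifier = aws_rds_cluster.api_db.arn
  …
}

## Primary Cluster (no global_cluster_identifier here)
resource "aws_rds_cluster" "api_db" {
  provider = aws.primary
  …
}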
For our ElastiCache replication, we went with another AWS end-to-end cross-region replication solution: Global Datastore. This too was a reasonably straightforward Terraform adjustment. We added a new aws_elasticache_global_replication_group resource with its primary_replication_group_id pointed at the primary replication group, and then a failover aws_elasticache_replication_group with its global_replication_group_id pointed at the new global replication group ID.
## Primary Replication Group
resource "aws_elasticache_replication_group" "redis" {
  provider = aws.primary
  …
}

## Global Replication Group
resource "aws_elasticache_global_replication_group" "redis_global" {
  provider                           = aws.primary
  global_replication_group_id_suffix = "${var.environment}-redis-global-datastore"
  primary_replication_group_id       = aws_elasticache_replication_group.redis.id
}

## Failover Replication Group
resource "aws_elasticache_replication_group" "redis_failover" {
  provider                    = aws.failover
  global_replication_group_id = aws_elasticache_global_replication_group.redis_global.global_replication_group_id
  …
}
One thing to note: both the Global Database and Global Datastore services support only a subset of the RDS and ElastiCache instance types. So if you’re currently using rather small instances, you may need to step up to a larger instance type in order to take advantage of these replication services. You can find the supported instance types in the AWS documentation for each service.
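Finally, the S3 buckets that needed to replicate were handled with the same root-module approach. Their configuration isn’t reproduced here, but a minimal cross-region replication sketch (hypothetical bucket and role names; the replication IAM role and its policy are not shown) might look something like this:

## Primary and Failover Buckets
resource "aws_s3_bucket" "uploads_primary" {
  provider = aws.primary
  bucket   = "${var.environment}-uploads-primary"
}

resource "aws_s3_bucket" "uploads_failover" {
  provider = aws.failover
  bucket   = "${var.environment}-uploads-failover"
}

## Versioning (required on both sides for replication)
resource "aws_s3_bucket_versioning" "uploads_primary" {
  provider = aws.primary
  bucket   = aws_s3_bucket.uploads_primary.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_versioning" "uploads_failover" {
  provider = aws.failover
  bucket   = aws_s3_bucket.uploads_failover.id
  versioning_configuration {
    status = "Enabled"
  }
}

## Replication from primary to failover
resource "aws_s3_bucket_replication_configuration" "uploads" {
  provider = aws.primary
  # Versioning must be enabled before replication can be configured
  depends_on = [aws_s3_bucket_versioning.uploads_primary]

  role   = aws_iam_role.s3_replication.arn # hypothetical replication role
  bucket = aws_s3_bucket.uploads_primary.id

  rule {
    id     = "replicate-to-failover"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.uploads_failover.arn
      storage_class = "STANDARD"
    }
  }
}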
Onward to Resilience
And that about sums it up. With these changes, you should have a truly fault-tolerant system in place. So you can reassure your client that the next time an AWS employee fat-fingers a console command, or one of their data centers finds itself underneath eight feet of flood waters, the application’s high availability, mission-critical data, and treasure trove of cat videos will be able to weather the storm.
We’re always interested in hearing how other people are building security and resilience into their systems, so by all means, let us know what you think!