Configuring cloud infrastructure is something you want to automate for anything but trivial systems. My current project involves setting up infrastructure in multiple AWS regions. Each region has the same structure, with environment-specific variances such as CIDR blocks and instance sizes. Managing this across multiple regions manually would be tedious, effort-intensive and error-prone, so we have automated it using HashiCorp's Terraform. Our usage of the tool has gone through a few revisions and will likely continue to evolve. The structure described here comes from our own experience with Terraform as well as reading about other people's experiences. As such it is not unique or our own invention, but it is working for us and gives us a base for improvement.
Our first version had separate Terraform projects for each component that we deployed to AWS. A separate repository stored variable files that configured each project for specific environments. This structure worked well as a first version. There were, however, a number of things that we wanted to improve:
- Terraform has built-in support for saving state to S3 and other data stores, but this is tied to the root module (the directory you run Terraform from). Sharing that directory across multiple environments means you can't use this support.
- Each project being entirely self-contained means that projects need a lot of configuration variables carrying environment details that Terraform already stores in its state files.
- Performing any action requires multiple command line options to be specified, which is tedious to remember and a potential source of errors.
- Having multiple projects means that setting up an environment requires running multiple Terraform projects. It also requires remembering the state of various projects and the order in which they must be run. This is the kind of manual work that Terraform can eliminate.
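To illustrate the first point: in the Terraform 0.7 era, remote state was configured with a CLI command run from inside the root module's directory, so a single shared root module could only point at one state file at a time. A sketch, with an illustrative bucket and key:

```shell
# Run from the root module directory; bucket, key and region are hypothetical.
cd environments/test
terraform remote config \
  -backend=s3 \
  -backend-config="bucket=example-terraform-state" \
  -backend-config="key=test/terraform.tfstate" \
  -backend-config="region=us-east-1"
```

Because the configuration lives with the directory, running several environments from the same directory would mean repointing this on every switch.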
Our second version is largely a restructure of the first that takes advantage of Terraform’s module capability. It combines a number of components that were previously separate projects so that everything required for our service is constructed by one project.
As a Terraform module is essentially a directory with Terraform files in it the new version is mostly a set of conventions for directory structure. It has the form:
- common contains modules for shared variables. We currently have two:
  - ecs contains the AMI IDs for ECS across all the AWS regions, giving us a common place to update them.
  - vpc contains the VPC IDs, subnets and availability zones for the VPCs into which our service is deployed in each region. Ideally we would pull in the state files from the Terraform project that builds the VPCs, but we haven't yet got those available. Until that is the case, this module provides a common location to update those details.
- modules contains the modules that actually build infrastructure. Each may reference modules from common. Links between modules in this directory are declared via input variables and outputs; the actual wiring between modules happens elsewhere, not in this directory.
- environments contains a root module for each environment. A main.tf file here pulls in modules from the modules directory, passing any necessary variables between them. A terraform.tfvars file contains the variables that are specific to each environment (defined in a variables.tf file). We have directories for each environment we configure, plus a test environment used to verify changes before applying them to a production environment.
- Because each environment generally wants to link the modules identically, we have the templates directory, with a template for each collection of modules. Each template directory contains the .tf files required to make it work. These are symlinked into the relevant environment directories so that each is kept consistent. The terraform.tfvars file is, for obvious reasons, not included.
- configure.sh is a script at the base of the structure that sets up the Terraform remote configuration for each environment directory and runs terraform get on each, so they're immediately ready for use. This ensures the remote configuration is consistent and easy to set up.
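Sketched as a directory tree (the names under modules and templates are illustrative, not our actual module names):

```
.
├── common/
│   ├── ecs/              # shared ECS AMI IDs per region
│   └── vpc/              # shared VPC IDs, subnets, availability zones
├── modules/
│   └── service/          # modules that actually build infrastructure
├── templates/
│   └── service/          # shared main.tf / variables.tf for a module collection
├── environments/
│   ├── test/
│   │   ├── main.tf       -> ../../templates/service/main.tf       (symlink)
│   │   ├── variables.tf  -> ../../templates/service/variables.tf  (symlink)
│   │   └── terraform.tfvars        # per-environment values, not symlinked
│   └── production/
└── configure.sh          # sets up remote state and runs terraform get
```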
With this structure, setting up an environment is as simple as running terraform apply in the appropriate environment directory, with no other command line arguments. Terraform then does everything necessary to shape the environment as configured. In our case this includes creating SNS notification topics, SQS queues, CloudWatch metrics and alarms, an ECS cluster and a service deployed to that cluster.
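A minimal sketch of what such an environment root module might look like; the module names, variables and outputs here are hypothetical, not our real ones:

```hcl
# environments/test/main.tf (symlinked from templates/)
module "queues" {
  source      = "../../modules/queues"
  environment = "${var.environment}"
}

module "service" {
  source        = "../../modules/service"
  environment   = "${var.environment}"
  instance_type = "${var.instance_type}"
  queue_url     = "${module.queues.inbound_queue_url}"
}

# environments/test/terraform.tfvars (the one file that is not symlinked)
# environment   = "test"
# instance_type = "t2.micro"
```

The root module is where the wiring happens: it passes one module's outputs into another's input variables, while the modules themselves stay environment-agnostic.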
This structure is a lot harder to use incorrectly and eliminates the need for a custom tool we had for saving state to S3. This is particularly handy as we had compatibility issues between that tool and Terraform 0.7.0, and changing the Terraform structure was easier than fixing the tool. However, there are still a number of things to do.
The service that we are using this to deploy receives data from our primary system via an SQS queue. The current structure creates this queue. Unfortunately this means that we can no longer use Terraform to tear down an environment. This is possible with our service as it does not store data itself. But if we tear down the Terraform structure the queue will go with it and this will break our primary system. The problem here is a failure to understand boundaries and ownership of resources.
In this case the primary system should actually be publishing to an SNS topic that is created by a Terraform project specific to that system. The primary system is unlikely to be deployed by Terraform any time soon but we can certainly use Terraform to manage the resources it uses to interact with other services. Having it publish to an SNS topic created by Terraform would mean we could pull in the Terraform state file for the primary system project. The Terraform project for our service could then create a queue specific to itself and automatically subscribe itself to the SNS topic using the details from the state file.
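A sketch of what that could look like, assuming the primary system's state is stored in S3; the bucket, key and output names are hypothetical:

```hcl
# Pull in the state file from the primary system's Terraform project.
data "terraform_remote_state" "primary" {
  backend = "s3"
  config {
    bucket = "example-terraform-state"
    key    = "primary-system/terraform.tfstate"
    region = "us-east-1"
  }
}

# A queue owned by this service, safe to destroy along with the environment.
resource "aws_sqs_queue" "inbound" {
  name = "our-service-inbound"
}

# Subscribe our queue to the topic the primary system publishes to,
# assuming that project exports the topic ARN as an output.
resource "aws_sns_topic_subscription" "inbound" {
  topic_arn = "${data.terraform_remote_state.primary.notification_topic_arn}"
  protocol  = "sqs"
  endpoint  = "${aws_sqs_queue.inbound.arn}"
}
```

In practice the queue would also need a queue policy allowing the topic to deliver to it, but the key point is the direction of ownership: the topic belongs to the primary system's project, the queue and subscription to ours.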
This kind of structure ensures that the primary service cannot be broken by changing the deployment of our service. The dependency between the two services is represented the same way in the deployment system as it is between the actual services, and we can use Terraform's ability to bring in state files to remove the need to copy resource IDs into variable files.
Working out where the boundaries between Terraform projects should be is also something that we will continue to refine. The current structure combines creating the ECS cluster and the service deployed to that cluster. We are considering consolidating our clusters as an organisation; in that case we would probably want to separate the cluster definition from the service. We would likely keep the SNS and CloudWatch elements used for monitoring with the service, as they are currently specific to it. There isn't really a right answer here; in many respects the boundaries will reflect our own organisational structure and responsibilities.