
Testing Infrastructure as Code (IaC): The Definitive Guide 2020

By Wulf Schiemann, 13 March 2020

In this blog post we’re going to explain if and how Infrastructure as Code should be tested. We’ll illustrate 5 examples with Terraform – the tool we use here at meshcloud – and tell you what to look for in IaC test tooling.

Here are the topics we will touch on. Let’s dive right in:

1. What is Infrastructure as Code?

2. Do I Even Need to Test IaC?

3. The IaC Testing Usefulness Formula

4. The Developer Effect

5. The 3 Essential Types of Testing Infrastructure as Code

6. 4 Practical Examples for IaC Testing

7. The 3 Key Factors for Choosing Your Test Tooling

What Exactly is Infrastructure as Code?

To make sure we are starting from the same point: What do we mean by Infrastructure as Code (IaC)? IaC describes the specification of required infrastructure (VMs, storage, networking) in text form. We define a target state, which can be easily adapted, duplicated, deleted and versioned. IaC relies on modern cloud technologies and enables a high degree of automation. With these new trends, infrastructure development has become much more similar to application development. This raises an interesting question: should you test infrastructure code, and if so, how?

Do I Need to Test IaC at All?

The question is more relevant than you might think. Modern IaC tools like Terraform or Pulumi already have a lot of checks built in and respond with detailed error messages. To better assess how useful IaC testing is for you, consider these factors:

  1. Number of components (e.g. VMs, managed services, load balancers, etc.)
  2. Number of environments (e.g. dev, stg, prod)
  3. Number of rollouts that affect infrastructure (daily releases vs. monthly)

The IaC Testing Usefulness Formula

We believe all three factors (components, environments and number of rollouts) are equally important, which results in our IaC Testing Usefulness Formula:

Usefulness of IaC Testing = Components x Environments x Rollouts

The Developer Effect

In practice, we observed another effect that significantly increases the number of rollouts. Lovingly dubbed the "devEffect" at meshcloud, it occurs during the initial set-up of IaC or the modification of existing components. Because new IaC is often developed by trial and error, developers roll out their changes around 100 times more often. For this reason alone, it can be worth writing IaC tests.
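To make the formula and the devEffect concrete, here is a minimal sketch in Go. The project numbers (10 components, 3 environments, 12 rollouts per year) are made up for illustration, and the result is only a relative score for comparing projects, not a measurement.

```go
package main

import "fmt"

// usefulness applies the formula from this post:
// components x environments x rollouts.
func usefulness(components, environments, rollouts int) int {
	return components * environments * rollouts
}

func main() {
	// Hypothetical project: 10 components, 3 environments (dev, stg, prod),
	// 12 infrastructure rollouts per year.
	base := usefulness(10, 3, 12)

	// devEffect: trial-and-error development multiplies the number of
	// rollouts by roughly 100.
	withDevEffect := usefulness(10, 3, 12*100)

	fmt.Println(base, withDevEffect)
}
```

With these assumed numbers the program prints `360 36000`: the devEffect alone raises the score by two orders of magnitude.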

The 3 Essential Types of Testing Infrastructure as Code

  1. Static and local checks
  2. Apply and destroy tests
  3. Imitation tests

1. Static and Local Checks

Strictly speaking, these are not tests but simple checks. However, we still want to include them in this list, as they are an easy and fast way to ensure the correct set-up of your infrastructure.

Idea: Analyze configurations and dependencies before execution.

Goal: Fast detection of static errors. Errors that occur dynamically during execution are not covered.

Examples: Completeness check, field validation, reference analysis.

But everything is better with code examples, so let’s have a look. Consider the following terraform file to create a small VM.

resource google_compute_instance test {
  name = "hello-terratest"

  boot_disk {
    initialize_params {
      image = "ubuntu-1804-lts"
    }
  }

  network_interface {
    network = "default"
    access_config {}
  }
}

The alert reader already noticed the absence of the machine_type field. You didn’t? That is exactly our point: You don’t have to. Terraform automatically informs us about the missing field.

Example of a Terraform Error Message: Missing required argument. The argument "machine_type" is required, but no definition was found.
Terraform nicely outputs an error message.

This can be taken a step further in the form of autocomplete. Using the Terraform language server with VSCode, for example, the editor automatically prompts you to enter the missing field and shows its type.

Screenshot showing the Terraform autocomplete feature in VSCode.
Using the Terraform language server helps avoid errors.

Additionally, we can check for static dependencies. Consider the following Terraform code, and assume it is the only code in the module. It references a Google compute target pool that is not declared anywhere in that context. Terraform reports the dangling reference before anything is rolled out, which is a total lifesaver in projects with hundreds of dependencies.

resource google_compute_forwarding_rule test {
  name   = "hello-terratest"
  target = google_compute_target_pool.test.self_link
}

Screenshot depicting resulting terraform error: Reference to undeclared resource.
Terraform identifies the undeclared resource.
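To illustrate the idea behind such a reference analysis, here is a toy sketch in Go. It is not how Terraform implements the check. It only shows the principle: collect the declared resource addresses, then flag every reference that points at none of them. The function name `undeclared` and the flat string addresses are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// undeclared returns, in sorted order, every referenced address that has
// no matching declaration.
func undeclared(declared, references []string) []string {
	known := make(map[string]bool, len(declared))
	for _, d := range declared {
		known[d] = true
	}

	var missing []string
	for _, r := range references {
		if !known[r] {
			missing = append(missing, r)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// The module above declares only the forwarding rule ...
	declared := []string{"google_compute_forwarding_rule.test"}
	// ... but refers to a target pool that is declared nowhere.
	references := []string{"google_compute_target_pool.test"}

	fmt.Println(undeclared(declared, references))
}
```

Because the check needs nothing but the configuration text, it can run in seconds and without cloud credentials, which is what makes static checks so cheap.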

While static checks are already a big step up from traditional infrastructure deployments, they do not account for dynamic fields and dependencies. This is where our next category, Apply and Destroy Tests, comes into play.

2. Apply and Destroy Tests

With Apply and Destroy Tests we can go one step further and also identify dynamic errors.

Idea: Roll out the infrastructure for a short time, test it and destroy it again immediately afterwards.

Goal: Check dynamic fields and dependencies.

Examples: IP addresses, IDs, randomly generated passwords.

For this type of test we are currently working with Terratest. Terratest is a collection of Go libraries that simplify the interaction with IaC tools and well-known cloud providers. For the following example we are using its terraform module. We can see that Terratest integrates easily with Terraform and is able to send commands to it and extract their output.

package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestApply(t *testing.T) {
    t.Parallel()

    tfOptions := &terraform.Options{
        TerraformDir: "./terraform/apply_test",
    }

    defer terraform.Destroy(t, tfOptions)

    terraform.InitAndApply(t, tfOptions)

    // further validations go here

}
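As a sketch of what a "further validation" could look like, the helper below checks a dynamic value that only exists after the rollout: a public IP address. It uses only the Go standard library. In a Terratest test, the input would typically come from something like `terraform.Output(t, tfOptions, "public_ip")`, which assumes the module under test defines an output with that name. The helper name `validatePublicIP` is made up for this example, and `IsPrivate` requires Go 1.17 or newer.

```go
package main

import (
	"fmt"
	"net"
)

// validatePublicIP checks that value is a syntactically valid IP address
// that is not in a private, loopback, or unspecified range.
func validatePublicIP(value string) error {
	ip := net.ParseIP(value)
	if ip == nil {
		return fmt.Errorf("%q is not a valid IP address", value)
	}
	if ip.IsPrivate() || ip.IsLoopback() || ip.IsUnspecified() {
		return fmt.Errorf("%q is not a public IP address", value)
	}
	return nil
}

func main() {
	// Stand-in values for what a Terraform output might return.
	for _, candidate := range []string{"34.107.12.8", "10.0.1.2", "not-an-ip"} {
		fmt.Printf("%s: %v\n", candidate, validatePublicIP(candidate))
	}
}
```

Inside the test, a failed validation would be reported with `t.Fatalf` so that the deferred destroy still runs.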

Up- and downsides to consider:

+ Helps with complex resources with many dependencies
+ Test scenario easily transferable to other resources
- Long duration of the test in relation to the benefit

3. Imitation Tests

Idea: Imitate required operations of the application on an infrastructure level.

Goal: Check if infrastructure is configured correctly.

Examples: Network test, I/O tests.

Sketch of a simple network test setup.
Example of a network test.

    package test

    import (
        "testing"
        "time"

        "github.com/gruntwork-io/terratest/modules/gcp"
        "github.com/gruntwork-io/terratest/modules/ssh"
        "github.com/gruntwork-io/terratest/modules/terraform"
    )

    var gcpProject string = "terratest-talk"
    var instanceName string = "public-1"

    func TestConfig(t *testing.T) {
        t.Parallel()

        tfOptions := &terraform.Options{
            TerraformDir: "./terraform/config_test",
        }

        t.Cleanup(func() {
            terraform.Destroy(t, tfOptions)
        })

        terraform.InitAndApply(t, tfOptions)

        pubVM := gcp.FetchInstance(t, gcpProject, instanceName)

        // generate dynamic ssh key and temporarily add it to our machine,
        // so we can log in to execute the test
        keyPair := ssh.GenerateRSAKeyPair(t, 2048)
        pubVM.AddSshKey(t, "terratest", keyPair.PublicKey)
        host := ssh.Host{
            Hostname:    terraform.Output(t, tfOptions, "public_ip"),
            SshUserName: "terratest",
            SshKeyPair:  keyPair,
        }
        // we have to wait for AddSshKey to take effect. A fixed sleep is fragile; don't do this in production!
        time.Sleep(3 * time.Second)

        // Check that we can access transit
        _, err := ssh.CheckSshCommandE(t, host, "ping -c 3 10.0.1.2")
        if err != nil {
            t.Fatalf("Could not connect to transit from public: '%v'", err)
        }

        // Check that we cannot access private
        _, err = ssh.CheckSshCommandE(t, host, "ping -c 3 10.0.2.2")
        if err == nil {
            t.Fatalf("Could connect to private from public, which is not allowed")
        }

    }

Up- and downsides you want to consider:

+ Simulation of real scenarios at infrastructure level
+ Easy integration into apply and destroy tests
- Often requires temporarily disabling security measures for the execution of the test ⇒ requires additional tests to ensure the security measures were not disabled permanently

4 Practical Examples Suited for IaC Testing

  1. Network structures (see Imitation test above)
  2. VM-Configs (e.g. directory structures, filesystem permissions)
  3. Cloud specific resources (e.g. policies, firewalls, managed load-balancers)
  4. Kubernetes Rollouts (e.g. number of pods after deployment)

The 3 Key Factors for Choosing Your Test Tooling

  1. Good integration into the IaC provider itself
  2. Integration with low-level tooling (HTTP client, SSH, OpenSSL, …)
  3. Adjustable retries and retry intervals

Let us know what you think and don’t forget to share this guide!

To learn more about the meshcloud platform, please get in touch with our sales team or book a demo with one of our product experts. We look forward to hearing from you.

Written by Simon Bein and Wulf Schiemann